* Deadlock possibly caused by too_many_isolated.
@ 2010-09-14 23:11 ` Neil Brown
  0 siblings, 0 replies; 116+ messages in thread
From: Neil Brown @ 2010-09-14 23:11 UTC (permalink / raw)
  To: Rik van Riel, Andrew Morton
  Cc: KOSAKI Motohiro, Wu Fengguang, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm


Hi,

 I recently had a customer (running 2.6.32) report a deadlock during very
 intensive IO with lots of processes.

 Having looked at the stack traces, my guess as to the problem is this:

  There are enough threads in direct_reclaim that too_many_isolated() is
  returning true, and so some threads are blocked in shrink_inactive_list.

  Those threads that are not blocked there are attempting to do filesystem
  writeout.  But that is blocked because...

  Some threads that are blocked there hold an IO lock (probably in the
  filesystem) and are trying to allocate memory inside the block device
  (md/raid1 to be precise), which allocates with GFP_NOIO and has a
  mempool to fall back on.
  As these threads don't have __GFP_IO set, they should not really be blocked
  by other threads that are doing IO.  But it seems they are.


  So I'm wondering if the loop in shrink_inactive_list should abort
  if __GFP_IO is not set ... and maybe if __GFP_FS is not set too???
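
  For context, this is the throttle loop as it currently stands,
  reconstructed from the diff hunks in this thread (so a paraphrase, not a
  verbatim quote): every direct reclaimer that trips too_many_isolated()
  just sits in congestion_wait(), regardless of its gfp_mask.

	while (unlikely(too_many_isolated(zone, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}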

  Below is a patch that I'm asking the customer to test.

  If anyone can point out a flaw in my reasoning, suggest any other
  alternatives, provide a better patch, or otherwise help me out here, I
  would greatly appreciate it.

  (I sent this email to the people mentioned in commit:
      commit 35cd78156c499ef83f60605e4643d5a98fef14fd
      Author: Rik van Riel <riel@redhat.com>
      Date:   Mon Sep 21 17:01:38 2009 -0700

          vmscan: throttle direct reclaim when too many pages are isolated already
  
   plus the obvious mailing lists)

Thanks,
NeilBrown

Index: linux-2.6.32-SLE11-SP1/mm/vmscan.c
===================================================================
--- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
+++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 08:38:57.000000000 +1000
@@ -1106,6 +1106,11 @@ static unsigned long shrink_inactive_lis
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
 			return SWAP_CLUSTER_MAX;
+		if (!(sc->gfp_mask & __GFP_IO))
+			/* Not allowed to do IO, so mustn't wait
+			 * on processes that might try to
+			 */
+			return SWAP_CLUSTER_MAX;
 	}
 
 	/*

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-14 23:11 ` Neil Brown
@ 2010-09-15  0:30   ` Rik van Riel
  -1 siblings, 0 replies; 116+ messages in thread
From: Rik van Riel @ 2010-09-15  0:30 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, KOSAKI Motohiro, Wu Fengguang, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm

On 09/14/2010 07:11 PM, Neil Brown wrote:

> Index: linux-2.6.32-SLE11-SP1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
> +++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 08:38:57.000000000 +1000
> @@ -1106,6 +1106,11 @@ static unsigned long shrink_inactive_lis
>   		/* We are about to die and free our memory. Return now. */
>   		if (fatal_signal_pending(current))
>   			return SWAP_CLUSTER_MAX;
> +		if (!(sc->gfp_mask&  __GFP_IO))
> +			/* Not allowed to do IO, so mustn't wait
> +			 * on processes that might try to
> +			 */
> +			return SWAP_CLUSTER_MAX;
>   	}
>
>   	/*

Close.  We must also be sure that processes without __GFP_FS
set in their gfp_mask do not wait on processes that do have
__GFP_FS set.

Considering how many times we've run into a bug like this,
I'm kicking myself for not having thought of it :(

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  0:30   ` Rik van Riel
@ 2010-09-15  2:23     ` Neil Brown
  -1 siblings, 0 replies; 116+ messages in thread
From: Neil Brown @ 2010-09-15  2:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, KOSAKI Motohiro, Wu Fengguang, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm

On Tue, 14 Sep 2010 20:30:18 -0400
Rik van Riel <riel@redhat.com> wrote:

> On 09/14/2010 07:11 PM, Neil Brown wrote:
> 
> > Index: linux-2.6.32-SLE11-SP1/mm/vmscan.c
> > ===================================================================
> > --- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
> > +++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 08:38:57.000000000 +1000
> > @@ -1106,6 +1106,11 @@ static unsigned long shrink_inactive_lis
> >   		/* We are about to die and free our memory. Return now. */
> >   		if (fatal_signal_pending(current))
> >   			return SWAP_CLUSTER_MAX;
> > +		if (!(sc->gfp_mask&  __GFP_IO))
> > +			/* Not allowed to do IO, so mustn't wait
> > +			 * on processes that might try to
> > +			 */
> > +			return SWAP_CLUSTER_MAX;
> >   	}
> >
> >   	/*
> 
> Close.  We must also be sure that processes without __GFP_FS
> set in their gfp_mask do not wait on processes that do have
> __GFP_FS set.
> 
> Considering how many times we've run into a bug like this,
> I'm kicking myself for not having thought of it :(
> 

So maybe this?  I've added the test for __GFP_FS, and moved the test before
the congestion_wait on the basis that we really want to get back up the stack
and try the mempool ASAP.
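
For reference, these are the relevant gfp.h definitions as I remember them
for this kernel; worth double-checking against the actual tree:

	#define GFP_NOIO	(__GFP_WAIT)
	#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
	#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
	#define GFP_IOFS	(__GFP_IO | __GFP_FS)

so "(sc->gfp_mask & GFP_IOFS) != GFP_IOFS" is true for both GFP_NOIO and
GFP_NOFS allocations, which should cover the __GFP_FS case Rik raised as
well.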

Thanks,
NeilBrown



From: NeilBrown <neilb@suse.de>

mm: Avoid possible deadlock caused by too_many_isolated()


If too_many_isolated() returns true while performing direct reclaim we can
end up waiting for other threads to complete their direct reclaim.
If those threads are allowed to enter the FS or IO to free memory, but
this thread is not, then it is possible that those threads will be waiting on
this thread and so we get a circular deadlock.

So: if too_many_isolated() returns true when the allocation did not permit FS
or IO, fail shrink_inactive_list rather than blocking.

Signed-off-by: NeilBrown <neilb@suse.de>

--- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
+++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 12:17:16.000000000 +1000
@@ -1101,6 +1101,12 @@ static unsigned long shrink_inactive_lis
 	int lumpy_reclaim = 0;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
+		if ((sc->gfp_mask & GFP_IOFS) != GFP_IOFS)
+			/* Not allowed to do IO, so mustn't wait
+			 * on processes that might try to
+			 */
+			return SWAP_CLUSTER_MAX;
+
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  2:23     ` Neil Brown
@ 2010-09-15  2:37       ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-09-15  2:37 UTC (permalink / raw)
  To: Neil Brown
  Cc: Rik van Riel, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm

On Wed, Sep 15, 2010 at 10:23:34AM +0800, Neil Brown wrote:
> On Tue, 14 Sep 2010 20:30:18 -0400
> Rik van Riel <riel@redhat.com> wrote:
> 
> > On 09/14/2010 07:11 PM, Neil Brown wrote:
> > 
> > > Index: linux-2.6.32-SLE11-SP1/mm/vmscan.c
> > > ===================================================================
> > > --- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
> > > +++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 08:38:57.000000000 +1000
> > > @@ -1106,6 +1106,11 @@ static unsigned long shrink_inactive_lis
> > >   		/* We are about to die and free our memory. Return now. */
> > >   		if (fatal_signal_pending(current))
> > >   			return SWAP_CLUSTER_MAX;
> > > +		if (!(sc->gfp_mask&  __GFP_IO))
> > > +			/* Not allowed to do IO, so mustn't wait
> > > +			 * on processes that might try to
> > > +			 */
> > > +			return SWAP_CLUSTER_MAX;
> > >   	}
> > >
> > >   	/*
> > 
> > Close.  We must also be sure that processes without __GFP_FS
> > set in their gfp_mask do not wait on processes that do have
> > __GFP_FS set.
> > 
> > Considering how many times we've run into a bug like this,
> > I'm kicking myself for not having thought of it :(
> > 
> 
> So maybe this?  I've added the test for __GFP_FS, and moved the test before
> the congestion_wait on the basis that we really want to get back up the stack
> and try the mempool ASAP.

The patch may well fail the !__GFP_IO page allocation and then
quickly exhaust the mempool.

Another approach may be to let too_many_isolated() use much higher
thresholds for !__GFP_IO/FS and lower ones for __GFP_IO/FS, i.e. to
allow at least nr2 NOIO/FS tasks to be blocked independently of the
IO/FS ones.  Since NOIO vmscans typically complete fast, it will then be
very hard to accumulate enough NOIO processes to actually be blocked.


                  IO/FS tasks                NOIO/FS tasks           full
                  block here                 block here              LRU size
|-----------------|--------------------------|-----------------------|
|      nr1        |           nr2            |


Thanks,
Fengguang

> 
> From: NeilBrown <neilb@suse.de>
> 
> mm: Avoid possible deadlock caused by too_many_isolated()
> 
> 
> If too_many_isolated() returns true while performing direct reclaim we can
> end up waiting for other threads to complete their direct reclaim.
> If those threads are allowed to enter the FS or IO to free memory, but
> this thread is not, then it is possible that those threads will be waiting on
> this thread and so we get a circular deadlock.
> 
> So: if too_many_isolated() returns true when the allocation did not permit FS
> or IO, fail shrink_inactive_list rather than blocking.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> 
> --- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
> +++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 12:17:16.000000000 +1000
> @@ -1101,6 +1101,12 @@ static unsigned long shrink_inactive_lis
>  	int lumpy_reclaim = 0;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
> +		if ((sc->gfp_mask & GFP_IOFS) != GFP_IOFS)
> +			/* Not allowed to do IO, so mustn't wait
> +			 * on processes that might try to
> +			 */
> +			return SWAP_CLUSTER_MAX;
> +
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/* We are about to die and free our memory. Return now. */

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  2:37       ` Wu Fengguang
@ 2010-09-15  2:54         ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-09-15  2:54 UTC (permalink / raw)
  To: Neil Brown
  Cc: Rik van Riel, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm

On Wed, Sep 15, 2010 at 10:37:35AM +0800, Wu Fengguang wrote:
> On Wed, Sep 15, 2010 at 10:23:34AM +0800, Neil Brown wrote:
> > On Tue, 14 Sep 2010 20:30:18 -0400
> > Rik van Riel <riel@redhat.com> wrote:
> > 
> > > On 09/14/2010 07:11 PM, Neil Brown wrote:
> > > 
> > > > Index: linux-2.6.32-SLE11-SP1/mm/vmscan.c
> > > > ===================================================================
> > > > --- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
> > > > +++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 08:38:57.000000000 +1000
> > > > @@ -1106,6 +1106,11 @@ static unsigned long shrink_inactive_lis
> > > >   		/* We are about to die and free our memory. Return now. */
> > > >   		if (fatal_signal_pending(current))
> > > >   			return SWAP_CLUSTER_MAX;
> > > > +		if (!(sc->gfp_mask&  __GFP_IO))
> > > > +			/* Not allowed to do IO, so mustn't wait
> > > > +			 * on processes that might try to
> > > > +			 */
> > > > +			return SWAP_CLUSTER_MAX;
> > > >   	}
> > > >
> > > >   	/*
> > > 
> > > Close.  We must also be sure that processes without __GFP_FS
> > > set in their gfp_mask do not wait on processes that do have
> > > __GFP_FS set.
> > > 
> > > Considering how many times we've run into a bug like this,
> > > I'm kicking myself for not having thought of it :(
> > > 
> > 
> > So maybe this?  I've added the test for __GFP_FS, and moved the test before
> > the congestion_wait on the basis that we really want to get back up the stack
> > and try the mempool ASAP.
> 
> The patch may well fail the !__GFP_IO page allocation and then
> quickly exhaust the mempool.
> 
> Another approach may to let too_many_isolated() use much higher
> thresholds for !__GFP_IO/FS and lower ones for __GFP_IO/FS. ie. to
> allow at least nr2 NOIO/FS tasks to be blocked independent of the
> IO/FS ones.  Since NOIO vmscans typically completes fast, it will then
> very hard to accumulate enough NOIO processes to be actually blocked.
> 
> 
>                   IO/FS tasks                NOIO/FS tasks           full
>                   block here                 block here              LRU size
> |-----------------|--------------------------|-----------------------|
> |      nr1        |           nr2            |

How about this fix? We may need a very high threshold for NOIO/NOFS to
prevent possible regressions.
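
As a rough illustration of the numbers (not a measurement): with an
inactive list of 1000 pages, an IO/FS-capable reclaimer is still throttled
once more than 1000 pages are isolated, while a NOIO/NOFS reclaimer with
the ratio of 8 below is only throttled once more than 8000 pages are
isolated.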

Thanks,
Fengguang
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 225a759..5e116cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1135,6 +1135,7 @@ static int too_many_isolated(struct zone *zone, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
+	int ratio;
 
 	if (current_is_kswapd())
 		return 0;
@@ -1150,7 +1151,9 @@ static int too_many_isolated(struct zone *zone, int file,
 		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
 	}
 
-	return isolated > inactive;
+	ratio = sc->gfp_mask & (__GFP_FS|__GFP_IO) ? 1 : 8;
+
+	return isolated > inactive * ratio;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  2:54         ` Wu Fengguang
@ 2010-09-15  3:06           ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-09-15  3:06 UTC (permalink / raw)
  To: Neil Brown
  Cc: Rik van Riel, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm

On Wed, Sep 15, 2010 at 10:54:54AM +0800, Wu Fengguang wrote:
> On Wed, Sep 15, 2010 at 10:37:35AM +0800, Wu Fengguang wrote:
> > On Wed, Sep 15, 2010 at 10:23:34AM +0800, Neil Brown wrote:
> > > On Tue, 14 Sep 2010 20:30:18 -0400
> > > Rik van Riel <riel@redhat.com> wrote:
> > > 
> > > > On 09/14/2010 07:11 PM, Neil Brown wrote:
> > > > 
> > > > > Index: linux-2.6.32-SLE11-SP1/mm/vmscan.c
> > > > > ===================================================================
> > > > > --- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
> > > > > +++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 08:38:57.000000000 +1000
> > > > > @@ -1106,6 +1106,11 @@ static unsigned long shrink_inactive_lis
> > > > >   		/* We are about to die and free our memory. Return now. */
> > > > >   		if (fatal_signal_pending(current))
> > > > >   			return SWAP_CLUSTER_MAX;
> > > > > +		if (!(sc->gfp_mask&  __GFP_IO))
> > > > > +			/* Not allowed to do IO, so mustn't wait
> > > > > +			 * on processes that might try to
> > > > > +			 */
> > > > > +			return SWAP_CLUSTER_MAX;
> > > > >   	}
> > > > >
> > > > >   	/*
> > > > 
> > > > Close.  We must also be sure that processes without __GFP_FS
> > > > set in their gfp_mask do not wait on processes that do have
> > > > __GFP_FS set.
> > > > 
> > > > Considering how many times we've run into a bug like this,
> > > > I'm kicking myself for not having thought of it :(
> > > > 
> > > 
> > > So maybe this?  I've added the test for __GFP_FS, and moved the test before
> > > the congestion_wait on the basis that we really want to get back up the stack
> > > and try the mempool ASAP.
> > 
> > The patch may well fail the !__GFP_IO page allocation and then
> > quickly exhaust the mempool.
> > 
> > Another approach may to let too_many_isolated() use much higher
> > thresholds for !__GFP_IO/FS and lower ones for __GFP_IO/FS. ie. to
> > allow at least nr2 NOIO/FS tasks to be blocked independent of the
> > IO/FS ones.  Since NOIO vmscans typically completes fast, it will then
> > very hard to accumulate enough NOIO processes to be actually blocked.
> > 
> > 
> >                   IO/FS tasks                NOIO/FS tasks           full
> >                   block here                 block here              LRU size
> > |-----------------|--------------------------|-----------------------|
> > |      nr1        |           nr2            |
> 
> How about this fix? We may need very high threshold for NOIO/NOFS to
> prevent possible regressions.

Plus __GFP_WAIT..

---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 225a759..6a896eb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1135,6 +1135,7 @@ static int too_many_isolated(struct zone *zone, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
+	int ratio;
 
 	if (current_is_kswapd())
 		return 0;
@@ -1150,7 +1151,15 @@ static int too_many_isolated(struct zone *zone, int file,
 		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
 	}
 
-	return isolated > inactive;
+	ratio = 1;
+	if (!(sc->gfp_mask & (__GFP_FS)))
+		ratio <<= 1;
+	if (!(sc->gfp_mask & (__GFP_IO)))
+		ratio <<= 1;
+	if (!(sc->gfp_mask & (__GFP_WAIT)))
+		ratio <<= 1;
+
+	return isolated > inactive * ratio;
 }
 
 /*
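
(With these shifts, a GFP_NOFS reclaimer ends up with a ratio of 2, a
GFP_NOIO one with 4, and an allocation with none of the three flags with 8,
if I'm reading the flag combinations right.)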

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  3:06           ` Wu Fengguang
@ 2010-09-15  3:13             ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-09-15  3:13 UTC (permalink / raw)
  To: Neil Brown
  Cc: Rik van Riel, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm

On Wed, Sep 15, 2010 at 11:06:40AM +0800, Wu Fengguang wrote:
> On Wed, Sep 15, 2010 at 10:54:54AM +0800, Wu Fengguang wrote:
> > On Wed, Sep 15, 2010 at 10:37:35AM +0800, Wu Fengguang wrote:
> > > On Wed, Sep 15, 2010 at 10:23:34AM +0800, Neil Brown wrote:
> > > > On Tue, 14 Sep 2010 20:30:18 -0400
> > > > Rik van Riel <riel@redhat.com> wrote:
> > > > 
> > > > > On 09/14/2010 07:11 PM, Neil Brown wrote:
> > > > > 
> > > > > > Index: linux-2.6.32-SLE11-SP1/mm/vmscan.c
> > > > > > ===================================================================
> > > > > > --- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
> > > > > > +++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 08:38:57.000000000 +1000
> > > > > > @@ -1106,6 +1106,11 @@ static unsigned long shrink_inactive_lis
> > > > > >   		/* We are about to die and free our memory. Return now. */
> > > > > >   		if (fatal_signal_pending(current))
> > > > > >   			return SWAP_CLUSTER_MAX;
> > > > > > +		if (!(sc->gfp_mask&  __GFP_IO))
> > > > > > +			/* Not allowed to do IO, so mustn't wait
> > > > > > +			 * on processes that might try to
> > > > > > +			 */
> > > > > > +			return SWAP_CLUSTER_MAX;
> > > > > >   	}
> > > > > >
> > > > > >   	/*
> > > > > 
> > > > > Close.  We must also be sure that processes without __GFP_FS
> > > > > set in their gfp_mask do not wait on processes that do have
> > > > > __GFP_FS set.
> > > > > 
> > > > > Considering how many times we've run into a bug like this,
> > > > > I'm kicking myself for not having thought of it :(
> > > > > 
> > > > 
> > > > So maybe this?  I've added the test for __GFP_FS, and moved the test before
> > > > the congestion_wait on the basis that we really want to get back up the stack
> > > > and try the mempool ASAP.
> > > 
> > > The patch may well fail the !__GFP_IO page allocation and then
> > > quickly exhaust the mempool.
> > > 
> > > Another approach may to let too_many_isolated() use much higher
> > > thresholds for !__GFP_IO/FS and lower ones for __GFP_IO/FS. ie. to
> > > allow at least nr2 NOIO/FS tasks to be blocked independent of the
> > > IO/FS ones.  Since NOIO vmscans typically completes fast, it will then
> > > very hard to accumulate enough NOIO processes to be actually blocked.
> > > 
> > > 
> > >                   IO/FS tasks                NOIO/FS tasks           full
> > >                   block here                 block here              LRU size
> > > |-----------------|--------------------------|-----------------------|
> > > |      nr1        |           nr2            |
> > 
> > How about this fix? We may need very high threshold for NOIO/NOFS to
> > prevent possible regressions.
> 
> Plus __GFP_WAIT..

Ah sorry! An allocation without __GFP_WAIT cannot afford to wait by definition..

---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 225a759..becc63a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1135,10 +1135,14 @@ static int too_many_isolated(struct zone *zone, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
+	int ratio;
 
 	if (current_is_kswapd())
 		return 0;
 
+	if (!(sc->gfp_mask & __GFP_WAIT))
+		return 0;
+
 	if (!scanning_global_lru(sc))
 		return 0;
 
@@ -1150,7 +1154,9 @@ static int too_many_isolated(struct zone *zone, int file,
 		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
 	}
 
-	return isolated > inactive;
+	ratio = sc->gfp_mask & (__GFP_IO | __GFP_FS) ? 1 : 8;
+
+	return isolated > inactive * ratio;
 }
 
 /*
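
(To spell out the combined effect: reclaimers that are not allowed to wait
are never throttled here at all, and NOIO/NOFS reclaimers are only
throttled once isolated pages exceed eight times the inactive list, rather
than as soon as they merely exceed it.)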

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  3:06           ` Wu Fengguang
@ 2010-09-15  3:17             ` Neil Brown
  -1 siblings, 0 replies; 116+ messages in thread
From: Neil Brown @ 2010-09-15  3:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm

On Wed, 15 Sep 2010 11:06:40 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Wed, Sep 15, 2010 at 10:54:54AM +0800, Wu Fengguang wrote:
> > On Wed, Sep 15, 2010 at 10:37:35AM +0800, Wu Fengguang wrote:
> > > On Wed, Sep 15, 2010 at 10:23:34AM +0800, Neil Brown wrote:
> > > > On Tue, 14 Sep 2010 20:30:18 -0400
> > > > Rik van Riel <riel@redhat.com> wrote:
> > > > 
> > > > > On 09/14/2010 07:11 PM, Neil Brown wrote:
> > > > > 
> > > > > > Index: linux-2.6.32-SLE11-SP1/mm/vmscan.c
> > > > > > ===================================================================
> > > > > > --- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
> > > > > > +++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 08:38:57.000000000 +1000
> > > > > > @@ -1106,6 +1106,11 @@ static unsigned long shrink_inactive_lis
> > > > > >   		/* We are about to die and free our memory. Return now. */
> > > > > >   		if (fatal_signal_pending(current))
> > > > > >   			return SWAP_CLUSTER_MAX;
> > > > > > +		if (!(sc->gfp_mask&  __GFP_IO))
> > > > > > +			/* Not allowed to do IO, so mustn't wait
> > > > > > +			 * on processes that might try to
> > > > > > +			 */
> > > > > > +			return SWAP_CLUSTER_MAX;
> > > > > >   	}
> > > > > >
> > > > > >   	/*
> > > > > 
> > > > > Close.  We must also be sure that processes without __GFP_FS
> > > > > set in their gfp_mask do not wait on processes that do have
> > > > > __GFP_FS set.
> > > > > 
> > > > > Considering how many times we've run into a bug like this,
> > > > > I'm kicking myself for not having thought of it :(
> > > > > 
> > > > 
> > > > So maybe this?  I've added the test for __GFP_FS, and moved the test before
> > > > the congestion_wait on the basis that we really want to get back up the stack
> > > > and try the mempool ASAP.
> > > 
> > > The patch may well fail the !__GFP_IO page allocation and then
> > > quickly exhaust the mempool.
> > > 
> > > Another approach may to let too_many_isolated() use much higher
> > > thresholds for !__GFP_IO/FS and lower ones for __GFP_IO/FS. ie. to
> > > allow at least nr2 NOIO/FS tasks to be blocked independent of the
> > > IO/FS ones.  Since NOIO vmscans typically completes fast, it will then
> > > very hard to accumulate enough NOIO processes to be actually blocked.
> > > 
> > > 
> > >                   IO/FS tasks                NOIO/FS tasks           full
> > >                   block here                 block here              LRU size
> > > |-----------------|--------------------------|-----------------------|
> > > |      nr1        |           nr2            |
> > 
> > How about this fix? We may need very high threshold for NOIO/NOFS to
> > prevent possible regressions.
> 
> Plus __GFP_WAIT..
> 
> ---
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 225a759..6a896eb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1135,6 +1135,7 @@ static int too_many_isolated(struct zone *zone, int file,
>  		struct scan_control *sc)
>  {
>  	unsigned long inactive, isolated;
> +	int ratio;
>  
>  	if (current_is_kswapd())
>  		return 0;
> @@ -1150,7 +1151,15 @@ static int too_many_isolated(struct zone *zone, int file,
>  		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
>  	}
>  
> -	return isolated > inactive;
> +	ratio = 1;
> +	if (!(sc->gfp_mask & (__GFP_FS)))
> +		ratio <<= 1;
> +	if (!(sc->gfp_mask & (__GFP_IO)))
> +		ratio <<= 1;
> +	if (!(sc->gfp_mask & (__GFP_WAIT)))
> +		ratio <<= 1;
> +
> +	return isolated > inactive * ratio;
>  }
>  
>  /*


Are you suggesting this instead of my patch, or as well as my patch?

Because, while I think it sounds like a good idea, I don't think it actually
removes the chance of a deadlock; it just makes it a lot less likely.
So I think your patch combined with my patch would be a good total solution.

Do you agree?

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  3:13             ` Wu Fengguang
@ 2010-09-15  3:18               ` Shaohua Li
  -1 siblings, 0 replies; 116+ messages in thread
From: Shaohua Li @ 2010-09-15  3:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Neil Brown, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, 2010-09-15 at 11:13 +0800, Wu Fengguang wrote:
> On Wed, Sep 15, 2010 at 11:06:40AM +0800, Wu Fengguang wrote:
> > On Wed, Sep 15, 2010 at 10:54:54AM +0800, Wu Fengguang wrote:
> > > On Wed, Sep 15, 2010 at 10:37:35AM +0800, Wu Fengguang wrote:
> > > > On Wed, Sep 15, 2010 at 10:23:34AM +0800, Neil Brown wrote:
> > > > > On Tue, 14 Sep 2010 20:30:18 -0400
> > > > > Rik van Riel <riel@redhat.com> wrote:
> > > > > 
> > > > > > On 09/14/2010 07:11 PM, Neil Brown wrote:
> > > > > > 
> > > > > > > Index: linux-2.6.32-SLE11-SP1/mm/vmscan.c
> > > > > > > ===================================================================
> > > > > > > --- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
> > > > > > > +++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 08:38:57.000000000 +1000
> > > > > > > @@ -1106,6 +1106,11 @@ static unsigned long shrink_inactive_lis
> > > > > > >   		/* We are about to die and free our memory. Return now. */
> > > > > > >   		if (fatal_signal_pending(current))
> > > > > > >   			return SWAP_CLUSTER_MAX;
> > > > > > > +		if (!(sc->gfp_mask&  __GFP_IO))
> > > > > > > +			/* Not allowed to do IO, so mustn't wait
> > > > > > > +			 * on processes that might try to
> > > > > > > +			 */
> > > > > > > +			return SWAP_CLUSTER_MAX;
> > > > > > >   	}
> > > > > > >
> > > > > > >   	/*
> > > > > > 
> > > > > > Close.  We must also be sure that processes without __GFP_FS
> > > > > > set in their gfp_mask do not wait on processes that do have
> > > > > > __GFP_FS set.
> > > > > > 
> > > > > > Considering how many times we've run into a bug like this,
> > > > > > I'm kicking myself for not having thought of it :(
> > > > > > 
> > > > > 
> > > > > So maybe this?  I've added the test for __GFP_FS, and moved the test before
> > > > > the congestion_wait on the basis that we really want to get back up the stack
> > > > > and try the mempool ASAP.
> > > > 
> > > > The patch may well fail the !__GFP_IO page allocation and then
> > > > quickly exhaust the mempool.
> > > > 
> > > > Another approach may to let too_many_isolated() use much higher
> > > > thresholds for !__GFP_IO/FS and lower ones for __GFP_IO/FS. ie. to
> > > > allow at least nr2 NOIO/FS tasks to be blocked independent of the
> > > > IO/FS ones.  Since NOIO vmscans typically completes fast, it will then
> > > > very hard to accumulate enough NOIO processes to be actually blocked.
> > > > 
> > > > 
> > > >                   IO/FS tasks                NOIO/FS tasks           full
> > > >                   block here                 block here              LRU size
> > > > |-----------------|--------------------------|-----------------------|
> > > > |      nr1        |           nr2            |
> > > 
> > > How about this fix? We may need very high threshold for NOIO/NOFS to
> > > prevent possible regressions.
> > 
> > Plus __GFP_WAIT..
> 
> Ah sorry! __GFP_WAIT cannot afford to wait by definition..
> 
> ---
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 225a759..becc63a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1135,10 +1135,14 @@ static int too_many_isolated(struct zone *zone, int file,
>  		struct scan_control *sc)
>  {
>  	unsigned long inactive, isolated;
> +	int ratio;
>  
>  	if (current_is_kswapd())
>  		return 0;
>  
> +	if (!(sc->gfp_mask & __GFP_WAIT))
> +		return 0;
> +
it appears an allocation without __GFP_WAIT doesn't go to direct reclaim anyway.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  3:18               ` Shaohua Li
@ 2010-09-15  3:31                 ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-09-15  3:31 UTC (permalink / raw)
  To: Li, Shaohua
  Cc: Neil Brown, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, Sep 15, 2010 at 11:18:32AM +0800, Li, Shaohua wrote:

> > +	if (!(sc->gfp_mask & __GFP_WAIT))
> > +		return 0;
> > +
> it appears __GFP_WAIT allocation doesn't go to direct reclaim.

Good point! So we are returning to its very first version ;)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1135,6 +1135,7 @@ static int too_many_isolated(struct zone *zone, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
+	int ratio;
 
 	if (current_is_kswapd())
 		return 0;
@@ -1150,7 +1151,9 @@ static int too_many_isolated(struct zone *zone, int file,
 		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
 	}
 
-	return isolated > inactive;
+	ratio = sc->gfp_mask & (__GFP_IO | __GFP_FS) ? 1 : 8;
+
+	return isolated > inactive * ratio;
 }
 
 /*
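
As a small standalone model of the patched predicate (made-up flag
values and page counts, not kernel code), showing what the 1:8 ratio
buys: with 1000 inactive pages and 3000 already isolated, a
GFP_KERNEL-style reclaimer is throttled while a GFP_NOIO one still
proceeds:

#include <stdbool.h>
#include <stdio.h>

/* made-up stand-ins, not the kernel's flag values */
#define MY_GFP_IO	0x40u
#define MY_GFP_FS	0x80u

/* same shape as the patched too_many_isolated() above */
static bool throttled(unsigned long inactive, unsigned long isolated,
		      unsigned int gfp_mask)
{
	int ratio = (gfp_mask & (MY_GFP_IO | MY_GFP_FS)) ? 1 : 8;

	return isolated > inactive * ratio;
}

int main(void)
{
	printf("GFP_KERNEL-like reclaimer throttled: %d\n",
	       throttled(1000, 3000, MY_GFP_IO | MY_GFP_FS));
	printf("GFP_NOIO-like reclaimer throttled:   %d\n",
	       throttled(1000, 3000, 0));
	return 0;
}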

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  3:17             ` Neil Brown
@ 2010-09-15  3:47               ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-09-15  3:47 UTC (permalink / raw)
  To: Neil Brown
  Cc: Rik van Riel, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm

On Wed, Sep 15, 2010 at 11:17:35AM +0800, Neil Brown wrote:
> On Wed, 15 Sep 2010 11:06:40 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Wed, Sep 15, 2010 at 10:54:54AM +0800, Wu Fengguang wrote:
> > > On Wed, Sep 15, 2010 at 10:37:35AM +0800, Wu Fengguang wrote:
> > > > On Wed, Sep 15, 2010 at 10:23:34AM +0800, Neil Brown wrote:
> > > > > On Tue, 14 Sep 2010 20:30:18 -0400
> > > > > Rik van Riel <riel@redhat.com> wrote:
> > > > > 
> > > > > > On 09/14/2010 07:11 PM, Neil Brown wrote:
> > > > > > 
> > > > > > > Index: linux-2.6.32-SLE11-SP1/mm/vmscan.c
> > > > > > > ===================================================================
> > > > > > > --- linux-2.6.32-SLE11-SP1.orig/mm/vmscan.c	2010-09-15 08:37:32.000000000 +1000
> > > > > > > +++ linux-2.6.32-SLE11-SP1/mm/vmscan.c	2010-09-15 08:38:57.000000000 +1000
> > > > > > > @@ -1106,6 +1106,11 @@ static unsigned long shrink_inactive_lis
> > > > > > >   		/* We are about to die and free our memory. Return now. */
> > > > > > >   		if (fatal_signal_pending(current))
> > > > > > >   			return SWAP_CLUSTER_MAX;
> > > > > > > +		if (!(sc->gfp_mask&  __GFP_IO))
> > > > > > > +			/* Not allowed to do IO, so mustn't wait
> > > > > > > +			 * on processes that might try to
> > > > > > > +			 */
> > > > > > > +			return SWAP_CLUSTER_MAX;
> > > > > > >   	}
> > > > > > >
> > > > > > >   	/*
> > > > > > 
> > > > > > Close.  We must also be sure that processes without __GFP_FS
> > > > > > set in their gfp_mask do not wait on processes that do have
> > > > > > __GFP_FS set.
> > > > > > 
> > > > > > Considering how many times we've run into a bug like this,
> > > > > > I'm kicking myself for not having thought of it :(
> > > > > > 
> > > > > 
> > > > > So maybe this?  I've added the test for __GFP_FS, and moved the test before
> > > > > the congestion_wait on the basis that we really want to get back up the stack
> > > > > and try the mempool ASAP.
> > > > 
> > > > The patch may well fail the !__GFP_IO page allocation and then
> > > > quickly exhaust the mempool.
> > > > 
> > > > Another approach may to let too_many_isolated() use much higher
> > > > thresholds for !__GFP_IO/FS and lower ones for __GFP_IO/FS. ie. to
> > > > allow at least nr2 NOIO/FS tasks to be blocked independent of the
> > > > IO/FS ones.  Since NOIO vmscans typically completes fast, it will then
> > > > very hard to accumulate enough NOIO processes to be actually blocked.
> > > > 
> > > > 
> > > >                   IO/FS tasks                NOIO/FS tasks           full
> > > >                   block here                 block here              LRU size
> > > > |-----------------|--------------------------|-----------------------|
> > > > |      nr1        |           nr2            |
> > > 
> > > How about this fix? We may need very high threshold for NOIO/NOFS to
> > > prevent possible regressions.
> > 
> > Plus __GFP_WAIT..
> > 
> > ---
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 225a759..6a896eb 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1135,6 +1135,7 @@ static int too_many_isolated(struct zone *zone, int file,
> >  		struct scan_control *sc)
> >  {
> >  	unsigned long inactive, isolated;
> > +	int ratio;
> >  
> >  	if (current_is_kswapd())
> >  		return 0;
> > @@ -1150,7 +1151,15 @@ static int too_many_isolated(struct zone *zone, int file,
> >  		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> >  	}
> >  
> > -	return isolated > inactive;
> > +	ratio = 1;
> > +	if (!(sc->gfp_mask & (__GFP_FS)))
> > +		ratio <<= 1;
> > +	if (!(sc->gfp_mask & (__GFP_IO)))
> > +		ratio <<= 1;
> > +	if (!(sc->gfp_mask & (__GFP_WAIT)))
> > +		ratio <<= 1;
> > +
> > +	return isolated > inactive * ratio;
> >  }
> >  
> >  /*
> 
> 
> Are you suggesting this instead of my patch, or as well as my patch?

Your patch surely breaks the deadlock; however, it might reintroduce the
old problem that too_many_isolated() tried to address.

> Because while I think it sounds like a good idea I don't think it actually
> removes the chance of a deadlock, just makes it a lot less likely.
> So I think your patch combined with my patch would be a good total solution.

By deadlock you mean IO/FS tasks (blocked on an FS lock) blocking the
NOIO/FS tasks? I think raising the threshold for NOIO/FS would be
sufficient to break the deadlock: if the NOIO/FS tasks get blocked at
all, it will simply be because there are so many NOIO/FS tasks
competing with each other. They do not inherently depend on the
release of FS locks to proceed.

too_many_isolated() was introduced initially to prevent OOM for a
fork-bomb workload, where no IO is involved (so no FS locks). If we
remove the congestion wait for NOIO/FS tasks entirely, the OOM may
arise again for the fork-bomb workload.

So I'd suggest using a sufficiently high threshold for NOIO/FS, while
still limiting the number of concurrent NOIO/FS allocations.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  2:23     ` Neil Brown
@ 2010-09-15  8:28       ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-09-15  8:28 UTC (permalink / raw)
  To: Neil Brown
  Cc: Rik van Riel, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, Li Shaohua

Neil,

Sorry for the rushed and imaginary ideas this morning..

> @@ -1101,6 +1101,12 @@ static unsigned long shrink_inactive_lis
>  	int lumpy_reclaim = 0;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
> +		if ((sc->gfp_mask & GFP_IOFS) != GFP_IOFS)
> +			/* Not allowed to do IO, so mustn't wait
> +			 * on processes that might try to
> +			 */
> +			return SWAP_CLUSTER_MAX;
> +

The above patch should behave like this: it returns SWAP_CLUSTER_MAX
to cheat all the way up into believing "enough pages have been reclaimed".
So __alloc_pages_direct_reclaim() sees non-zero *did_some_progress and
goes on to call get_page_from_freelist(). That normally fails because
the task didn't really scan the LRU lists. However it does have a
chance to succeed -- when so many processes are doing concurrent
direct reclaims, it may luckily get one free page reclaimed by other
tasks. What's more, if it does fail to get a free page, the upper
layer __alloc_pages_slowpath() will repeatedly call
__alloc_pages_direct_reclaim(). So, sooner or later it will succeed in
"stealing" a free page reclaimed by other tasks.

In summary, the patch behavior for !__GFP_IO/FS is
- won't do any page reclaim
- won't fail the page allocation (unexpected)
- will wait and steal one free page from others (unreasonable)
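
As a rough standalone model of the retry behaviour described above
(toy numbers, not the real allocator):

#include <stdbool.h>
#include <stdio.h>

/* Toy model: as long as "progress" is reported, the slowpath keeps
 * retrying the freelist, so a task that fakes progress spins until
 * another task frees a page it can grab.
 */
static int free_pages;

static bool get_page_from_freelist_model(void)
{
	if (free_pages > 0) {
		free_pages--;
		return true;
	}
	return false;
}

int main(void)
{
	int attempts = 0;
	bool got_page = false;

	while (!got_page) {
		int did_some_progress = 32;	/* faked SWAP_CLUSTER_MAX */

		attempts++;
		if (did_some_progress)
			got_page = get_page_from_freelist_model();
		if (attempts == 5)
			free_pages = 1;	/* someone else reclaims a page */
	}
	printf("\"stole\" a free page after %d attempts\n", attempts);
	return 0;
}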

So it will address the problem you encountered; however, that sounds
like pretty unexpected and illogical behavior, right?

I believe this patch will address the problem equally well.
What do you think?

Thanks,
Fengguang
---

mm: Avoid possible deadlock caused by too_many_isolated()

Neil finds that if too_many_isolated() returns true while performing
direct reclaim we can end up waiting for other threads to complete their
direct reclaim.  If those threads are allowed to enter the FS or IO to
free memory, but this thread is not, then it is possible that those
threads will be waiting on this thread and so we get a circular
deadlock.

some task enters direct reclaim with GFP_KERNEL
  => too_many_isolated() false
    => vmscan and run into dirty pages
      => pageout()
        => take some FS lock
	  => fs/block code does GFP_NOIO allocation
	    => enter direct reclaim again
	      => too_many_isolated() true
		=> waiting for others to progress, however the other
		   tasks may be circular waiting for the FS lock..

The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
priority than normal ones, by granting them a higher throttle threshold.

Now !__GFP_IO/FS reclaims won't be waiting for __GFP_IO/FS reclaims to
progress. They will be blocked only when there are too many concurrent
!__GFP_IO/FS reclaims, which is very unlikely because the IO-less
direct reclaims are able to progress much faster, and they won't
deadlock each other. The threshold is raised high enough for them, so
that there can be sufficient parallel progress of !__GFP_IO/FS reclaims.

Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/vmscan.c	2010-09-15 11:58:58.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-09-15 15:36:14.000000000 +0800
@@ -1141,36 +1141,39 @@ int isolate_lru_page(struct page *page)
 	return ret;
 }
 
 /*
  * Are there way too many processes in the direct reclaim path already?
  */
 static int too_many_isolated(struct zone *zone, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
+	int ratio;
 
 	if (current_is_kswapd())
 		return 0;
 
 	if (!scanning_global_lru(sc))
 		return 0;
 
 	if (file) {
 		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
 		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
 	} else {
 		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
 		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
 	}
 
-	return isolated > inactive;
+	ratio = sc->gfp_mask & (__GFP_IO | __GFP_FS) ? 1 : 8;
+
+	return isolated > inactive * ratio;
 }
 
 /*
  * TODO: Try merging with migrations version of putback_lru_pages
  */
 static noinline_for_stack void
 putback_lru_pages(struct zone *zone, struct scan_control *sc,
 				unsigned long nr_anon, unsigned long nr_file,
 				struct list_head *page_list)
 {

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  8:28       ` Wu Fengguang
@ 2010-09-15  8:44         ` Neil Brown
  -1 siblings, 0 replies; 116+ messages in thread
From: Neil Brown @ 2010-09-15  8:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, Li Shaohua

On Wed, 15 Sep 2010 16:28:43 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Neil,
> 
> Sorry for the rushed and imaginary ideas this morning..
> 
> > @@ -1101,6 +1101,12 @@ static unsigned long shrink_inactive_lis
> >  	int lumpy_reclaim = 0;
> >  
> >  	while (unlikely(too_many_isolated(zone, file, sc))) {
> > +		if ((sc->gfp_mask & GFP_IOFS) != GFP_IOFS)
> > +			/* Not allowed to do IO, so mustn't wait
> > +			 * on processes that might try to
> > +			 */
> > +			return SWAP_CLUSTER_MAX;
> > +
> 
> The above patch should behavior like this: it returns SWAP_CLUSTER_MAX
> to cheat all the way up to believe "enough pages have been reclaimed".
> So __alloc_pages_direct_reclaim() see non-zero *did_some_progress and
> go on to call get_page_from_freelist(). That normally fails because
> the task didn't really scanned the LRU lists. However it does have the
> possibility to succeed -- when so many processes are doing concurrent
> direct reclaims, it may luckily get one free page reclaimed by other
> tasks. What's more, if it does fail to get a free page, the upper
> layer __alloc_pages_slowpath() will be repeat recalling
> __alloc_pages_direct_reclaim(). So, sooner or later it will succeed in
> "stealing" a free page reclaimed by other tasks.
> 
> In summary, the patch behavior for !__GFP_IO/FS is
> - won't do any page reclaim
> - won't fail the page allocation (unexpected)
> - will wait and steal one free page from others (unreasonable)
> 
> So it will address the problem you encountered, however it sounds
> pretty unexpected and illogical behavior, right?
> 
> I believe this patch will address the problem equally well.
> What do you think?

Thank you for the detailed explanation.  I agree with your reasoning and
now understand why your patch is sufficient.

I will get it tested and let you know how that goes.

Thanks,
NeilBrown


> 
> Thanks,
> Fengguang
> ---
> 
> mm: Avoid possible deadlock caused by too_many_isolated()
> 
> Neil finds that if too_many_isolated() returns true while performing
> direct reclaim we can end up waiting for other threads to complete their
> direct reclaim.  If those threads are allowed to enter the FS or IO to
> free memory, but this thread is not, then it is possible that those
> threads will be waiting on this thread and so we get a circular
> deadlock.
> 
> some task enters direct reclaim with GFP_KERNEL
>   => too_many_isolated() false
>     => vmscan and run into dirty pages
>       => pageout()
>         => take some FS lock
> 	  => fs/block code does GFP_NOIO allocation
> 	    => enter direct reclaim again
> 	      => too_many_isolated() true
> 		=> waiting for others to progress, however the other
> 		   tasks may be circular waiting for the FS lock..
> 
> The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
> priority than normal ones, by honouring them higher throttle threshold.
> 
> Now !__GFP_IO/FS reclaims won't be waiting for __GFP_IO/FS reclaims to
> progress. They will be blocked only when there are too many concurrent
> !__GFP_IO/FS reclaims, however that's very unlikely because the IO-less
> direct reclaims is able to progress much more faster, and they won't
> deadlock each other. The threshold is raised high enough for them, so
> that there can be sufficient parallel progress of !__GFP_IO/FS reclaims.
> 
> Reported-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/vmscan.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> --- linux-next.orig/mm/vmscan.c	2010-09-15 11:58:58.000000000 +0800
> +++ linux-next/mm/vmscan.c	2010-09-15 15:36:14.000000000 +0800
> @@ -1141,36 +1141,39 @@ int isolate_lru_page(struct page *page)
>  	return ret;
>  }
>  
>  /*
>   * Are there way too many processes in the direct reclaim path already?
>   */
>  static int too_many_isolated(struct zone *zone, int file,
>  		struct scan_control *sc)
>  {
>  	unsigned long inactive, isolated;
> +	int ratio;
>  
>  	if (current_is_kswapd())
>  		return 0;
>  
>  	if (!scanning_global_lru(sc))
>  		return 0;
>  
>  	if (file) {
>  		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
>  		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
>  	} else {
>  		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
>  		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
>  	}
>  
> -	return isolated > inactive;
> +	ratio = sc->gfp_mask & (__GFP_IO | __GFP_FS) ? 1 : 8;
> +
> +	return isolated > inactive * ratio;
>  }
>  
>  /*
>   * TODO: Try merging with migrations version of putback_lru_pages
>   */
>  static noinline_for_stack void
>  putback_lru_pages(struct zone *zone, struct scan_control *sc,
>  				unsigned long nr_anon, unsigned long nr_file,
>  				struct list_head *page_list)
>  {


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-09-15  8:44         ` Neil Brown
@ 2010-10-18  4:14           ` Neil Brown
  -1 siblings, 0 replies; 116+ messages in thread
From: Neil Brown @ 2010-10-18  4:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Rik van Riel, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, Li Shaohua

On Wed, 15 Sep 2010 18:44:34 +1000
Neil Brown <neilb@suse.de> wrote:

> On Wed, 15 Sep 2010 16:28:43 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Neil,
> > 
> > Sorry for the rushed and imaginary ideas this morning..
> > 
> > > @@ -1101,6 +1101,12 @@ static unsigned long shrink_inactive_lis
> > >  	int lumpy_reclaim = 0;
> > >  
> > >  	while (unlikely(too_many_isolated(zone, file, sc))) {
> > > +		if ((sc->gfp_mask & GFP_IOFS) != GFP_IOFS)
> > > +			/* Not allowed to do IO, so mustn't wait
> > > +			 * on processes that might try to
> > > +			 */
> > > +			return SWAP_CLUSTER_MAX;
> > > +
> > 
> > The above patch should behavior like this: it returns SWAP_CLUSTER_MAX
> > to cheat all the way up to believe "enough pages have been reclaimed".
> > So __alloc_pages_direct_reclaim() see non-zero *did_some_progress and
> > go on to call get_page_from_freelist(). That normally fails because
> > the task didn't really scanned the LRU lists. However it does have the
> > possibility to succeed -- when so many processes are doing concurrent
> > direct reclaims, it may luckily get one free page reclaimed by other
> > tasks. What's more, if it does fail to get a free page, the upper
> > layer __alloc_pages_slowpath() will be repeat recalling
> > __alloc_pages_direct_reclaim(). So, sooner or later it will succeed in
> > "stealing" a free page reclaimed by other tasks.
> > 
> > In summary, the patch behavior for !__GFP_IO/FS is
> > - won't do any page reclaim
> > - won't fail the page allocation (unexpected)
> > - will wait and steal one free page from others (unreasonable)
> > 
> > So it will address the problem you encountered, however it sounds
> > pretty unexpected and illogical behavior, right?
> > 
> > I believe this patch will address the problem equally well.
> > What do you think?
> 
> Thank you for the detailed explanation.  Is agree with your reasoning and
> now understand why your patch is sufficient.
> 
> I will get it tested and let you know how that goes.

(sorry this has taken a month to follow up).

Testing shows that this patch seems to work.
The test load (essentially kernbench) doesn't deadlock any more, though it
does get bogged down thrashing in swap so it doesn't make a lot more
progress :-)  I guess that is to be expected.

One observation is that kernbench generated 10%-20% more context switches
with the patch than without.  Is that to be expected?

Do you have plans for sending this patch upstream?

Thanks,
NeilBrown


> 
> Thanks,
> NeilBrown
> 
> 
> > 
> > Thanks,
> > Fengguang
> > ---
> > 
> > mm: Avoid possible deadlock caused by too_many_isolated()
> > 
> > Neil finds that if too_many_isolated() returns true while performing
> > direct reclaim we can end up waiting for other threads to complete their
> > direct reclaim.  If those threads are allowed to enter the FS or IO to
> > free memory, but this thread is not, then it is possible that those
> > threads will be waiting on this thread and so we get a circular
> > deadlock.
> > 
> > some task enters direct reclaim with GFP_KERNEL
> >   => too_many_isolated() false
> >     => vmscan and run into dirty pages
> >       => pageout()
> >         => take some FS lock
> > 	  => fs/block code does GFP_NOIO allocation
> > 	    => enter direct reclaim again
> > 	      => too_many_isolated() true
> > 		=> waiting for others to progress, however the other
> > 		   tasks may be circular waiting for the FS lock..
> > 
> > The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
> > priority than normal ones, by honouring them higher throttle threshold.
> > 
> > Now !__GFP_IO/FS reclaims won't be waiting for __GFP_IO/FS reclaims to
> > progress. They will be blocked only when there are too many concurrent
> > !__GFP_IO/FS reclaims, however that's very unlikely because the IO-less
> > direct reclaims is able to progress much more faster, and they won't
> > deadlock each other. The threshold is raised high enough for them, so
> > that there can be sufficient parallel progress of !__GFP_IO/FS reclaims.
> > 
> > Reported-by: NeilBrown <neilb@suse.de>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/vmscan.c |    5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > --- linux-next.orig/mm/vmscan.c	2010-09-15 11:58:58.000000000 +0800
> > +++ linux-next/mm/vmscan.c	2010-09-15 15:36:14.000000000 +0800
> > @@ -1141,36 +1141,39 @@ int isolate_lru_page(struct page *page)
> >  	return ret;
> >  }
> >  
> >  /*
> >   * Are there way too many processes in the direct reclaim path already?
> >   */
> >  static int too_many_isolated(struct zone *zone, int file,
> >  		struct scan_control *sc)
> >  {
> >  	unsigned long inactive, isolated;
> > +	int ratio;
> >  
> >  	if (current_is_kswapd())
> >  		return 0;
> >  
> >  	if (!scanning_global_lru(sc))
> >  		return 0;
> >  
> >  	if (file) {
> >  		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> >  		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> >  	} else {
> >  		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> >  		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> >  	}
> >  
> > -	return isolated > inactive;
> > +	ratio = sc->gfp_mask & (__GFP_IO | __GFP_FS) ? 1 : 8;
> > +
> > +	return isolated > inactive * ratio;
> >  }
> >  
> >  /*
> >   * TODO: Try merging with migrations version of putback_lru_pages
> >   */
> >  static noinline_for_stack void
> >  putback_lru_pages(struct zone *zone, struct scan_control *sc,
> >  				unsigned long nr_anon, unsigned long nr_file,
> >  				struct list_head *page_list)
> >  {
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-18  4:14           ` Neil Brown
@ 2010-10-18  5:04             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 116+ messages in thread
From: KOSAKI Motohiro @ 2010-10-18  5:04 UTC (permalink / raw)
  To: Neil Brown, Wu Fengguang
  Cc: kosaki.motohiro, Rik van Riel, Andrew Morton, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, Li Shaohua

> On Wed, 15 Sep 2010 18:44:34 +1000
> Neil Brown <neilb@suse.de> wrote:
> 
> > On Wed, 15 Sep 2010 16:28:43 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Neil,
> > > 
> > > Sorry for the rushed and imaginary ideas this morning..
> > > 
> > > > @@ -1101,6 +1101,12 @@ static unsigned long shrink_inactive_lis
> > > >  	int lumpy_reclaim = 0;
> > > >  
> > > >  	while (unlikely(too_many_isolated(zone, file, sc))) {
> > > > +		if ((sc->gfp_mask & GFP_IOFS) != GFP_IOFS)
> > > > +			/* Not allowed to do IO, so mustn't wait
> > > > +			 * on processes that might try to
> > > > +			 */
> > > > +			return SWAP_CLUSTER_MAX;
> > > > +
> > > 
> > > The above patch should behavior like this: it returns SWAP_CLUSTER_MAX
> > > to cheat all the way up to believe "enough pages have been reclaimed".
> > > So __alloc_pages_direct_reclaim() see non-zero *did_some_progress and
> > > go on to call get_page_from_freelist(). That normally fails because
> > > the task didn't really scanned the LRU lists. However it does have the
> > > possibility to succeed -- when so many processes are doing concurrent
> > > direct reclaims, it may luckily get one free page reclaimed by other
> > > tasks. What's more, if it does fail to get a free page, the upper
> > > layer __alloc_pages_slowpath() will be repeat recalling
> > > __alloc_pages_direct_reclaim(). So, sooner or later it will succeed in
> > > "stealing" a free page reclaimed by other tasks.
> > > 
> > > In summary, the patch behavior for !__GFP_IO/FS is
> > > - won't do any page reclaim
> > > - won't fail the page allocation (unexpected)
> > > - will wait and steal one free page from others (unreasonable)
> > > 
> > > So it will address the problem you encountered, however it sounds
> > > pretty unexpected and illogical behavior, right?
> > > 
> > > I believe this patch will address the problem equally well.
> > > What do you think?
> > 
> > Thank you for the detailed explanation.  Is agree with your reasoning and
> > now understand why your patch is sufficient.
> > 
> > I will get it tested and let you know how that goes.
> 
> (sorry this has taken a month to follow up).
> 
> Testing shows that this patch seems to work.
> The test load (essentially kernbench) doesn't deadlock any more, though it
> does get bogged down thrashing in swap so it doesn't make a lot more
> progress :-)  I guess that is to be expected.
> 
> One observation is that the kernbench generated 10%-20% more context switches
> with the patch than without.  Is that to be expected?
> 
> Do you have plans for sending this patch upstream?

Wow, I had thought this patch had been merged already. Wu, can you please
repost this one, and add my and Neil's ack tags?

Thanks.




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-18  4:14           ` Neil Brown
@ 2010-10-18 10:58             ` Torsten Kaiser
  -1 siblings, 0 replies; 116+ messages in thread
From: Torsten Kaiser @ 2010-10-18 10:58 UTC (permalink / raw)
  To: Neil Brown
  Cc: Wu Fengguang, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li Shaohua

On Mon, Oct 18, 2010 at 6:14 AM, Neil Brown <neilb@suse.de> wrote:
> Testing shows that this patch seems to work.
> The test load (essentially kernbench) doesn't deadlock any more, though it
> does get bogged down thrashing in swap so it doesn't make a lot more
> progress :-)  I guess that is to be expected.

I just noticed this thread, as your mail from today pushed it up.

In your original mail you wrote: " I recently had a customer (running
2.6.32) report a deadlock during very intensive IO with lots of
processes. " and " Some threads that are blocked there, hold some IO
lock (probably in the filesystem) and are trying to allocate memory
inside the block device (md/raid1 to be precise) which is allocating
with GFP_NOIO and has a mempool to fall back on."

I recently had the same problem (intense IO due to a swapstorm created
by 20 gcc processes hung my system) and, after I initially blamed the
workqueue changes in 2.6.36, Tejun Heo determined that my problem was
not the workqueues getting locked up, but that it was caused by an
exhausted mempool:
http://marc.info/?l=linux-kernel&m=128655737012549&w=2

Instrumenting mm/mempool.c and retrying my workload showed that
fs_bio_set from fs/bio.c looked like the mempool to blame and the code
in drivers/md/raid1.c to be the misuser:
http://marc.info/?l=linux-kernel&m=128671179817823&w=2

I was even able to reproduce this hang using only a normal RAID1
md device as swapspace and then using dd to fill a tmpfs until
swapping was needed:
http://marc.info/?l=linux-raid&m=128699402805191&w=2

Looking back in the history of raid1.c and bio.c I found the following
interesting parts:

 * the change to allocate more than one bio via bio_clone() is from
2005, but it looks like it was OK back then, because at that point the
fs_bio_set was allocated with 256 entries
 * in 2007 the size of the mempool was changed from 256 to only 2
entries (5972511b77809cb7c9ccdb79b825c54921c5c546 "A single unit is
enough, lets scale it down to 2 just to be on the safe side.")
 * only in 2009 was the comment "To make this work, callers must never
allocate more than 1 bio at a time from this pool. Callers that need
to allocate more than 1 bio must always submit the previously allocated
bio for IO before attempting to allocate a new one. Failure to do so
can cause livelocks under memory pressure." added to bio_alloc(); that
is the basis for my reasoning that raid1.c is broken. (And no such
comment was added to bio_clone(), although both calls use the same
mempool)

So could someone please look into raid1.c to confirm or deny that
using multiple bio_clone() calls (one per drive) before submitting them
together could also cause such deadlocks?
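
To make this concrete, here is a minimal sketch of the allocation order
I am worried about (my own illustration with made-up names, not the
actual raid1.c code, and assuming a two-disk array):

/*
 * Hypothetical sketch only: several bios are cloned from the shared
 * fs_bio_set before any of them is submitted.  With the pool scaled
 * down to 2 entries, a few writers doing this at the same time can
 * each take part of the reserve and then sleep in mempool_alloc(),
 * waiting for bios that will never complete because nothing has been
 * submitted yet.
 */
static void clone_then_submit(struct bio *master, int ndisks)
{
	struct bio *clones[2];
	int i;

	for (i = 0; i < ndisks && i < 2; i++)
		/* may fall back to the fs_bio_set reserve and sleep */
		clones[i] = bio_clone(master, GFP_NOIO);

	/* only now is anything submitted (and later freed back) */
	for (i = 0; i < ndisks && i < 2; i++)
		generic_make_request(clones[i]);
}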

Thanks for looking

Torsten

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-18  4:14           ` Neil Brown
@ 2010-10-18 16:15             ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-18 16:15 UTC (permalink / raw)
  To: Neil Brown
  Cc: Rik van Riel, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, Li, Shaohua

On Mon, Oct 18, 2010 at 12:14:59PM +0800, Neil Brown wrote:
> On Wed, 15 Sep 2010 18:44:34 +1000
> Neil Brown <neilb@suse.de> wrote:
> 
> > On Wed, 15 Sep 2010 16:28:43 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Neil,
> > > 
> > > Sorry for the rushed and imaginary ideas this morning..
> > > 
> > > > @@ -1101,6 +1101,12 @@ static unsigned long shrink_inactive_lis
> > > >  	int lumpy_reclaim = 0;
> > > >  
> > > >  	while (unlikely(too_many_isolated(zone, file, sc))) {
> > > > +		if ((sc->gfp_mask & GFP_IOFS) != GFP_IOFS)
> > > > +			/* Not allowed to do IO, so mustn't wait
> > > > +			 * on processes that might try to
> > > > +			 */
> > > > +			return SWAP_CLUSTER_MAX;
> > > > +
> > > 
> > > The above patch should behavior like this: it returns SWAP_CLUSTER_MAX
> > > to cheat all the way up to believe "enough pages have been reclaimed".
> > > So __alloc_pages_direct_reclaim() see non-zero *did_some_progress and
> > > go on to call get_page_from_freelist(). That normally fails because
> > > the task didn't really scanned the LRU lists. However it does have the
> > > possibility to succeed -- when so many processes are doing concurrent
> > > direct reclaims, it may luckily get one free page reclaimed by other
> > > tasks. What's more, if it does fail to get a free page, the upper
> > > layer __alloc_pages_slowpath() will be repeat recalling
> > > __alloc_pages_direct_reclaim(). So, sooner or later it will succeed in
> > > "stealing" a free page reclaimed by other tasks.
> > > 
> > > In summary, the patch behavior for !__GFP_IO/FS is
> > > - won't do any page reclaim
> > > - won't fail the page allocation (unexpected)
> > > - will wait and steal one free page from others (unreasonable)
> > > 
> > > So it will address the problem you encountered, however it sounds
> > > pretty unexpected and illogical behavior, right?
> > > 
> > > I believe this patch will address the problem equally well.
> > > What do you think?
> > 
> > Thank you for the detailed explanation.  Is agree with your reasoning and
> > now understand why your patch is sufficient.
> > 
> > I will get it tested and let you know how that goes.
> 
> (sorry this has taken a month to follow up).
> 
> Testing shows that this patch seems to work.
> The test load (essentially kernbench) doesn't deadlock any more, though it

Good news, thanks for the test!

> does get bogged down thrashing in swap so it doesn't make a lot more
> progress :-)  I guess that is to be expected.
 
The patch does allow more isolated pages, which may lead to more
pressure on the LRU lists and hence swapping (or vmscan file writes?).

> One observation is that the kernbench generated 10%-20% more context switches
> with the patch than without.  Is that to be expected?

Is that the total number of context switches? It may be due to the increased
swapping as well.

> Do you have plans for sending this patch upstream?

Would you help try the modified patch? It tries to reduce the number
of isolated pages. Hope it helps reduce the thrashing. I also noticed
that the original patch only covers the GFP_NOIO case and missed GFP_NOFS.

Thanks,
Fengguang
---
Subject: mm: Avoid possible deadlock caused by too_many_isolated()
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Wed Sep 15 15:36:19 CST 2010

Neil found that if too_many_isolated() returns true while performing
direct reclaim we can end up waiting for other threads to complete their
direct reclaim.  If those threads are allowed to enter the FS or IO to
free memory, but this thread is not, then it is possible that those
threads will be waiting on this thread and so we get a circular
deadlock.

some task enters direct reclaim with GFP_KERNEL
  => too_many_isolated() false
    => vmscan and run into dirty pages
      => pageout()
        => take some FS lock
	  => fs/block code does GFP_NOIO allocation
	    => enter direct reclaim again
	      => too_many_isolated() true
		=> waiting for others to progress, however the other
		   tasks may be circular waiting for the FS lock..

The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
priority than normal ones, by granting them a higher throttle threshold.

Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to
progress. They will be blocked only when there are too many concurrent
!GFP_IOFS reclaims, but that's very unlikely because the IO-less
direct reclaims are able to progress much faster, and they won't
deadlock with each other. The threshold is raised high enough for them,
so that there can be sufficient parallel progress of !GFP_IOFS reclaims.

Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- linux-next.orig/mm/vmscan.c	2010-10-13 12:35:14.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-10-19 00:13:04.000000000 +0800
@@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
 		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
 	}
 
+	/*
+	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
+	 * they won't get blocked by normal ones and form circular deadlock.
+	 */
+	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
+		inactive >>= 3;
+
 	return isolated > inactive;
 }
 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-18 16:15             ` Wu Fengguang
@ 2010-10-18 21:58               ` Andrew Morton
  -1 siblings, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2010-10-18 21:58 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Neil Brown, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, Li, Shaohua

On Tue, 19 Oct 2010 00:15:04 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Neil find that if too_many_isolated() returns true while performing
> direct reclaim we can end up waiting for other threads to complete their
> direct reclaim.  If those threads are allowed to enter the FS or IO to
> free memory, but this thread is not, then it is possible that those
> threads will be waiting on this thread and so we get a circular
> deadlock.
> 
> some task enters direct reclaim with GFP_KERNEL
>   => too_many_isolated() false
>     => vmscan and run into dirty pages
>       => pageout()
>         => take some FS lock
> 	  => fs/block code does GFP_NOIO allocation
> 	    => enter direct reclaim again
> 	      => too_many_isolated() true
> 		=> waiting for others to progress, however the other
> 		   tasks may be circular waiting for the FS lock..
> 
> The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
> priority than normal ones, by honouring them higher throttle threshold.
> 
> Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to
> progress. They will be blocked only when there are too many concurrent
> !GFP_IOFS reclaims, however that's very unlikely because the IO-less
> direct reclaims is able to progress much more faster, and they won't
> deadlock each other. The threshold is raised high enough for them, so
> that there can be sufficient parallel progress of !GFP_IOFS reclaims.

I'm not sure that this is really a full fix.  Torsten's analysis does
appear to point at the real bug: raid1 has code paths which allocate
more than a single element from a mempool without starting IO against
previous elements.

Giving these allocations the ability to dip further into reserves will
make occurrence of the bug less likely, but if enough threads all do
this at the same time, that reserve will be exhausted and we're back to
square one?
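
For example (a sketch only, based on the two-entry fs_bio_set and the
one-clone-per-mirror behaviour from Torsten's report, and assuming
memory is tight enough that mempool_alloc() has to fall back to the
reserve):

  writer A: bio_clone()  -> takes reserve entry 1 of 2
  writer B: bio_clone()  -> takes reserve entry 2 of 2
  writer A: bio_clone() for its second mirror -> pool empty, sleeps
  writer B: bio_clone() for its second mirror -> pool empty, sleeps
  => nothing has been submitted, so no bio is ever returned to the pool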


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-18 21:58               ` Andrew Morton
@ 2010-10-18 22:31                 ` Neil Brown
  -1 siblings, 0 replies; 116+ messages in thread
From: Neil Brown @ 2010-10-18 22:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, Li, Shaohua

On Mon, 18 Oct 2010 14:58:59 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 19 Oct 2010 00:15:04 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Neil find that if too_many_isolated() returns true while performing
> > direct reclaim we can end up waiting for other threads to complete their
> > direct reclaim.  If those threads are allowed to enter the FS or IO to
> > free memory, but this thread is not, then it is possible that those
> > threads will be waiting on this thread and so we get a circular
> > deadlock.
> > 
> > some task enters direct reclaim with GFP_KERNEL
> >   => too_many_isolated() false
> >     => vmscan and run into dirty pages
> >       => pageout()
> >         => take some FS lock
> > 	  => fs/block code does GFP_NOIO allocation
> > 	    => enter direct reclaim again
> > 	      => too_many_isolated() true
> > 		=> waiting for others to progress, however the other
> > 		   tasks may be circular waiting for the FS lock..
> > 
> > The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
> > priority than normal ones, by honouring them higher throttle threshold.
> > 
> > Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to
> > progress. They will be blocked only when there are too many concurrent
> > !GFP_IOFS reclaims, however that's very unlikely because the IO-less
> > direct reclaims is able to progress much more faster, and they won't
> > deadlock each other. The threshold is raised high enough for them, so
> > that there can be sufficient parallel progress of !GFP_IOFS reclaims.
> 
> I'm not sure that this is really a full fix.  Torsten's analysis does
> appear to point at the real bug: raid1 has code paths which allocate
> more than a single element from a mempool without starting IO against
> previous elements.

... point at "a" real bug.

I think there are two bugs here.
The raid1 bug that Torsten mentions is certainly real (and has been around
for an embarrassingly long time).
The bug that I identified in too_many_isolated is also a real bug and can be
triggered without md/raid1 in the mix.
So this is not a 'full fix' for every bug in the kernel :-), but it could
well be a full fix for this particular bug.

NeilBrown

> 
> Giving these allocations the ability to dip further into reserves will
> make occurrence of the bug less likely, but if enough threads all do
> this at the same time, that reserve will be exhausted and we're back to
> square one?
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-18 22:31                 ` Neil Brown
@ 2010-10-18 22:41                   ` Andrew Morton
  -1 siblings, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2010-10-18 22:41 UTC (permalink / raw)
  To: Neil Brown
  Cc: Wu Fengguang, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, Li, Shaohua

On Tue, 19 Oct 2010 09:31:42 +1100
Neil Brown <neilb@suse.de> wrote:

> On Mon, 18 Oct 2010 14:58:59 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Tue, 19 Oct 2010 00:15:04 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Neil find that if too_many_isolated() returns true while performing
> > > direct reclaim we can end up waiting for other threads to complete their
> > > direct reclaim.  If those threads are allowed to enter the FS or IO to
> > > free memory, but this thread is not, then it is possible that those
> > > threads will be waiting on this thread and so we get a circular
> > > deadlock.
> > > 
> > > some task enters direct reclaim with GFP_KERNEL
> > >   => too_many_isolated() false
> > >     => vmscan and run into dirty pages
> > >       => pageout()
> > >         => take some FS lock
> > > 	  => fs/block code does GFP_NOIO allocation
> > > 	    => enter direct reclaim again
> > > 	      => too_many_isolated() true
> > > 		=> waiting for others to progress, however the other
> > > 		   tasks may be circular waiting for the FS lock..

I'm assuming that the last four "=>"'s here should have been indented
another stop.

> > > The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
> > > priority than normal ones, by honouring them higher throttle threshold.
> > > 
> > > Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to
> > > progress. They will be blocked only when there are too many concurrent
> > > !GFP_IOFS reclaims, however that's very unlikely because the IO-less
> > > direct reclaims is able to progress much more faster, and they won't
> > > deadlock each other. The threshold is raised high enough for them, so
> > > that there can be sufficient parallel progress of !GFP_IOFS reclaims.
> > 
> > I'm not sure that this is really a full fix.  Torsten's analysis does
> > appear to point at the real bug: raid1 has code paths which allocate
> > more than a single element from a mempool without starting IO against
> > previous elements.
> 
> ... point at "a" real bug.
> 
> I think there are two bugs here.
> The raid1 bug that Torsten mentions is certainly real (and has been around
> for an embarrassingly long time).
> The bug that I identified in too_many_isolated is also a real bug and can be
> triggered without md/raid1 in the mix.
> So this is not a 'full fix' for every bug in the kernel :-), but it could
> well be a full fix for this particular bug.
> 

Can we just delete the too_many_isolated() logic?  (Crappy comment
describes what the code does but not why it does it).


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-18 10:58             ` Torsten Kaiser
@ 2010-10-18 23:11               ` Neil Brown
  -1 siblings, 0 replies; 116+ messages in thread
From: Neil Brown @ 2010-10-18 23:11 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Wu Fengguang, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li Shaohua

On Mon, 18 Oct 2010 12:58:17 +0200
Torsten Kaiser <just.for.lkml@googlemail.com> wrote:

> On Mon, Oct 18, 2010 at 6:14 AM, Neil Brown <neilb@suse.de> wrote:
> > Testing shows that this patch seems to work.
> > The test load (essentially kernbench) doesn't deadlock any more, though it
> > does get bogged down thrashing in swap so it doesn't make a lot more
> > progress :-)  I guess that is to be expected.
> 
> I just noticed this thread, as your mail from today pushed it up.
> 
> In your original mail you wrote: " I recently had a customer (running
> 2.6.32) report a deadlock during very intensive IO with lots of
> processes. " and " Some threads that are blocked there, hold some IO
> lock (probably in the filesystem) and are trying to allocate memory
> inside the block device (md/raid1 to be precise) which is allocating
> with GFP_NOIO and has a mempool to fall back on."
> 
> I recently had the same problem (intense IO due to swapstorm created
> by 20 gcc processes hung my system) and after initially blaming the
> workqueue changes in 2.6.36 Tejun Heo determined that my problem was
> not the workqueues getting locked up, but that it was cause by an
> exhausted mempool:
> http://marc.info/?l=linux-kernel&m=128655737012549&w=2
> 
> Instrumenting mm/mempool.c and retrying my workload showed that
> fs_bio_set from fs/bio.c looked like the mempool to blame and the code
> in drivers/md/raid1.c to be the misuser:
> http://marc.info/?l=linux-kernel&m=128671179817823&w=2
> 
> I was even able to reproduce this hang with only using a normal RAID1
> md device as swapspace and then using dd to fill a tmpfs until
> swapping was needed:
> http://marc.info/?l=linux-raid&m=128699402805191&w=2
> 
> Looking back in the history of raid1.c and bio.c I found the following
> interesting parts:
> 
>  * the change to allocate more then one bio via bio_clone() is from
> 2005, but it looks like it was OK back then, because at that point the
> fs_bio_set was allocation 256 entries
>  * in 2007 the size of the mempool was changed from 256 to only 2
> entries (5972511b77809cb7c9ccdb79b825c54921c5c546 "A single unit is
> enough, lets scale it down to 2 just to be on the safe side.")
>  * only in 2009 the comment "To make this work, callers must never
> allocate more than 1 bio at the time from this pool. Callers that need
> to allocate more than 1 bio must always submit the previously allocate
> bio for IO before attempting to allocate a new one. Failure to do so
> can cause livelocks under memory pressure." was added to bio_alloc()
> that is the base from my reasoning that raid1.c is broken. (And such a
> comment was not added to bio_clone() although both calls use the same
> mempool)
> 
> So could please look someone into raid1.c to confirm or deny that
> using multiple bio_clone() (one per drive) before submitting them
> together could also cause such deadlocks?
> 
> Thank for looking
> 
> Torsten

Yes, thanks for the report.
This is a real bug exactly as you describe.

This is how I think I will fix it, though it needs a bit of review and
testing before I can be certain.
Also I need to check raid10 etc to see if they can suffer too.

If you can test it I would really appreciate it.

Thanks,
NeilBrown



diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d44a50f..8122dde 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -784,7 +784,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	int i, targets = 0, disks;
 	struct bitmap *bitmap;
 	unsigned long flags;
-	struct bio_list bl;
 	struct page **behind_pages = NULL;
 	const int rw = bio_data_dir(bio);
 	const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
@@ -892,13 +891,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	 * bios[x] to bio
 	 */
 	disks = conf->raid_disks;
-#if 0
-	{ static int first=1;
-	if (first) printk("First Write sector %llu disks %d\n",
-			  (unsigned long long)r1_bio->sector, disks);
-	first = 0;
-	}
-#endif
  retry_write:
 	blocked_rdev = NULL;
 	rcu_read_lock();
@@ -956,14 +948,15 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	    (behind_pages = alloc_behind_pages(bio)) != NULL)
 		set_bit(R1BIO_BehindIO, &r1_bio->state);
 
-	atomic_set(&r1_bio->remaining, 0);
+	atomic_set(&r1_bio->remaining, targets);
 	atomic_set(&r1_bio->behind_remaining, 0);
 
 	do_barriers = bio->bi_rw & REQ_HARDBARRIER;
 	if (do_barriers)
 		set_bit(R1BIO_Barrier, &r1_bio->state);
 
-	bio_list_init(&bl);
+	bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
+				test_bit(R1BIO_BehindIO, &r1_bio->state));
 	for (i = 0; i < disks; i++) {
 		struct bio *mbio;
 		if (!r1_bio->bios[i])
@@ -995,30 +988,18 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 				atomic_inc(&r1_bio->behind_remaining);
 		}
 
-		atomic_inc(&r1_bio->remaining);
-
-		bio_list_add(&bl, mbio);
+		spin_lock_irqsave(&conf->device_lock, flags);
+		bio_list_add(&conf->pending_bio_list, mbio);
+		blk_plug_device(mddev->queue);
+		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 	kfree(behind_pages); /* the behind pages are attached to the bios now */
 
-	bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
-				test_bit(R1BIO_BehindIO, &r1_bio->state));
-	spin_lock_irqsave(&conf->device_lock, flags);
-	bio_list_merge(&conf->pending_bio_list, &bl);
-	bio_list_init(&bl);
-
-	blk_plug_device(mddev->queue);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
-
 	/* In case raid1d snuck into freeze_array */
 	wake_up(&conf->wait_barrier);
 
 	if (do_sync)
 		md_wakeup_thread(mddev->thread);
-#if 0
-	while ((bio = bio_list_pop(&bl)) != NULL)
-		generic_make_request(bio);
-#endif
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-18 22:41                   ` Andrew Morton
@ 2010-10-19  0:57                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 116+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  0:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Neil Brown, Wu Fengguang, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

> > I think there are two bugs here.
> > The raid1 bug that Torsten mentions is certainly real (and has been around
> > for an embarrassingly long time).
> > The bug that I identified in too_many_isolated is also a real bug and can be
> > triggered without md/raid1 in the mix.
> > So this is not a 'full fix' for every bug in the kernel :-), but it could
> > well be a full fix for this particular bug.
> > 
> 
> Can we just delete the too_many_isolated() logic?  (Crappy comment
> describes what the code does but not why it does it).

If my memory is correct, we got a bug report about 1-2 years ago that LTP
could trigger mysterious OOM killer invocations: when too many processes are
in the reclaim path, all of the reclaimable pages can be isolated, so the
last reclaimer finds that the system has no reclaimable pages left and ends
up invoking the OOM killer. We had a strong motivation to avoid false-positive
OOMs, and some discussion then produced this patch.
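
Roughly, the scenario was (a simplified sketch from memory, not the
exact code path):

many tasks enter direct reclaim at the same time
  => each one isolates a batch of pages from the inactive lists
    => a late arrival finds the lists empty (everything is isolated)
      => it reclaims nothing, so did_some_progress ends up 0
        => __alloc_pages_slowpath() treats that as "nothing left to reclaim"
          => the OOM killer fires, although the isolated pages would have
             been put back (or freed) a moment later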

If my memory is incorrect, I hope Wu or Rik will correct me.




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  0:57                     ` KOSAKI Motohiro
@ 2010-10-19  1:15                       ` Minchan Kim
  -1 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-19  1:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Neil Brown, Wu Fengguang, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Tue, Oct 19, 2010 at 9:57 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> > I think there are two bugs here.
>> > The raid1 bug that Torsten mentions is certainly real (and has been around
>> > for an embarrassingly long time).
>> > The bug that I identified in too_many_isolated is also a real bug and can be
>> > triggered without md/raid1 in the mix.
>> > So this is not a 'full fix' for every bug in the kernel :-), but it could
>> > well be a full fix for this particular bug.
>> >
>>
>> Can we just delete the too_many_isolated() logic?  (Crappy comment
>> describes what the code does but not why it does it).
>
> if my remember is correct, we got bug report that LTP may makes misterious
> OOM killer invocation about 1-2 years ago. because, if too many parocess are in
> reclaim path, all of reclaimable pages can be isolated and last reclaimer found
> the system don't have any reclaimable pages and lead to invoke OOM killer.
> We have strong motivation to avoid false positive oom. then, some discusstion
> made this patch.
>
> if my remember is incorrect, I hope Wu or Rik fix me.

AFAIR, it's right.

How about this?

It throttles more aggressively than the old code (i.e. it works at zone
granularity rather than per LRU type), but I think it can prevent the
unnecessary OOM problem and solve the deadlock problem.


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f12ad18..acd6a65 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1961,6 +1961,21 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
        return alloc_flags;
 }

+/*
+ * Are there way too many processes are reclaiming this zone?
+ */
+static int too_many_isolated_zone(struct zone *zone)
+{
+       unsigned long inactive, isolated;
+
+       inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
+               zone_page_state(zone, NR_INACTIVE_ANON);
+       isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
+               zone_page_state(zone, NR_ISOLATED_ANON);
+
+       return isolated > inactive;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
        struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2054,10 +2069,11 @@ rebalance:
                goto got_pg;

        /*
-        * If we failed to make any progress reclaiming, then we are
-        * running out of options and have to consider going OOM
+        * If we failed to make any progress reclaiming and there aren't
+        * many parallel reclaiming, then we are running out of options and
+        * have to consider going OOM
         */
-       if (!did_some_progress) {
+       if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
                if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
                        if (oom_killer_disabled)
                                goto nopage;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..f2109af 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,31 +1129,6 @@ int isolate_lru_page(struct page *page)
 }

 /*
- * Are there way too many processes in the direct reclaim path already?
- */
-static int too_many_isolated(struct zone *zone, int file,
-               struct scan_control *sc)
-{
-       unsigned long inactive, isolated;
-
-       if (current_is_kswapd())
-               return 0;
-
-       if (!scanning_global_lru(sc))
-               return 0;
-
-       if (file) {
-               inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-               isolated = zone_page_state(zone, NR_ISOLATED_FILE);
-       } else {
-               inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
-       }
-
-       return isolated > inactive;
-}
-
-/*
  * TODO: Try merging with migrations version of putback_lru_pages
  */
 static noinline_for_stack void
@@ -1290,15 +1265,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
        unsigned long nr_anon;
        unsigned long nr_file;

-       while (unlikely(too_many_isolated(zone, file, sc))) {
-               congestion_wait(BLK_RW_ASYNC, HZ/10);
-
-               /* We are about to die and free our memory. Return now. */
-               if (fatal_signal_pending(current))
-                       return SWAP_CLUSTER_MAX;
-       }
-
-
        lru_add_drain();
        spin_lock_irq(&zone->lru_lock);




-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  1:15                       ` Minchan Kim
@ 2010-10-19  1:21                         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 116+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  1:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, Andrew Morton, Neil Brown, Wu Fengguang,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

> On Tue, Oct 19, 2010 at 9:57 AM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> > I think there are two bugs here.
> >> > The raid1 bug that Torsten mentions is certainly real (and has been around
> >> > for an embarrassingly long time).
> >> > The bug that I identified in too_many_isolated is also a real bug and can be
> >> > triggered without md/raid1 in the mix.
> >> > So this is not a 'full fix' for every bug in the kernel :-), but it could
> >> > well be a full fix for this particular bug.
> >> >
> >>
> >> Can we just delete the too_many_isolated() logic?  (Crappy comment
> >> describes what the code does but not why it does it).
> >
> > if my remember is correct, we got bug report that LTP may makes misterious
> > OOM killer invocation about 1-2 years ago. because, if too many parocess are in
> > reclaim path, all of reclaimable pages can be isolated and last reclaimer found
> > the system don't have any reclaimable pages and lead to invoke OOM killer.
> > We have strong motivation to avoid false positive oom. then, some discusstion
> > made this patch.
> >
> > if my remember is incorrect, I hope Wu or Rik fix me.
> 
> AFAIR, it's right.
> 
> How about this?
> 
> It's rather aggressive throttling than old(ie, it considers not lru
> type granularity but zone )
> But I think it can prevent unnecessary OOM problem and solve deadlock problem.

Can you please elaborate your intention? Do you think Wu's approach is wrong?




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  1:21                         ` KOSAKI Motohiro
@ 2010-10-19  1:32                           ` Minchan Kim
  -1 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-19  1:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Neil Brown, Wu Fengguang, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Tue, Oct 19, 2010 at 10:21 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> On Tue, Oct 19, 2010 at 9:57 AM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> >> > I think there are two bugs here.
>> >> > The raid1 bug that Torsten mentions is certainly real (and has been around
>> >> > for an embarrassingly long time).
>> >> > The bug that I identified in too_many_isolated is also a real bug and can be
>> >> > triggered without md/raid1 in the mix.
>> >> > So this is not a 'full fix' for every bug in the kernel :-), but it could
>> >> > well be a full fix for this particular bug.
>> >> >
>> >>
>> >> Can we just delete the too_many_isolated() logic?  (Crappy comment
>> >> describes what the code does but not why it does it).
>> >
>> > if my remember is correct, we got bug report that LTP may makes misterious
>> > OOM killer invocation about 1-2 years ago. because, if too many parocess are in
>> > reclaim path, all of reclaimable pages can be isolated and last reclaimer found
>> > the system don't have any reclaimable pages and lead to invoke OOM killer.
>> > We have strong motivation to avoid false positive oom. then, some discusstion
>> > made this patch.
>> >
>> > if my remember is incorrect, I hope Wu or Rik fix me.
>>
>> AFAIR, it's right.
>>
>> How about this?
>>
>> It's rather aggressive throttling than old(ie, it considers not lru
>> type granularity but zone )
>> But I think it can prevent unnecessary OOM problem and solve deadlock problem.
>
> Can you please elaborate your intention? Do you think Wu's approach is wrong?

No. I think Wu's patch may work well. But I agree with Andrew:
couldn't we just remove the too_many_isolated logic? If we could, we would solve
the problem simply.
But if we remove the logic, we will meet the old problem again.
So my patch's intention is to prevent both the OOM and the deadlock problem with
a simple patch, without adding a new heuristic to too_many_isolated.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  1:32                           ` Minchan Kim
@ 2010-10-19  2:03                             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 116+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  2:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, Andrew Morton, Neil Brown, Wu Fengguang,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

> On Tue, Oct 19, 2010 at 10:21 AM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> On Tue, Oct 19, 2010 at 9:57 AM, KOSAKI Motohiro
> >> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> >> > I think there are two bugs here.
> >> >> > The raid1 bug that Torsten mentions is certainly real (and has been around
> >> >> > for an embarrassingly long time).
> >> >> > The bug that I identified in too_many_isolated is also a real bug and can be
> >> >> > triggered without md/raid1 in the mix.
> >> >> > So this is not a 'full fix' for every bug in the kernel :-), but it could
> >> >> > well be a full fix for this particular bug.
> >> >> >
> >> >>
> >> >> Can we just delete the too_many_isolated() logic?  (Crappy comment
> >> >> describes what the code does but not why it does it).
> >> >
> >> > if my remember is correct, we got bug report that LTP may makes misterious
> >> > OOM killer invocation about 1-2 years ago. because, if too many parocess are in
> >> > reclaim path, all of reclaimable pages can be isolated and last reclaimer found
> >> > the system don't have any reclaimable pages and lead to invoke OOM killer.
> >> > We have strong motivation to avoid false positive oom. then, some discusstion
> >> > made this patch.
> >> >
> >> > if my remember is incorrect, I hope Wu or Rik fix me.
> >>
> >> AFAIR, it's right.
> >>
> >> How about this?
> >>
> >> It's rather aggressive throttling than old(ie, it considers not lru
> >> type granularity but zone )
> >> But I think it can prevent unnecessary OOM problem and solve deadlock problem.
> >
> > Can you please elaborate your intention? Do you think Wu's approach is wrong?
> 
> No. I think Wu's patch may work well. But I agree Andrew.
> Couldn't we remove the too_many_isolated logic? If it is, we can solve
> the problem simply.
> But If we remove the logic, we will meet long time ago problem, again.
> So my patch's intention is to prevent OOM and deadlock problem with
> simple patch without adding new heuristic in too_many_isolated.

But your patch has a much higher false positive/negative chance, because the pages
are isolated far away, in both time and code, from the too_many_isolated_zone() call site.
So, unless anyone says Wu's one is wrong, I prefer his.





^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  2:03                             ` KOSAKI Motohiro
@ 2010-10-19  2:16                               ` Minchan Kim
  -1 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-19  2:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Neil Brown, Wu Fengguang, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Tue, Oct 19, 2010 at 11:03 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> On Tue, Oct 19, 2010 at 10:21 AM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> >> On Tue, Oct 19, 2010 at 9:57 AM, KOSAKI Motohiro
>> >> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> >> >> > I think there are two bugs here.
>> >> >> > The raid1 bug that Torsten mentions is certainly real (and has been around
>> >> >> > for an embarrassingly long time).
>> >> >> > The bug that I identified in too_many_isolated is also a real bug and can be
>> >> >> > triggered without md/raid1 in the mix.
>> >> >> > So this is not a 'full fix' for every bug in the kernel :-), but it could
>> >> >> > well be a full fix for this particular bug.
>> >> >> >
>> >> >>
>> >> >> Can we just delete the too_many_isolated() logic?  (Crappy comment
>> >> >> describes what the code does but not why it does it).
>> >> >
>> >> > if my remember is correct, we got bug report that LTP may makes misterious
>> >> > OOM killer invocation about 1-2 years ago. because, if too many parocess are in
>> >> > reclaim path, all of reclaimable pages can be isolated and last reclaimer found
>> >> > the system don't have any reclaimable pages and lead to invoke OOM killer.
>> >> > We have strong motivation to avoid false positive oom. then, some discusstion
>> >> > made this patch.
>> >> >
>> >> > if my remember is incorrect, I hope Wu or Rik fix me.
>> >>
>> >> AFAIR, it's right.
>> >>
>> >> How about this?
>> >>
>> >> It's rather aggressive throttling than old(ie, it considers not lru
>> >> type granularity but zone )
>> >> But I think it can prevent unnecessary OOM problem and solve deadlock problem.
>> >
>> > Can you please elaborate your intention? Do you think Wu's approach is wrong?
>>
>> No. I think Wu's patch may work well. But I agree Andrew.
>> Couldn't we remove the too_many_isolated logic? If it is, we can solve
>> the problem simply.
>> But If we remove the logic, we will meet long time ago problem, again.
>> So my patch's intention is to prevent OOM and deadlock problem with
>> simple patch without adding new heuristic in too_many_isolated.
>
> But your patch is much false positive/negative chance because isolated pages timing
> and too_many_isolated_zone() call site are in far distance place.

Yes.
How about having the returned *did_some_progress imply a too_many_isolated
failure, by using the MSB or a new variable?
Then the page allocator can check whether the failure came from a real reclaim
failure or from parallel reclaim.
The point is: let's throttle without holding FS/IO locks.
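Something like this rough sketch (the flag name and exact call sites are made
up, just to show the idea):

/* Hypothetical flag stuffed into the top bit of *did_some_progress. */
#define DR_TOO_MANY_ISOLATED	(1UL << (BITS_PER_LONG - 1))

	/* In the direct reclaim path: report the condition instead of
	 * sleeping there with FS/IO locks possibly held. */
	if (too_many_isolated_zone(zone))
		progress |= DR_TOO_MANY_ISOLATED;

	/* In __alloc_pages_slowpath(): throttle here, where no FS/IO
	 * locks are held, then retry. */
	if (did_some_progress & DR_TOO_MANY_ISOLATED) {
		did_some_progress &= ~DR_TOO_MANY_ISOLATED;
		congestion_wait(BLK_RW_ASYNC, HZ/50);
		goto rebalance;
	}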

> So, if anyone don't say Wu's one is wrong, I like his one.
>

I am not against it; I just want to solve the problem without adding new logic.



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-18 22:41                   ` Andrew Morton
@ 2010-10-19  2:24                     ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-19  2:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Neil Brown, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, Li, Shaohua

On Tue, Oct 19, 2010 at 06:41:37AM +0800, Andrew Morton wrote:
> On Tue, 19 Oct 2010 09:31:42 +1100
> Neil Brown <neilb@suse.de> wrote:
> 
> > On Mon, 18 Oct 2010 14:58:59 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > On Tue, 19 Oct 2010 00:15:04 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > Neil find that if too_many_isolated() returns true while performing
> > > > direct reclaim we can end up waiting for other threads to complete their
> > > > direct reclaim.  If those threads are allowed to enter the FS or IO to
> > > > free memory, but this thread is not, then it is possible that those
> > > > threads will be waiting on this thread and so we get a circular
> > > > deadlock.
> > > > 
> > > > some task enters direct reclaim with GFP_KERNEL
> > > >   => too_many_isolated() false
> > > >     => vmscan and run into dirty pages
> > > >       => pageout()
> > > >         => take some FS lock
> > > > 	  => fs/block code does GFP_NOIO allocation
> > > > 	    => enter direct reclaim again
> > > > 	      => too_many_isolated() true
> > > > 		=> waiting for others to progress, however the other
> > > > 		   tasks may be circular waiting for the FS lock..
> 
> I'm assuming that the last four "=>"'s here should have been indented
> another stop.

Yup. I'll fix it in the next post.

> > > > The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
> > > > priority than normal ones, by honouring them higher throttle threshold.
> > > > 
> > > > Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to
> > > > progress. They will be blocked only when there are too many concurrent
> > > > !GFP_IOFS reclaims, however that's very unlikely because the IO-less
> > > > direct reclaims is able to progress much more faster, and they won't
> > > > deadlock each other. The threshold is raised high enough for them, so
> > > > that there can be sufficient parallel progress of !GFP_IOFS reclaims.
> > > 
> > > I'm not sure that this is really a full fix.  Torsten's analysis does
> > > appear to point at the real bug: raid1 has code paths which allocate
> > > more than a single element from a mempool without starting IO against
> > > previous elements.
> > 
> > ... point at "a" real bug.
> > 
> > I think there are two bugs here.
> > The raid1 bug that Torsten mentions is certainly real (and has been around
> > for an embarrassingly long time).
> > The bug that I identified in too_many_isolated is also a real bug and can be
> > triggered without md/raid1 in the mix.
> > So this is not a 'full fix' for every bug in the kernel :-),

> > but it could well be a full fix for this particular bug.

Yeah it aims to be a full fix for one bug.
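
For reference, the idea in the patch is to lower the throttle threshold only
for reclaimers that may enter the FS or IO, roughly like this in
too_many_isolated() (sketch, not the exact hunk):

	/*
	 * Full GFP_KERNEL-style reclaims are throttled earlier, so
	 * GFP_NOIO/GFP_NOFS reclaimers effectively see a higher threshold
	 * and are never stuck waiting behind them.
	 */
	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
		inactive >>= 3;

	return isolated > inactive;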

> Can we just delete the too_many_isolated() logic?  (Crappy comment

If the two cond_resched() calls can be removed from
shrink_page_list(), the major cause of too many pages being
isolated will be gone. However, the writeback-waiting logic after
should_reclaim_stall() will also block the direct reclaimer for a long
time with pages isolated, which may bite under pathological conditions.

> describes what the code does but not why it does it).

Good point. The comment could be improved as follows.

Thanks,
Fengguang

---
Subject: vmscan: comment too_many_isolated()
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Tue Oct 19 09:53:23 CST 2010

Comment "Why it's doing so" rather than "What it does"
as proposed by Andrew Morton.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/vmscan.c	2010-10-19 09:29:44.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-10-19 10:21:41.000000000 +0800
@@ -1142,7 +1142,11 @@ int isolate_lru_page(struct page *page)
 }
 
 /*
- * Are there way too many processes in the direct reclaim path already?
+ * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
+ * then get rescheduled. When there are a massive number of tasks doing page
+ * allocation, such sleeping direct reclaimers may keep piling up on each CPU;
+ * the LRU list will shrink and be scanned faster than necessary, leading to
+ * unnecessary swapping, thrashing and OOM.
  */
 static int too_many_isolated(struct zone *zone, int file,
 		struct scan_control *sc)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  1:15                       ` Minchan Kim
@ 2010-10-19  2:35                         ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-19  2:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Andrew Morton, Neil Brown, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

> @@ -2054,10 +2069,11 @@ rebalance:
>                 goto got_pg;
> 
>         /*
> -        * If we failed to make any progress reclaiming, then we are
> -        * running out of options and have to consider going OOM
> +        * If we failed to make any progress reclaiming and there aren't
> +        * many parallel reclaiming, then we are unning out of options and
> +        * have to consider going OOM
>          */
> -       if (!did_some_progress) {
> +       if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>                         if (oom_killer_disabled)
>                                 goto nopage;

This is simply wrong.

It disables this block on 99% of systems because there won't be enough
tasks to make (!too_many_isolated_zone == true). As a result the LRU
will be scanned like mad and no task gets OOMed when it should be.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  2:24                     ` Wu Fengguang
@ 2010-10-19  2:37                       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 116+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  2:37 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Andrew Morton, Neil Brown, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

> ---
> Subject: vmscan: comment too_many_isolated()
> From: Wu Fengguang <fengguang.wu@intel.com>
> Date: Tue Oct 19 09:53:23 CST 2010
> 
> Comment "Why it's doing so" rather than "What it does"
> as proposed by Andrew Morton.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/vmscan.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> --- linux-next.orig/mm/vmscan.c	2010-10-19 09:29:44.000000000 +0800
> +++ linux-next/mm/vmscan.c	2010-10-19 10:21:41.000000000 +0800
> @@ -1142,7 +1142,11 @@ int isolate_lru_page(struct page *page)
>  }
>  
>  /*
> - * Are there way too many processes in the direct reclaim path already?
> + * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
> + * then get resheduled. When there are massive number of tasks doing page
> + * allocation, such sleeping direct reclaimers may keep piling up on each CPU,
> + * the LRU list will go small and be scanned faster than necessary, leading to
> + * unnecessary swapping, thrashing and OOM.
>   */
>  static int too_many_isolated(struct zone *zone, int file,
>  		struct scan_control *sc)

nice!
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  2:24                     ` Wu Fengguang
@ 2010-10-19  2:37                       ` Minchan Kim
  -1 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-19  2:37 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Neil Brown, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Tue, Oct 19, 2010 at 11:24 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Tue, Oct 19, 2010 at 06:41:37AM +0800, Andrew Morton wrote:
>> On Tue, 19 Oct 2010 09:31:42 +1100
>> Neil Brown <neilb@suse.de> wrote:
>>
>> > On Mon, 18 Oct 2010 14:58:59 -0700
>> > Andrew Morton <akpm@linux-foundation.org> wrote:
>> >
>> > > On Tue, 19 Oct 2010 00:15:04 +0800
>> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > >
>> > > > Neil find that if too_many_isolated() returns true while performing
>> > > > direct reclaim we can end up waiting for other threads to complete their
>> > > > direct reclaim.  If those threads are allowed to enter the FS or IO to
>> > > > free memory, but this thread is not, then it is possible that those
>> > > > threads will be waiting on this thread and so we get a circular
>> > > > deadlock.
>> > > >
>> > > > some task enters direct reclaim with GFP_KERNEL
>> > > >   => too_many_isolated() false
>> > > >     => vmscan and run into dirty pages
>> > > >       => pageout()
>> > > >         => take some FS lock
>> > > >           => fs/block code does GFP_NOIO allocation
>> > > >             => enter direct reclaim again
>> > > >               => too_many_isolated() true
>> > > >                 => waiting for others to progress, however the other
>> > > >                    tasks may be circular waiting for the FS lock..
>>
>> I'm assuming that the last four "=>"'s here should have been indented
>> another stop.
>
> Yup. I'll fix it in next post.
>
>> > > > The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
>> > > > priority than normal ones, by honouring them higher throttle threshold.
>> > > >
>> > > > Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to
>> > > > progress. They will be blocked only when there are too many concurrent
>> > > > !GFP_IOFS reclaims, however that's very unlikely because the IO-less
>> > > > direct reclaims is able to progress much more faster, and they won't
>> > > > deadlock each other. The threshold is raised high enough for them, so
>> > > > that there can be sufficient parallel progress of !GFP_IOFS reclaims.
>> > >
>> > > I'm not sure that this is really a full fix.  Torsten's analysis does
>> > > appear to point at the real bug: raid1 has code paths which allocate
>> > > more than a single element from a mempool without starting IO against
>> > > previous elements.
>> >
>> > ... point at "a" real bug.
>> >
>> > I think there are two bugs here.
>> > The raid1 bug that Torsten mentions is certainly real (and has been around
>> > for an embarrassingly long time).
>> > The bug that I identified in too_many_isolated is also a real bug and can be
>> > triggered without md/raid1 in the mix.
>> > So this is not a 'full fix' for every bug in the kernel :-),
>
>> > but it could well be a full fix for this particular bug.
>
> Yeah it aims to be a full fix for one bug.
>
>> Can we just delete the too_many_isolated() logic?  (Crappy comment
>
> If the two cond_resched() calls can be removed from
> shrink_page_list(), the major cause of too many pages being
> isolated will be gone. However the writeback-waiting logic after
> should_reclaim_stall() will also block the direct reclaimer for long
> time with pages isolated, which may bite under pathological conditions.
>
>> describes what the code does but not why it does it).
>
> Good point. The comment could be improved as follows.
>
> Thanks,
> Fengguang
>
> ---
> Subject: vmscan: comment too_many_isolated()
> From: Wu Fengguang <fengguang.wu@intel.com>
> Date: Tue Oct 19 09:53:23 CST 2010
>
> Comment "Why it's doing so" rather than "What it does"
> as proposed by Andrew Morton.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  2:35                         ` Wu Fengguang
@ 2010-10-19  2:52                           ` Minchan Kim
  -1 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-19  2:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Andrew Morton, Neil Brown, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

Hi Wu,

On Tue, Oct 19, 2010 at 11:35 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> @@ -2054,10 +2069,11 @@ rebalance:
>>                 goto got_pg;
>>
>>         /*
>> -        * If we failed to make any progress reclaiming, then we are
>> -        * running out of options and have to consider going OOM
>> +        * If we failed to make any progress reclaiming and there aren't
>> +        * many parallel reclaiming, then we are unning out of options and
>> +        * have to consider going OOM
>>          */
>> -       if (!did_some_progress) {
>> +       if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
>>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>>                         if (oom_killer_disabled)
>>                                 goto nopage;
>
> This is simply wrong.
>
> It disabled this block for 99% system because there won't be enough
> tasks to make (!too_many_isolated_zone == true). As a result the LRU
> will be scanned like mad and no task get OOMed when it should be.

If !too_many_isolated_zone is false, it means there are already many
tasks in direct reclaim.
So they can exit the reclaim path, and then !too_many_isolated_zone will become true.
What am I missing?


> Thanks,
> Fengguang
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  2:16                               ` Minchan Kim
@ 2010-10-19  2:54                                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 116+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  2:54 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, Andrew Morton, Neil Brown, Wu Fengguang,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

> >> > Can you please elaborate your intention? Do you think Wu's approach is wrong?
> >>
> >> No. I think Wu's patch may work well. But I agree Andrew.
> >> Couldn't we remove the too_many_isolated logic? If it is, we can solve
> >> the problem simply.
> >> But If we remove the logic, we will meet long time ago problem, again.
> >> So my patch's intention is to prevent OOM and deadlock problem with
> >> simple patch without adding new heuristic in too_many_isolated.
> >
> > But your patch is much false positive/negative chance because isolated pages timing
> > and too_many_isolated_zone() call site are in far distance place.
> 
> Yes.
> How about the returning *did_some_progress can imply too_many_isolated
> fail by using MSB or new variable?
> Then, page_allocator can check it whether it causes read reclaim fail
> or parallel reclaim.
> The point is let's throttle without holding FS/IO lock.

Wu's version sleeps in shrink_inactive_list(); your version sleeps in __alloc_pages_slowpath()
via wait_iff_congested(). Neither releases the lock, I think.
But if alloc_pages() returns failure for GFP_NOIO allocations, we introduce another issue.
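
To make the contrast concrete (rough sketch only, with congestion_wait()
standing in for whichever throttle call each patch actually uses):

	/* Wu: sleep inside shrink_inactive_list(), where the caller may
	 * already hold FS/IO locks unless the gfp-based threshold exempts it. */
	while (unlikely(too_many_isolated(zone, file, sc)))
		congestion_wait(BLK_RW_ASYNC, HZ/10);

	/* Minchan: report the condition and sleep in __alloc_pages_slowpath(),
	 * where no FS/IO locks are held, before deciding about OOM. */
	if (!did_some_progress && too_many_isolated_zone(preferred_zone)) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		goto rebalance;
	}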




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  2:52                           ` Minchan Kim
@ 2010-10-19  3:05                             ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-19  3:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Andrew Morton, Neil Brown, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Tue, Oct 19, 2010 at 10:52:47AM +0800, Minchan Kim wrote:
> Hi Wu,
> 
> On Tue, Oct 19, 2010 at 11:35 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> @@ -2054,10 +2069,11 @@ rebalance:
> >>                 goto got_pg;
> >>
> >>         /*
> >> -        * If we failed to make any progress reclaiming, then we are
> >> -        * running out of options and have to consider going OOM
> >> +        * If we failed to make any progress reclaiming and there aren't
> >> +        * many parallel reclaiming, then we are unning out of options and
> >> +        * have to consider going OOM
> >>          */
> >> -       if (!did_some_progress) {
> >> +       if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
> >>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >>                         if (oom_killer_disabled)
> >>                                 goto nopage;
> >
> > This is simply wrong.
> >
> > It disabled this block for 99% system because there won't be enough
> > tasks to make (!too_many_isolated_zone == true). As a result the LRU
> > will be scanned like mad and no task get OOMed when it should be.
> 
> If !too_many_isolated_zone is false, it means there are already many
> direct reclaiming tasks.
> So they could exit reclaim path and !too_many_isolated_zone will be true.
> What am I missing now?

Ah sorry, my brain got short-circuited.. but I still feel uneasy with
this change. It's not fixing the root cause and won't prevent too many
LRU pages being isolated. It's too late to test too_many_isolated_zone()
after direct reclaim returns (after sleeping for a long time).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  3:05                             ` Wu Fengguang
@ 2010-10-19  3:09                               ` Minchan Kim
  -1 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-19  3:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Andrew Morton, Neil Brown, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Tue, Oct 19, 2010 at 12:05 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Tue, Oct 19, 2010 at 10:52:47AM +0800, Minchan Kim wrote:
>> Hi Wu,
>>
>> On Tue, Oct 19, 2010 at 11:35 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >> @@ -2054,10 +2069,11 @@ rebalance:
>> >>                 goto got_pg;
>> >>
>> >>         /*
>> >> -        * If we failed to make any progress reclaiming, then we are
>> >> -        * running out of options and have to consider going OOM
>> >> +        * If we failed to make any progress reclaiming and there aren't
>> >> +        * many parallel reclaiming, then we are unning out of options and
>> >> +        * have to consider going OOM
>> >>          */
>> >> -       if (!did_some_progress) {
>> >> +       if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
>> >>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>> >>                         if (oom_killer_disabled)
>> >>                                 goto nopage;
>> >
>> > This is simply wrong.
>> >
>> > It disabled this block for 99% system because there won't be enough
>> > tasks to make (!too_many_isolated_zone == true). As a result the LRU
>> > will be scanned like mad and no task get OOMed when it should be.
>>
>> If !too_many_isolated_zone is false, it means there are already many
>> direct reclaiming tasks.
>> So they could exit reclaim path and !too_many_isolated_zone will be true.
>> What am I missing now?
>
> Ah sorry, my brain get short circuited.. but I still feel uneasy with
> this change. It's not fixing the root cause and won't prevent too many
> LRU pages be isolated. It's too late to test too_many_isolated_zone()
> after direct reclaim returns (after sleeping for a long time).
>

I tend to agree.
I think the root cause is the infinite looping on too_many_isolated() while holding an FS lock.
Would it be simpler to have too_many_isolated() bail out after some number of tries?
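Something along these lines, as a minimal sketch against the existing throttle loop in
shrink_inactive_list() (the limit of 5 is arbitrary, and returning SWAP_CLUSTER_MAX just
mirrors the other workaround patches in this thread so the caller doesn't immediately
conclude OOM):

	int tries = 0;

	while (unlikely(too_many_isolated(zone, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;

		/*
		 * Bail out after a few rounds so a task that entered
		 * reclaim while holding an FS/IO lock cannot be stuck
		 * here forever behind reclaimers that need that lock.
		 */
		if (++tries > 5)
			return SWAP_CLUSTER_MAX;
	}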


> Thanks,
> Fengguang
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  3:09                               ` Minchan Kim
@ 2010-10-19  3:13                                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 116+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  3:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, Wu Fengguang, Andrew Morton, Neil Brown,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

> On Tue, Oct 19, 2010 at 12:05 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Tue, Oct 19, 2010 at 10:52:47AM +0800, Minchan Kim wrote:
> >> Hi Wu,
> >>
> >> On Tue, Oct 19, 2010 at 11:35 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >> @@ -2054,10 +2069,11 @@ rebalance:
> >> >>                 goto got_pg;
> >> >>
> >> >>         /*
> >> >> -        * If we failed to make any progress reclaiming, then we are
> >> >> -        * running out of options and have to consider going OOM
> >> >> +        * If we failed to make any progress reclaiming and there aren't
> >> >> +        * many parallel reclaiming, then we are unning out of options and
> >> >> +        * have to consider going OOM
> >> >>          */
> >> >> -       if (!did_some_progress) {
> >> >> +       if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
> >> >>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >> >>                         if (oom_killer_disabled)
> >> >>                                 goto nopage;
> >> >
> >> > This is simply wrong.
> >> >
> >> > It disabled this block for 99% system because there won't be enough
> >> > tasks to make (!too_many_isolated_zone == true). As a result the LRU
> >> > will be scanned like mad and no task get OOMed when it should be.
> >>
> >> If !too_many_isolated_zone is false, it means there are already many
> >> direct reclaiming tasks.
> >> So they could exit reclaim path and !too_many_isolated_zone will be true.
> >> What am I missing now?
> >
> > Ah sorry, my brain get short circuited.. but I still feel uneasy with
> > this change. It's not fixing the root cause and won't prevent too many
> > LRU pages be isolated. It's too late to test too_many_isolated_zone()
> > after direct reclaim returns (after sleeping for a long time).
> >
> 
> Intend to agree.
> I think root cause is a infinite looping in too_many_isolated holding FS lock.
> Would it be simple that too_many_isolated would be bail out after some try?

How?
A lot of callers don't have good recovery logic when a memory allocation failure occurs.




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  3:09                               ` Minchan Kim
@ 2010-10-19  3:21                                 ` Shaohua Li
  -1 siblings, 0 replies; 116+ messages in thread
From: Shaohua Li @ 2010-10-19  3:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Wu, Fengguang, KOSAKI Motohiro, Andrew Morton, Neil Brown,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Tue, Oct 19, 2010 at 11:09:29AM +0800, Minchan Kim wrote:
> On Tue, Oct 19, 2010 at 12:05 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Tue, Oct 19, 2010 at 10:52:47AM +0800, Minchan Kim wrote:
> >> Hi Wu,
> >>
> >> On Tue, Oct 19, 2010 at 11:35 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >> @@ -2054,10 +2069,11 @@ rebalance:
> >> >>                 goto got_pg;
> >> >>
> >> >>         /*
> >> >> -        * If we failed to make any progress reclaiming, then we are
> >> >> -        * running out of options and have to consider going OOM
> >> >> +        * If we failed to make any progress reclaiming and there aren't
> >> >> +        * many parallel reclaiming, then we are unning out of options and
> >> >> +        * have to consider going OOM
> >> >>          */
> >> >> -       if (!did_some_progress) {
> >> >> +       if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
> >> >>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >> >>                         if (oom_killer_disabled)
> >> >>                                 goto nopage;
> >> >
> >> > This is simply wrong.
> >> >
> >> > It disabled this block for 99% system because there won't be enough
> >> > tasks to make (!too_many_isolated_zone == true). As a result the LRU
> >> > will be scanned like mad and no task get OOMed when it should be.
> >>
> >> If !too_many_isolated_zone is false, it means there are already many
> >> direct reclaiming tasks.
> >> So they could exit reclaim path and !too_many_isolated_zone will be true.
> >> What am I missing now?
> >
> > Ah sorry, my brain get short circuited.. but I still feel uneasy with
> > this change. It's not fixing the root cause and won't prevent too many
> > LRU pages be isolated. It's too late to test too_many_isolated_zone()
> > after direct reclaim returns (after sleeping for a long time).
> >
> 
> Intend to agree.
> I think root cause is a infinite looping in too_many_isolated holding FS lock.
> Would it be simple that too_many_isolated would be bail out after some try?
I'm wondering whether we need the too_many_isolated_zone() logic at all. do_try_to_free_pages()
will keep reporting progress until all zones are unreclaimable, and until that point we
don't invoke the OOM killer. If direct reclaim fails to get a page but made progress, it will sleep.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  3:13                                 ` KOSAKI Motohiro
@ 2010-10-19  5:11                                   ` Minchan Kim
  -1 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-19  5:11 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Wu Fengguang, Andrew Morton, Neil Brown, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Tue, Oct 19, 2010 at 12:13 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> On Tue, Oct 19, 2010 at 12:05 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > On Tue, Oct 19, 2010 at 10:52:47AM +0800, Minchan Kim wrote:
>> >> Hi Wu,
>> >>
>> >> On Tue, Oct 19, 2010 at 11:35 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >> >> @@ -2054,10 +2069,11 @@ rebalance:
>> >> >>                 goto got_pg;
>> >> >>
>> >> >>         /*
>> >> >> -        * If we failed to make any progress reclaiming, then we are
>> >> >> -        * running out of options and have to consider going OOM
>> >> >> +        * If we failed to make any progress reclaiming and there aren't
>> >> >> +        * many parallel reclaiming, then we are unning out of options and
>> >> >> +        * have to consider going OOM
>> >> >>          */
>> >> >> -       if (!did_some_progress) {
>> >> >> +       if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
>> >> >>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>> >> >>                         if (oom_killer_disabled)
>> >> >>                                 goto nopage;
>> >> >
>> >> > This is simply wrong.
>> >> >
>> >> > It disabled this block for 99% system because there won't be enough
>> >> > tasks to make (!too_many_isolated_zone == true). As a result the LRU
>> >> > will be scanned like mad and no task get OOMed when it should be.
>> >>
>> >> If !too_many_isolated_zone is false, it means there are already many
>> >> direct reclaiming tasks.
>> >> So they could exit reclaim path and !too_many_isolated_zone will be true.
>> >> What am I missing now?
>> >
>> > Ah sorry, my brain get short circuited.. but I still feel uneasy with
>> > this change. It's not fixing the root cause and won't prevent too many
>> > LRU pages be isolated. It's too late to test too_many_isolated_zone()
>> > after direct reclaim returns (after sleeping for a long time).
>> >
>>
>> Intend to agree.
>> I think root cause is a infinite looping in too_many_isolated holding FS lock.
>> Would it be simple that too_many_isolated would be bail out after some try?
>
> How?
> A lot of caller don't have good recover logic when memory allocation fail occur.
>

I mean something like the following (rough sketch after the list):

1. shrink_inactive_list
2. if too_many_isolated() has looped more than 5 times, mark some
variable to note that this failure came from concurrent reclaim, and bail out
3. __alloc_pages_slowpath() sees that did_some_progress is zero together with the
mark indicating a bailout caused by concurrent reclaim.
4. Instead of going OOM, congestion_wait() and rebalance.
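
The shape I had in mind, roughly (all names here are invented for illustration,
this is not a real patch):

	/* 2. in shrink_inactive_list(): give up after a few throttle rounds
	 *    and record why, instead of looping forever or faking progress. */
	if (++tries > 5) {
		sc->isolation_bailout = 1;	/* hypothetical flag in scan_control */
		return 0;
	}

	/* 3./4. in __alloc_pages_slowpath(): a bailout caused by concurrent
	 *       reclaim is not a reason to go OOM; wait and try again. */
	if (!did_some_progress && reclaim_bailed_out_on_isolation) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		goto rebalance;
	}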

While implementing it, I realized it makes the code rather ugly, and I think
the loss is bigger than the gain.

Okay. I will drop this idea.

Thanks for advising me, Wu, KOSAKI.
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  3:21                                 ` Shaohua Li
@ 2010-10-19  7:15                                   ` Shaohua Li
  -1 siblings, 0 replies; 116+ messages in thread
From: Shaohua Li @ 2010-10-19  7:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Wu, Fengguang, KOSAKI Motohiro, Andrew Morton, Neil Brown,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Tue, Oct 19, 2010 at 11:21:45AM +0800, Shaohua Li wrote:
> On Tue, Oct 19, 2010 at 11:09:29AM +0800, Minchan Kim wrote:
> > On Tue, Oct 19, 2010 at 12:05 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > On Tue, Oct 19, 2010 at 10:52:47AM +0800, Minchan Kim wrote:
> > >> Hi Wu,
> > >>
> > >> On Tue, Oct 19, 2010 at 11:35 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >> >> @@ -2054,10 +2069,11 @@ rebalance:
> > >> >>                 goto got_pg;
> > >> >>
> > >> >>         /*
> > >> >> -        * If we failed to make any progress reclaiming, then we are
> > >> >> -        * running out of options and have to consider going OOM
> > >> >> +        * If we failed to make any progress reclaiming and there aren't
> > >> >> +        * many parallel reclaiming, then we are unning out of options and
> > >> >> +        * have to consider going OOM
> > >> >>          */
> > >> >> -       if (!did_some_progress) {
> > >> >> +       if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
> > >> >>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > >> >>                         if (oom_killer_disabled)
> > >> >>                                 goto nopage;
> > >> >
> > >> > This is simply wrong.
> > >> >
> > >> > It disabled this block for 99% system because there won't be enough
> > >> > tasks to make (!too_many_isolated_zone == true). As a result the LRU
> > >> > will be scanned like mad and no task get OOMed when it should be.
> > >>
> > >> If !too_many_isolated_zone is false, it means there are already many
> > >> direct reclaiming tasks.
> > >> So they could exit reclaim path and !too_many_isolated_zone will be true.
> > >> What am I missing now?
> > >
> > > Ah sorry, my brain get short circuited.. but I still feel uneasy with
> > > this change. It's not fixing the root cause and won't prevent too many
> > > LRU pages be isolated. It's too late to test too_many_isolated_zone()
> > > after direct reclaim returns (after sleeping for a long time).
> > >
> > 
> > Intend to agree.
> > I think root cause is a infinite looping in too_many_isolated holding FS lock.
> > Would it be simple that too_many_isolated would be bail out after some try?
> I'm wondering if we need too_many_isolated_zone logic. The do_try_to_free_pages
> will return progress till all zones are unreclaimable. Assume before this we
> don't oomkiller. If the direct reclaim fails but has progress, it will sleep.
Not sure if this is clear. What I mean is that we can delete too_many_isolated_zone();
do_try_to_free_pages() can still return 1 until all zones are unreclaimable. Before
that point direct reclaim will not go OOM, because it sees progress and will call
congestion_wait() to sleep. Am I missing anything?
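
For context, a paraphrased sketch of the slow-path logic this relies on (simplified
from mm/page_alloc.c of that era, not literal code):

rebalance:
	page = __alloc_pages_direct_reclaim(/* ... */, &did_some_progress);
	if (page)
		goto got_pg;

	if (!did_some_progress) {
		/* Only a completely fruitless reclaim pass may lead to the
		 * OOM killer (for __GFP_FS, !__GFP_NORETRY allocations). */
	} else if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
		/* Some progress was made: wait for congestion to clear
		 * and retry the allocation instead of going OOM. */
		congestion_wait(BLK_RW_ASYNC, HZ/50);
		goto rebalance;
	}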

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  7:15                                   ` Shaohua Li
@ 2010-10-19  7:34                                     ` Minchan Kim
  -1 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-19  7:34 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Wu, Fengguang, KOSAKI Motohiro, Andrew Morton, Neil Brown,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Tue, Oct 19, 2010 at 4:15 PM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Tue, Oct 19, 2010 at 11:21:45AM +0800, Shaohua Li wrote:
>> On Tue, Oct 19, 2010 at 11:09:29AM +0800, Minchan Kim wrote:
>> > On Tue, Oct 19, 2010 at 12:05 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > > On Tue, Oct 19, 2010 at 10:52:47AM +0800, Minchan Kim wrote:
>> > >> Hi Wu,
>> > >>
>> > >> On Tue, Oct 19, 2010 at 11:35 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > >> >> @@ -2054,10 +2069,11 @@ rebalance:
>> > >> >>                 goto got_pg;
>> > >> >>
>> > >> >>         /*
>> > >> >> -        * If we failed to make any progress reclaiming, then we are
>> > >> >> -        * running out of options and have to consider going OOM
>> > >> >> +        * If we failed to make any progress reclaiming and there aren't
>> > >> >> +        * many parallel reclaiming, then we are unning out of options and
>> > >> >> +        * have to consider going OOM
>> > >> >>          */
>> > >> >> -       if (!did_some_progress) {
>> > >> >> +       if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
>> > >> >>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>> > >> >>                         if (oom_killer_disabled)
>> > >> >>                                 goto nopage;
>> > >> >
>> > >> > This is simply wrong.
>> > >> >
>> > >> > It disabled this block for 99% system because there won't be enough
>> > >> > tasks to make (!too_many_isolated_zone == true). As a result the LRU
>> > >> > will be scanned like mad and no task get OOMed when it should be.
>> > >>
>> > >> If !too_many_isolated_zone is false, it means there are already many
>> > >> direct reclaiming tasks.
>> > >> So they could exit reclaim path and !too_many_isolated_zone will be true.
>> > >> What am I missing now?
>> > >
>> > > Ah sorry, my brain get short circuited.. but I still feel uneasy with
>> > > this change. It's not fixing the root cause and won't prevent too many
>> > > LRU pages be isolated. It's too late to test too_many_isolated_zone()
>> > > after direct reclaim returns (after sleeping for a long time).
>> > >
>> >
>> > Intend to agree.
>> > I think root cause is a infinite looping in too_many_isolated holding FS lock.
>> > Would it be simple that too_many_isolated would be bail out after some try?
>> I'm wondering if we need too_many_isolated_zone logic. The do_try_to_free_pages
>> will return progress till all zones are unreclaimable. Assume before this we
>> don't oomkiller. If the direct reclaim fails but has progress, it will sleep.
> Not sure if this is clear. What I mean is we can delete too_many_isolated_zone,
> do_try_to_free_pages can still return 1 till all zones are unreclaimable. Before
> this direct reclaim will not oom, because it sees progress and will call congestion_wait
> to sleep. Am I missing anything?
>

You mean, could we remove too_many_isolated(), not too_many_isolated_zone(), right?
If so, we can't.

What you say is right:
do_try_to_free_pages() can return 1 until all zones are unreclaimable.
But if we remove the throttling logic that too_many_isolated() provides, too many
processes can enter the direct reclaim path, and they will keep increasing
zone->pages_scanned while the zone's reclaimable pages are decreased by
isolation. In the end, we would reach the "all zones unreclaimable" state much
faster.
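
The "all zones unreclaimable" state hinges on a check of roughly this shape
(paraphrased from memory of vmscan.c of that time; details may differ):

	/*
	 * A zone counts as reclaimable only while the pages scanned so far
	 * are below a few multiples of what could still be reclaimed. Many
	 * concurrent direct reclaimers push pages_scanned up while isolation
	 * temporarily shrinks the reclaimable count, so the check flips sooner.
	 */
	static bool zone_reclaimable(struct zone *zone)
	{
		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
	}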




-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-18 23:11               ` Neil Brown
@ 2010-10-19  8:43                 ` Torsten Kaiser
  -1 siblings, 0 replies; 116+ messages in thread
From: Torsten Kaiser @ 2010-10-19  8:43 UTC (permalink / raw)
  To: Neil Brown
  Cc: Wu Fengguang, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li Shaohua

On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
> On Mon, 18 Oct 2010 12:58:17 +0200
> Torsten Kaiser <just.for.lkml@googlemail.com> wrote:
>
>> On Mon, Oct 18, 2010 at 6:14 AM, Neil Brown <neilb@suse.de> wrote:
>> > Testing shows that this patch seems to work.
>> > The test load (essentially kernbench) doesn't deadlock any more, though it
>> > does get bogged down thrashing in swap so it doesn't make a lot more
>> > progress :-)  I guess that is to be expected.
>>
>> I just noticed this thread, as your mail from today pushed it up.
>>
>> In your original mail you wrote: " I recently had a customer (running
>> 2.6.32) report a deadlock during very intensive IO with lots of
>> processes. " and " Some threads that are blocked there, hold some IO
>> lock (probably in the filesystem) and are trying to allocate memory
>> inside the block device (md/raid1 to be precise) which is allocating
>> with GFP_NOIO and has a mempool to fall back on."
>>
>> I recently had the same problem (intense IO due to swapstorm created
>> by 20 gcc processes hung my system) and after initially blaming the
>> workqueue changes in 2.6.36 Tejun Heo determined that my problem was
>> not the workqueues getting locked up, but that it was cause by an
>> exhausted mempool:
>> http://marc.info/?l=linux-kernel&m=128655737012549&w=2
>>
>> Instrumenting mm/mempool.c and retrying my workload showed that
>> fs_bio_set from fs/bio.c looked like the mempool to blame and the code
>> in drivers/md/raid1.c to be the misuser:
>> http://marc.info/?l=linux-kernel&m=128671179817823&w=2
>>
>> I was even able to reproduce this hang with only using a normal RAID1
>> md device as swapspace and then using dd to fill a tmpfs until
>> swapping was needed:
>> http://marc.info/?l=linux-raid&m=128699402805191&w=2
>>
>> Looking back in the history of raid1.c and bio.c I found the following
>> interesting parts:
>>
>>  * the change to allocate more then one bio via bio_clone() is from
>> 2005, but it looks like it was OK back then, because at that point the
>> fs_bio_set was allocation 256 entries
>>  * in 2007 the size of the mempool was changed from 256 to only 2
>> entries (5972511b77809cb7c9ccdb79b825c54921c5c546 "A single unit is
>> enough, lets scale it down to 2 just to be on the safe side.")
>>  * only in 2009 the comment "To make this work, callers must never
>> allocate more than 1 bio at the time from this pool. Callers that need
>> to allocate more than 1 bio must always submit the previously allocate
>> bio for IO before attempting to allocate a new one. Failure to do so
>> can cause livelocks under memory pressure." was added to bio_alloc()
>> that is the base from my reasoning that raid1.c is broken. (And such a
>> comment was not added to bio_clone() although both calls use the same
>> mempool)
>>
>> So could please look someone into raid1.c to confirm or deny that
>> using multiple bio_clone() (one per drive) before submitting them
>> together could also cause such deadlocks?
>>
>> Thank for looking
>>
>> Torsten
>
> Yes, thanks for the report.
> This is a real bug exactly as you describe.
>
> This is how I think I will fix it, though it needs a bit of review and
> testing before I can be certain.
> Also I need to check raid10 etc to see if they can suffer too.
>
> If you can test it I would really appreciate it.

I did test it, but while it seemed to fix the deadlock, the system
still became unusable.
The still-running "vmstat 1" showed that swapout was still
progressing, but only in bursts of ~20k every 5 to 20 seconds.

I also tried to additionally add Wu's patch:
--- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
+++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
@@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
       }

+       /*
+        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
+        * they won't get blocked by normal ones and form circular deadlock.
+        */
+       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
+               inactive >>= 3;
+
       return isolated > inactive;

Either it did help somewhat, or I was luckier on my second try, but
this time I needed ~5 tries instead of only 2 to get the system mostly
stuck again. On the test run with Wu's patch the writeout pattern was
more stable, a burst of ~80kb every 20 seconds. But I would suspect
that the size of the bursts is rather random.

I do have a complete SysRq+T dump from the first run; I can send it
to anyone who wants it.
(It's 190k, so I don't want to spam the list with it.)


Torsten

> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d44a50f..8122dde 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -784,7 +784,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
>        int i, targets = 0, disks;
>        struct bitmap *bitmap;
>        unsigned long flags;
> -       struct bio_list bl;
>        struct page **behind_pages = NULL;
>        const int rw = bio_data_dir(bio);
>        const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
> @@ -892,13 +891,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
>         * bios[x] to bio
>         */
>        disks = conf->raid_disks;
> -#if 0
> -       { static int first=1;
> -       if (first) printk("First Write sector %llu disks %d\n",
> -                         (unsigned long long)r1_bio->sector, disks);
> -       first = 0;
> -       }
> -#endif
>  retry_write:
>        blocked_rdev = NULL;
>        rcu_read_lock();
> @@ -956,14 +948,15 @@ static int make_request(mddev_t *mddev, struct bio * bio)
>            (behind_pages = alloc_behind_pages(bio)) != NULL)
>                set_bit(R1BIO_BehindIO, &r1_bio->state);
>
> -       atomic_set(&r1_bio->remaining, 0);
> +       atomic_set(&r1_bio->remaining, targets);
>        atomic_set(&r1_bio->behind_remaining, 0);
>
>        do_barriers = bio->bi_rw & REQ_HARDBARRIER;
>        if (do_barriers)
>                set_bit(R1BIO_Barrier, &r1_bio->state);
>
> -       bio_list_init(&bl);
> +       bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
> +                               test_bit(R1BIO_BehindIO, &r1_bio->state));
>        for (i = 0; i < disks; i++) {
>                struct bio *mbio;
>                if (!r1_bio->bios[i])
> @@ -995,30 +988,18 @@ static int make_request(mddev_t *mddev, struct bio * bio)
>                                atomic_inc(&r1_bio->behind_remaining);
>                }
>
> -               atomic_inc(&r1_bio->remaining);
> -
> -               bio_list_add(&bl, mbio);
> +               spin_lock_irqsave(&conf->device_lock, flags);
> +               bio_list_add(&conf->pending_bio_list, mbio);
> +               blk_plug_device(mddev->queue);
> +               spin_unlock_irqrestore(&conf->device_lock, flags);
>        }
>        kfree(behind_pages); /* the behind pages are attached to the bios now */
>
> -       bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
> -                               test_bit(R1BIO_BehindIO, &r1_bio->state));
> -       spin_lock_irqsave(&conf->device_lock, flags);
> -       bio_list_merge(&conf->pending_bio_list, &bl);
> -       bio_list_init(&bl);
> -
> -       blk_plug_device(mddev->queue);
> -       spin_unlock_irqrestore(&conf->device_lock, flags);
> -
>        /* In case raid1d snuck into freeze_array */
>        wake_up(&conf->wait_barrier);
>
>        if (do_sync)
>                md_wakeup_thread(mddev->thread);
> -#if 0
> -       while ((bio = bio_list_pop(&bl)) != NULL)
> -               generic_make_request(bio);
> -#endif
>
>        return 0;
>  }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19  8:43                 ` Torsten Kaiser
@ 2010-10-19 10:06                   ` Torsten Kaiser
  -1 siblings, 0 replies; 116+ messages in thread
From: Torsten Kaiser @ 2010-10-19 10:06 UTC (permalink / raw)
  To: Neil Brown
  Cc: Wu Fengguang, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li Shaohua

On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
<just.for.lkml@googlemail.com> wrote:
> On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
>> Yes, thanks for the report.
>> This is a real bug exactly as you describe.
>>
>> This is how I think I will fix it, though it needs a bit of review and
>> testing before I can be certain.
>> Also I need to check raid10 etc to see if they can suffer too.
>>
>> If you can test it I would really appreciate it.
>
> I did test it, but while it seemed to fix the deadlock, the system
> still got unusable.
> The still running "vmstat 1" showed that the swapout was still
> progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
>
> I also tried to additionally add Wu's patch:
> --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
> +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
> @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
>               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
>       }
>
> +       /*
> +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> +        * they won't get blocked by normal ones and form circular deadlock.
> +        */
> +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> +               inactive >>= 3;
> +
>       return isolated > inactive;
>
> Either it did help somewhat, or I was more lucky on my second try, but
> this time I needed ~5 tries instead of only 2 to get the system mostly
> stuck again. On the testrun with Wu's patch the writeout pattern was
> more stable, a burst of ~80kb each 20 seconds. But I would suspect
> that the size of the burst is rather random.
>
> I do have a complete SysRq+T dump from the first run, I can send that
> to anyone how wants it.
> (It's 190k so I don't want not spam it to the list)

Is this call trace from the SysRq+T output violating the rule to only
allocate one bio from bio_alloc() until it is submitted?

[  549.700038] Call Trace:
[  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
[  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
[  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
[  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
[  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
[  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
[  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
[  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
[  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
[  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
[  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
[  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
[  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
[  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
[  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
[  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
[  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
[  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
[  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
[  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
[  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
[  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
[  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
[  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
[  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
[  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
[  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
[  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
[  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
[  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
[  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
[  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
[  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
[  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
[  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
[  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
ffffffff81073c59
[  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
ffff88011e125fd8
[  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
ffff88011e125fd8

swap_writepage() uses get_swap_bio(), which uses bio_alloc() to get one
bio. That bio is then submitted, but the submit path ends up in
make_request() from raid1.c, and that allocates a second bio from
bio_alloc() via bio_clone().

I am seeing this pattern (swap_writepage() calling
md_make_request()/make_request() and then getting stuck in mempool_alloc())
more than 5 times in the SysRq+T output...
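
For illustration, the pattern in question looks roughly like this in raid1's
make_request() (heavily abbreviated, not the actual code):

	/* One clone per mirror is allocated from fs_bio_set up front... */
	for (i = 0; i < disks; i++) {
		struct bio *mbio;
		if (!r1_bio->bios[i])
			continue;
		mbio = bio_clone(bio, GFP_NOIO);  /* may dip into the 2-entry mempool */
		r1_bio->bios[i] = mbio;
		bio_list_add(&bl, mbio);          /* queued, not yet submitted */
	}
	/*
	 * ...and only after the loop are the clones handed to
	 * generic_make_request(). Under memory pressure several writers can
	 * each hold one reserve entry while sleeping in mempool_alloc() for
	 * the next, and nothing ever gets submitted to free one up.
	 */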


Torsten

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
@ 2010-10-19 10:06                   ` Torsten Kaiser
  0 siblings, 0 replies; 116+ messages in thread
From: Torsten Kaiser @ 2010-10-19 10:06 UTC (permalink / raw)
  To: Neil Brown
  Cc: Wu Fengguang, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li Shaohua

On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
<just.for.lkml@googlemail.com> wrote:
> On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
>> Yes, thanks for the report.
>> This is a real bug exactly as you describe.
>>
>> This is how I think I will fix it, though it needs a bit of review and
>> testing before I can be certain.
>> Also I need to check raid10 etc to see if they can suffer too.
>>
>> If you can test it I would really appreciate it.
>
> I did test it, but while it seemed to fix the deadlock, the system
> still got unusable.
> The still running "vmstat 1" showed that the swapout was still
> progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
>
> I also tried to additionally add Wu's patch:
> --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
> +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
> @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
>               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
>       }
>
> +       /*
> +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> +        * they won't get blocked by normal ones and form circular deadlock.
> +        */
> +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> +               inactive >>= 3;
> +
>       return isolated > inactive;
>
> Either it did help somewhat, or I was more lucky on my second try, but
> this time I needed ~5 tries instead of only 2 to get the system mostly
> stuck again. On the testrun with Wu's patch the writeout pattern was
> more stable, a burst of ~80kb each 20 seconds. But I would suspect
> that the size of the burst is rather random.
>
> I do have a complete SysRq+T dump from the first run, I can send that
> to anyone how wants it.
> (It's 190k so I don't want not spam it to the list)

Is this call trace from the SysRq+T violating the rule to only
allocate one bio from bio_alloc() until it is submitted?

[  549.700038] Call Trace:
[  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
[  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
[  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
[  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
[  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
[  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
[  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
[  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
[  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
[  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
[  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
[  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
[  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
[  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
[  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
[  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
[  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
[  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
[  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
[  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
[  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
[  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
[  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
[  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
[  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
[  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
[  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
[  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
[  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
[  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
[  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
[  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
[  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
[  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
[  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
[  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
ffffffff81073c59
[  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
ffff88011e125fd8
[  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
ffff88011e125fd8

swap_writepage() uses get_swap_bio(), which uses bio_alloc() to get one
bio. That bio is then submitted, but the submit path seems to get into
make_request() from raid1.c, and that allocates a second bio from
bio_alloc() via bio_clone().

I am seeing this pattern (swap_writepage calling
md_make_request/make_request and then getting stuck in mempool_alloc)
more than 5 times in the SysRq+T output...


Torsten


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-19 10:06                   ` Torsten Kaiser
@ 2010-10-20  5:57                     ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-20  5:57 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Neil Brown, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
> On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
> <just.for.lkml@googlemail.com> wrote:
> > On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
> >> Yes, thanks for the report.
> >> This is a real bug exactly as you describe.
> >>
> >> This is how I think I will fix it, though it needs a bit of review and
> >> testing before I can be certain.
> >> Also I need to check raid10 etc to see if they can suffer too.
> >>
> >> If you can test it I would really appreciate it.
> >
> > I did test it, but while it seemed to fix the deadlock, the system
> > still got unusable.
> > The still running "vmstat 1" showed that the swapout was still
> > progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
> >
> > I also tried to additionally add Wu's patch:
> > --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
> > +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
> > @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
> >               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> >       }
> >
> > +       /*
> > +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> > +        * they won't get blocked by normal ones and form circular deadlock.
> > +        */
> > +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> > +               inactive >>= 3;
> > +
> >       return isolated > inactive;
> >
> > Either it did help somewhat, or I was more lucky on my second try, but
> > this time I needed ~5 tries instead of only 2 to get the system mostly
> > stuck again. On the testrun with Wu's patch the writeout pattern was
> > more stable, a burst of ~80kb each 20 seconds. But I would suspect
> > that the size of the burst is rather random.
> >
> > I do have a complete SysRq+T dump from the first run, I can send that
> > to anyone how wants it.
> > (It's 190k so I don't want not spam it to the list)
> 
> Is this call trace from the SysRq+T violation the rule to only
> allocate one bio from bio_alloc() until its submitted?
> 
> [  549.700038] Call Trace:
> [  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> [  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
> [  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
> [  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
> [  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
> [  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
> [  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
> [  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> [  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
> [  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
> [  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
> [  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
> [  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
> [  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
> [  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
> [  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
> [  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
> [  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
> [  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
> [  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
> [  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
> [  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
> [  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
> [  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
> [  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
> [  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> [  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
> [  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
> [  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
> [  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
> [  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
> [  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
> [  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
> ffffffff81073c59
> [  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
> ffff88011e125fd8
> [  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
> ffff88011e125fd8
> 
> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
> bio. That bio is the submitted, but the submit path seems to get into
> make_request from raid1.c and that allocates a second bio from
> bio_alloc() via bio_clone().
> 
> I am seeing this pattern (swap_writepage calling
> md_make_request/make_request and then getting stuck in mempool_alloc)
> more than 5 times in the SysRq+T output...

I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
inside mempool_alloc(), which can be fixed by this patch.

Thanks,
Fengguang
---

concurrent direct page reclaim problem

  __GFP_NORETRY page allocations may fail when there are many concurrent
  page-allocating tasks, but not necessarily because memory is really short.
  The root cause is that tasks first run direct page reclaim to free some
  pages from the LRU lists and put them onto the per-cpu page lists and into
  the buddy system, and then try to get a free page from there.  However, the
  free pages reclaimed by one task may be consumed by other tasks before the
  reclaiming task manages to grab a free page for itself.

  Let's retry a bit harder.

--- linux-next.orig/mm/page_alloc.c	2010-10-20 13:44:50.000000000 +0800
+++ linux-next/mm/page_alloc.c	2010-10-20 13:50:54.000000000 +0800
@@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
 				unsigned long pages_reclaimed)
 {
 	/* Do not loop if specifically requested */
-	if (gfp_mask & __GFP_NORETRY)
+	if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
 		return 0;
 
 	/*
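
As a rough sanity check on the new threshold (a sketch with a made-up
helper name, assuming 4kB pages): for an order-0 allocation the caller
now keeps retrying until it has itself reclaimed more than
1 << 12 = 4096 pages, i.e. about 16MB, before __GFP_NORETRY is honoured
and the allocation is allowed to fail.

	/* illustrative restatement of the patched check, not kernel code */
	static int noretry_gives_up(gfp_t gfp_mask, unsigned int order,
				    unsigned long pages_reclaimed)
	{
		return (gfp_mask & __GFP_NORETRY) &&
		       pages_reclaimed > (1UL << (order + 12));
	}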

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
@ 2010-10-20  5:57                     ` Wu Fengguang
  0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-20  5:57 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Neil Brown, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
> On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
> <just.for.lkml@googlemail.com> wrote:
> > On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
> >> Yes, thanks for the report.
> >> This is a real bug exactly as you describe.
> >>
> >> This is how I think I will fix it, though it needs a bit of review and
> >> testing before I can be certain.
> >> Also I need to check raid10 etc to see if they can suffer too.
> >>
> >> If you can test it I would really appreciate it.
> >
> > I did test it, but while it seemed to fix the deadlock, the system
> > still got unusable.
> > The still running "vmstat 1" showed that the swapout was still
> > progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
> >
> > I also tried to additionally add Wu's patch:
> > --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
> > +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
> > @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
> >               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> >       }
> >
> > +       /*
> > +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> > +        * they won't get blocked by normal ones and form circular deadlock.
> > +        */
> > +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> > +               inactive >>= 3;
> > +
> >       return isolated > inactive;
> >
> > Either it did help somewhat, or I was more lucky on my second try, but
> > this time I needed ~5 tries instead of only 2 to get the system mostly
> > stuck again. On the testrun with Wu's patch the writeout pattern was
> > more stable, a burst of ~80kb each 20 seconds. But I would suspect
> > that the size of the burst is rather random.
> >
> > I do have a complete SysRq+T dump from the first run, I can send that
> > to anyone how wants it.
> > (It's 190k so I don't want not spam it to the list)
> 
> Is this call trace from the SysRq+T violation the rule to only
> allocate one bio from bio_alloc() until its submitted?
> 
> [  549.700038] Call Trace:
> [  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> [  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
> [  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
> [  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
> [  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
> [  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
> [  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
> [  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> [  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
> [  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
> [  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
> [  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
> [  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
> [  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
> [  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
> [  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
> [  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
> [  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
> [  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
> [  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
> [  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
> [  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
> [  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
> [  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
> [  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
> [  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> [  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
> [  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
> [  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
> [  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
> [  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
> [  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
> [  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
> ffffffff81073c59
> [  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
> ffff88011e125fd8
> [  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
> ffff88011e125fd8
> 
> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
> bio. That bio is the submitted, but the submit path seems to get into
> make_request from raid1.c and that allocates a second bio from
> bio_alloc() via bio_clone().
> 
> I am seeing this pattern (swap_writepage calling
> md_make_request/make_request and then getting stuck in mempool_alloc)
> more than 5 times in the SysRq+T output...

I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
inside mempool_alloc(), which can be fixed by this patch.

Thanks,
Fengguang
---

concurrent direct page reclaim problem

  __GFP_NORETRY page allocations may fail when there are many concurrent
  page-allocating tasks, but not necessarily because memory is really short.
  The root cause is that tasks first run direct page reclaim to free some
  pages from the LRU lists and put them onto the per-cpu page lists and into
  the buddy system, and then try to get a free page from there.  However, the
  free pages reclaimed by one task may be consumed by other tasks before the
  reclaiming task manages to grab a free page for itself.

  Let's retry a bit harder.

--- linux-next.orig/mm/page_alloc.c	2010-10-20 13:44:50.000000000 +0800
+++ linux-next/mm/page_alloc.c	2010-10-20 13:50:54.000000000 +0800
@@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
 				unsigned long pages_reclaimed)
 {
 	/* Do not loop if specifically requested */
-	if (gfp_mask & __GFP_NORETRY)
+	if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
 		return 0;
 
 	/*


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-20  5:57                     ` Wu Fengguang
@ 2010-10-20  7:05                       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 116+ messages in thread
From: KOSAKI Motohiro @ 2010-10-20  7:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Torsten Kaiser, Neil Brown, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

> On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
> > On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
> > <just.for.lkml@googlemail.com> wrote:
> > > On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
> > >> Yes, thanks for the report.
> > >> This is a real bug exactly as you describe.
> > >>
> > >> This is how I think I will fix it, though it needs a bit of review and
> > >> testing before I can be certain.
> > >> Also I need to check raid10 etc to see if they can suffer too.
> > >>
> > >> If you can test it I would really appreciate it.
> > >
> > > I did test it, but while it seemed to fix the deadlock, the system
> > > still got unusable.
> > > The still running "vmstat 1" showed that the swapout was still
> > > progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
> > >
> > > I also tried to additionally add Wu's patch:
> > > --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
> > > +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
> > > @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
> > >               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> > >       }
> > >
> > > +       /*
> > > +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> > > +        * they won't get blocked by normal ones and form circular deadlock.
> > > +        */
> > > +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> > > +               inactive >>= 3;
> > > +
> > >       return isolated > inactive;
> > >
> > > Either it did help somewhat, or I was more lucky on my second try, but
> > > this time I needed ~5 tries instead of only 2 to get the system mostly
> > > stuck again. On the testrun with Wu's patch the writeout pattern was
> > > more stable, a burst of ~80kb each 20 seconds. But I would suspect
> > > that the size of the burst is rather random.
> > >
> > > I do have a complete SysRq+T dump from the first run, I can send that
> > > to anyone how wants it.
> > > (It's 190k so I don't want not spam it to the list)
> > 
> > Is this call trace from the SysRq+T violation the rule to only
> > allocate one bio from bio_alloc() until its submitted?
> > 
> > [  549.700038] Call Trace:
> > [  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
> > [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> > [  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
> > [  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
> > [  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
> > [  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
> > [  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
> > [  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
> > [  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
> > [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> > [  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
> > [  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
> > [  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
> > [  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
> > [  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
> > [  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
> > [  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
> > [  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
> > [  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
> > [  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
> > [  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
> > [  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
> > [  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
> > [  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
> > [  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
> > [  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
> > [  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
> > [  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
> > [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> > [  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
> > [  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
> > [  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
> > [  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
> > [  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
> > [  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
> > [  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
> > ffffffff81073c59
> > [  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
> > ffff88011e125fd8
> > [  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
> > ffff88011e125fd8
> > 
> > swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
> > bio. That bio is the submitted, but the submit path seems to get into
> > make_request from raid1.c and that allocates a second bio from
> > bio_alloc() via bio_clone().
> > 
> > I am seeing this pattern (swap_writepage calling
> > md_make_request/make_request and then getting stuck in mempool_alloc)
> > more than 5 times in the SysRq+T output...
> 
> I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
> inside mempool_alloc(), which can be fixed by this patch.
> 
> Thanks,
> Fengguang
> ---
> 
> concurrent direct page reclaim problem
> 
>   __GFP_NORETRY page allocations may fail when there are many concurrent page
>   allocating tasks, but not necessary in real short of memory. The root cause
>   is, tasks will first run direct page reclaim to free some pages from the LRU
>   lists and put them to the per-cpu page lists and the buddy system, and then
>   try to get a free page from there.  However the free pages reclaimed by this
>   task may be consumed by other tasks when the direct reclaim task is able to
>   get the free page for itself.
> 
>   Let's retry it a bit harder.
> 
> --- linux-next.orig/mm/page_alloc.c	2010-10-20 13:44:50.000000000 +0800
> +++ linux-next/mm/page_alloc.c	2010-10-20 13:50:54.000000000 +0800
> @@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
>  				unsigned long pages_reclaimed)
>  {
>  	/* Do not loop if specifically requested */
> -	if (gfp_mask & __GFP_NORETRY)
> +	if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
>  		return 0;
>  
>  	/*

SLUB usually tries a high-order allocation with __GFP_NORETRY first. In
other words, it strongly depends on __GFP_NORETRY not doing any retries.
I'm worried about this...
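
(For reference, the SLUB pattern meant here looks roughly like the
sketch below; this is a simplification of mm/slub.c's allocate_slab()
rather than the exact code: the optimistic high-order attempt is made
with __GFP_NORETRY so it fails fast, and SLUB then falls back to the
minimum order.)

	/* simplified sketch, not the real allocate_slab() */
	static struct page *slub_alloc_sketch(gfp_t flags, int high_order,
					      int min_order)
	{
		struct page *page;

		/* optimistic attempt: must fail fast, hence __GFP_NORETRY */
		page = alloc_pages(flags | __GFP_NOWARN | __GFP_NORETRY,
				   high_order);
		if (!page)
			/* cheap fallback to the order SLUB can live with */
			page = alloc_pages(flags, min_order);

		return page;
	}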

And, in this case, the stuck tasks have PF_MEMALLOC. An allocation failure
under PF_MEMALLOC means the zone really has no free memory at all, so
retrying doesn't solve anything.

And I think the root cause lies elsewhere.

bio_clone() uses fs_bio_set internally.

	struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
	{
	        struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
	...

and fs_bio_set is initialized with a very small pool size.

	#define BIO_POOL_SIZE 2
	static int __init init_bio(void)
	{
		..
	        fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);

So I think raid1.c needs to use its own bioset instead of fs_bio_set;
otherwise, bio pool exhaustion can happen very easily.

But I'm not sure; I'm not an IO expert.
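
For concreteness, a minimal sketch of that suggestion (made-up names,
not the actual raid1 code): the driver creates a private bioset at init
time and clones into it, so its clones no longer compete with the tiny
fs_bio_set reserve.

	/* hypothetical private pool; sized well above BIO_POOL_SIZE (2) */
	static struct bio_set *r1_bio_set;

	static int __init r1_bioset_init(void)
	{
		r1_bio_set = bioset_create(64, 0);
		return r1_bio_set ? 0 : -ENOMEM;
	}

	static struct bio *r1_bio_clone(struct bio *bio, gfp_t gfp_mask)
	{
		struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs,
						 r1_bio_set);

		if (b)
			__bio_clone(b, bio);	/* copy sector, size, bi_io_vec */
		return b;
	}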





^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
@ 2010-10-20  7:05                       ` KOSAKI Motohiro
  0 siblings, 0 replies; 116+ messages in thread
From: KOSAKI Motohiro @ 2010-10-20  7:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Torsten Kaiser, Neil Brown, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

> On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
> > On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
> > <just.for.lkml@googlemail.com> wrote:
> > > On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
> > >> Yes, thanks for the report.
> > >> This is a real bug exactly as you describe.
> > >>
> > >> This is how I think I will fix it, though it needs a bit of review and
> > >> testing before I can be certain.
> > >> Also I need to check raid10 etc to see if they can suffer too.
> > >>
> > >> If you can test it I would really appreciate it.
> > >
> > > I did test it, but while it seemed to fix the deadlock, the system
> > > still got unusable.
> > > The still running "vmstat 1" showed that the swapout was still
> > > progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
> > >
> > > I also tried to additionally add Wu's patch:
> > > --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
> > > +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
> > > @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
> > >               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> > >       }
> > >
> > > +       /*
> > > +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> > > +        * they won't get blocked by normal ones and form circular deadlock.
> > > +        */
> > > +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> > > +               inactive >>= 3;
> > > +
> > >       return isolated > inactive;
> > >
> > > Either it did help somewhat, or I was more lucky on my second try, but
> > > this time I needed ~5 tries instead of only 2 to get the system mostly
> > > stuck again. On the testrun with Wu's patch the writeout pattern was
> > > more stable, a burst of ~80kb each 20 seconds. But I would suspect
> > > that the size of the burst is rather random.
> > >
> > > I do have a complete SysRq+T dump from the first run, I can send that
> > > to anyone how wants it.
> > > (It's 190k so I don't want not spam it to the list)
> > 
> > Is this call trace from the SysRq+T violation the rule to only
> > allocate one bio from bio_alloc() until its submitted?
> > 
> > [  549.700038] Call Trace:
> > [  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
> > [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> > [  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
> > [  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
> > [  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
> > [  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
> > [  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
> > [  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
> > [  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
> > [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> > [  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
> > [  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
> > [  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
> > [  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
> > [  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
> > [  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
> > [  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
> > [  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
> > [  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
> > [  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
> > [  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
> > [  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
> > [  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
> > [  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
> > [  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
> > [  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
> > [  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
> > [  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
> > [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> > [  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
> > [  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
> > [  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
> > [  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
> > [  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
> > [  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
> > [  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
> > ffffffff81073c59
> > [  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
> > ffff88011e125fd8
> > [  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
> > ffff88011e125fd8
> > 
> > swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
> > bio. That bio is the submitted, but the submit path seems to get into
> > make_request from raid1.c and that allocates a second bio from
> > bio_alloc() via bio_clone().
> > 
> > I am seeing this pattern (swap_writepage calling
> > md_make_request/make_request and then getting stuck in mempool_alloc)
> > more than 5 times in the SysRq+T output...
> 
> I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
> inside mempool_alloc(), which can be fixed by this patch.
> 
> Thanks,
> Fengguang
> ---
> 
> concurrent direct page reclaim problem
> 
>   __GFP_NORETRY page allocations may fail when there are many concurrent page
>   allocating tasks, but not necessary in real short of memory. The root cause
>   is, tasks will first run direct page reclaim to free some pages from the LRU
>   lists and put them to the per-cpu page lists and the buddy system, and then
>   try to get a free page from there.  However the free pages reclaimed by this
>   task may be consumed by other tasks when the direct reclaim task is able to
>   get the free page for itself.
> 
>   Let's retry it a bit harder.
> 
> --- linux-next.orig/mm/page_alloc.c	2010-10-20 13:44:50.000000000 +0800
> +++ linux-next/mm/page_alloc.c	2010-10-20 13:50:54.000000000 +0800
> @@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
>  				unsigned long pages_reclaimed)
>  {
>  	/* Do not loop if specifically requested */
> -	if (gfp_mask & __GFP_NORETRY)
> +	if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
>  		return 0;
>  
>  	/*

SLUB usually tries a high-order allocation with __GFP_NORETRY first. In
other words, it strongly depends on __GFP_NORETRY not doing any retries.
I'm worried about this...

And, in this case, the stuck tasks have PF_MEMALLOC. An allocation failure
under PF_MEMALLOC means the zone really has no free memory at all, so
retrying doesn't solve anything.

And I think the root cause lies elsewhere.

bio_clone() uses fs_bio_set internally.

	struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
	{
	        struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
	...

and fs_bio_set is initialized with a very small pool size.

	#define BIO_POOL_SIZE 2
	static int __init init_bio(void)
	{
		..
	        fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);

So I think raid1.c needs to use its own bioset instead of fs_bio_set;
otherwise, bio pool exhaustion can happen very easily.

But I'm not sure; I'm not an IO expert.





^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-20  5:57                     ` Wu Fengguang
@ 2010-10-20  7:25                       ` Torsten Kaiser
  -1 siblings, 0 replies; 116+ messages in thread
From: Torsten Kaiser @ 2010-10-20  7:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Neil Brown, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Wed, Oct 20, 2010 at 7:57 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
>> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
>> bio. That bio is the submitted, but the submit path seems to get into
>> make_request from raid1.c and that allocates a second bio from
>> bio_alloc() via bio_clone().
>>
>> I am seeing this pattern (swap_writepage calling
>> md_make_request/make_request and then getting stuck in mempool_alloc)
>> more than 5 times in the SysRq+T output...
>
> I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
> inside mempool_alloc(), which can be fixed by this patch.

No. I tested the patch (on top of Neil's fix and your patch regarding
too_many_isolated()), but the system got stuck the same way on the
first try to fill the tmpfs.
I think the basic problem is that the mempool that should guarantee
progress is exhausted, because the raid1 device is stacked between the
pageout code and the disks, and so the "use only one bio" rule gets
violated.
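
Spelled out as a hypothetical two-task sequence (with fs_bio_set
reserving only BIO_POOL_SIZE == 2 bios), the violation looks like this:

	/*
	 * task A: swap_writepage() -> bio_alloc()      takes reserved bio #1
	 * task B: swap_writepage() -> bio_alloc()      takes reserved bio #2
	 * task A: raid1 make_request() -> bio_clone()  pool empty, sleeps
	 * task B: raid1 make_request() -> bio_clone()  pool empty, sleeps
	 *
	 * Neither original bio ever reaches the underlying disks, so no IO
	 * completes, nothing is returned to the mempool, and both tasks
	 * wait forever: the mempool's forward-progress guarantee assumed
	 * one bio in flight per task, and the stacked raid1 layer breaks
	 * that assumption.
	 */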

> Thanks,
> Fengguang
> ---
>
> concurrent direct page reclaim problem
>
>  __GFP_NORETRY page allocations may fail when there are many concurrent page
>  allocating tasks, but not necessary in real short of memory. The root cause
>  is, tasks will first run direct page reclaim to free some pages from the LRU
>  lists and put them to the per-cpu page lists and the buddy system, and then
>  try to get a free page from there.  However the free pages reclaimed by this
>  task may be consumed by other tasks when the direct reclaim task is able to
>  get the free page for itself.

I believe the facts disagree with that assumption. My bad for not
posting this before, but I also used SysRq+M to see what's going on,
and each time there still was some free memory.
Here is the SysRq+M output from the run with only Neil's patch applied,
but on each of the other runs the same ~14MB stayed free:

[  437.481365] SysRq : Show Memory
[  437.490003] Mem-Info:
[  437.491357] Node 0 DMA per-cpu:
[  437.500032] CPU    0: hi:    0, btch:   1 usd:   0
[  437.500032] CPU    1: hi:    0, btch:   1 usd:   0
[  437.500032] CPU    2: hi:    0, btch:   1 usd:   0
[  437.500032] CPU    3: hi:    0, btch:   1 usd:   0
[  437.500032] Node 0 DMA32 per-cpu:
[  437.500032] CPU    0: hi:  186, btch:  31 usd: 138
[  437.500032] CPU    1: hi:  186, btch:  31 usd:  30
[  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
[  437.500032] Node 1 DMA32 per-cpu:
[  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
[  437.500032] Node 1 Normal per-cpu:
[  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    2: hi:  186, btch:  31 usd:  25
[  437.500032] CPU    3: hi:  186, btch:  31 usd:  30
[  437.500032] active_anon:2039 inactive_anon:985233 isolated_anon:682
[  437.500032]  active_file:1667 inactive_file:1723 isolated_file:0
[  437.500032]  unevictable:0 dirty:0 writeback:25387 unstable:0
[  437.500032]  free:3471 slab_reclaimable:2840 slab_unreclaimable:6337
[  437.500032]  mapped:1284 shmem:960501 pagetables:523 bounce:0
[  437.500032] Node 0 DMA free:8008kB min:28kB low:32kB high:40kB
active_anon:0kB inactive_anon:7596kB active_file:12kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15768kB
mlocked:0kB dirty:0kB writeback:404kB mapped:0kB shmem:7192kB
slab_reclaimable:32kB slab_unreclaimable:304kB kernel_stack:0kB
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:118 all_unreclaimable? no
[  437.500032] lowmem_reserve[]: 0 2004 2004 2004
[  437.500032] Node 0 DMA32 free:2980kB min:4036kB low:5044kB
high:6052kB active_anon:2844kB inactive_anon:1918424kB
active_file:3428kB inactive_file:3780kB unevictable:0kB
isolated(anon):1232kB isolated(file):0kB present:2052320kB mlocked:0kB
dirty:0kB writeback:72016kB mapped:2232kB shmem:1847640kB
slab_reclaimable:5444kB slab_unreclaimable:13508kB kernel_stack:744kB
pagetables:864kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? no
[  437.500032] lowmem_reserve[]: 0 0 0 0
[  437.500032] Node 1 DMA32 free:2188kB min:3036kB low:3792kB
high:4552kB active_anon:0kB inactive_anon:1555368kB active_file:0kB
inactive_file:28kB unevictable:0kB isolated(anon):768kB
isolated(file):0kB present:1544000kB mlocked:0kB dirty:0kB
writeback:21160kB mapped:0kB shmem:1534960kB slab_reclaimable:3728kB
slab_unreclaimable:7076kB kernel_stack:8kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  437.500032] lowmem_reserve[]: 0 0 505 505
[  437.500032] Node 1 Normal free:708kB min:1016kB low:1268kB
high:1524kB active_anon:5312kB inactive_anon:459544kB
active_file:3228kB inactive_file:3084kB unevictable:0kB
isolated(anon):728kB isolated(file):0kB present:517120kB mlocked:0kB
dirty:0kB writeback:7968kB mapped:2904kB shmem:452212kB
slab_reclaimable:2156kB slab_unreclaimable:4460kB kernel_stack:200kB
pagetables:1228kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:9678 all_unreclaimable? no
[  437.500032] lowmem_reserve[]: 0 0 0 0
[  437.500032] Node 0 DMA: 2*4kB 2*8kB 1*16kB 3*32kB 3*64kB 4*128kB
4*256kB 2*512kB 1*1024kB 2*2048kB 0*4096kB = 8008kB
[  437.500032] Node 0 DMA32: 27*4kB 15*8kB 8*16kB 8*32kB 7*64kB
1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2980kB
[  437.500032] Node 1 DMA32: 1*4kB 6*8kB 3*16kB 1*32kB 0*64kB 1*128kB
0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2308kB
[  437.500032] Node 1 Normal: 39*4kB 13*8kB 10*16kB 3*32kB 1*64kB
1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 708kB
[  437.500032] 989289 total pagecache pages
[  437.500032] 25398 pages in swap cache
[  437.500032] Swap cache stats: add 859204, delete 833806, find 28/39
[  437.500032] Free swap  = 9865628kB
[  437.500032] Total swap = 10000316kB
[  437.500032] 1048575 pages RAM
[  437.500032] 33809 pages reserved
[  437.500032] 7996 pages shared
[  437.500032] 1008521 pages non-shared


>  Let's retry it a bit harder.
>
> --- linux-next.orig/mm/page_alloc.c     2010-10-20 13:44:50.000000000 +0800
> +++ linux-next/mm/page_alloc.c  2010-10-20 13:50:54.000000000 +0800
> @@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
>                                unsigned long pages_reclaimed)
>  {
>        /* Do not loop if specifically requested */
> -       if (gfp_mask & __GFP_NORETRY)
> +       if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
>                return 0;
>
>        /*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
@ 2010-10-20  7:25                       ` Torsten Kaiser
  0 siblings, 0 replies; 116+ messages in thread
From: Torsten Kaiser @ 2010-10-20  7:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Neil Brown, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Wed, Oct 20, 2010 at 7:57 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
>> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
>> bio. That bio is the submitted, but the submit path seems to get into
>> make_request from raid1.c and that allocates a second bio from
>> bio_alloc() via bio_clone().
>>
>> I am seeing this pattern (swap_writepage calling
>> md_make_request/make_request and then getting stuck in mempool_alloc)
>> more than 5 times in the SysRq+T output...
>
> I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
> inside mempool_alloc(), which can be fixed by this patch.

No. I tested the patch (on top of Neil's fix and your patch regarding
too_many_isolated()), but the system got stuck the same way on the
first try to fill the tmpfs.
I think the basic problem is that the mempool that should guarantee
progress is exhausted, because the raid1 device is stacked between the
pageout code and the disks, and so the "use only one bio" rule gets
violated.

> Thanks,
> Fengguang
> ---
>
> concurrent direct page reclaim problem
>
>  __GFP_NORETRY page allocations may fail when there are many concurrent page
>  allocating tasks, but not necessary in real short of memory. The root cause
>  is, tasks will first run direct page reclaim to free some pages from the LRU
>  lists and put them to the per-cpu page lists and the buddy system, and then
>  try to get a free page from there.  However the free pages reclaimed by this
>  task may be consumed by other tasks when the direct reclaim task is able to
>  get the free page for itself.

I believe the facts disagree with that assumption. My bad for not
posting this before, but I also used SysRq+M to see what's going on,
and each time there still was some free memory.
Here is the SysRq+M output from the run with only Neil's patch applied,
but on each of the other runs the same ~14MB stayed free:

[  437.481365] SysRq : Show Memory
[  437.490003] Mem-Info:
[  437.491357] Node 0 DMA per-cpu:
[  437.500032] CPU    0: hi:    0, btch:   1 usd:   0
[  437.500032] CPU    1: hi:    0, btch:   1 usd:   0
[  437.500032] CPU    2: hi:    0, btch:   1 usd:   0
[  437.500032] CPU    3: hi:    0, btch:   1 usd:   0
[  437.500032] Node 0 DMA32 per-cpu:
[  437.500032] CPU    0: hi:  186, btch:  31 usd: 138
[  437.500032] CPU    1: hi:  186, btch:  31 usd:  30
[  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
[  437.500032] Node 1 DMA32 per-cpu:
[  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
[  437.500032] Node 1 Normal per-cpu:
[  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
[  437.500032] CPU    2: hi:  186, btch:  31 usd:  25
[  437.500032] CPU    3: hi:  186, btch:  31 usd:  30
[  437.500032] active_anon:2039 inactive_anon:985233 isolated_anon:682
[  437.500032]  active_file:1667 inactive_file:1723 isolated_file:0
[  437.500032]  unevictable:0 dirty:0 writeback:25387 unstable:0
[  437.500032]  free:3471 slab_reclaimable:2840 slab_unreclaimable:6337
[  437.500032]  mapped:1284 shmem:960501 pagetables:523 bounce:0
[  437.500032] Node 0 DMA free:8008kB min:28kB low:32kB high:40kB
active_anon:0kB inactive_anon:7596kB active_file:12kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15768kB
mlocked:0kB dirty:0kB writeback:404kB mapped:0kB shmem:7192kB
slab_reclaimable:32kB slab_unreclaimable:304kB kernel_stack:0kB
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:118 all_unreclaimable? no
[  437.500032] lowmem_reserve[]: 0 2004 2004 2004
[  437.500032] Node 0 DMA32 free:2980kB min:4036kB low:5044kB
high:6052kB active_anon:2844kB inactive_anon:1918424kB
active_file:3428kB inactive_file:3780kB unevictable:0kB
isolated(anon):1232kB isolated(file):0kB present:2052320kB mlocked:0kB
dirty:0kB writeback:72016kB mapped:2232kB shmem:1847640kB
slab_reclaimable:5444kB slab_unreclaimable:13508kB kernel_stack:744kB
pagetables:864kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? no
[  437.500032] lowmem_reserve[]: 0 0 0 0
[  437.500032] Node 1 DMA32 free:2188kB min:3036kB low:3792kB
high:4552kB active_anon:0kB inactive_anon:1555368kB active_file:0kB
inactive_file:28kB unevictable:0kB isolated(anon):768kB
isolated(file):0kB present:1544000kB mlocked:0kB dirty:0kB
writeback:21160kB mapped:0kB shmem:1534960kB slab_reclaimable:3728kB
slab_unreclaimable:7076kB kernel_stack:8kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  437.500032] lowmem_reserve[]: 0 0 505 505
[  437.500032] Node 1 Normal free:708kB min:1016kB low:1268kB
high:1524kB active_anon:5312kB inactive_anon:459544kB
active_file:3228kB inactive_file:3084kB unevictable:0kB
isolated(anon):728kB isolated(file):0kB present:517120kB mlocked:0kB
dirty:0kB writeback:7968kB mapped:2904kB shmem:452212kB
slab_reclaimable:2156kB slab_unreclaimable:4460kB kernel_stack:200kB
pagetables:1228kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:9678 all_unreclaimable? no
[  437.500032] lowmem_reserve[]: 0 0 0 0
[  437.500032] Node 0 DMA: 2*4kB 2*8kB 1*16kB 3*32kB 3*64kB 4*128kB
4*256kB 2*512kB 1*1024kB 2*2048kB 0*4096kB = 8008kB
[  437.500032] Node 0 DMA32: 27*4kB 15*8kB 8*16kB 8*32kB 7*64kB
1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2980kB
[  437.500032] Node 1 DMA32: 1*4kB 6*8kB 3*16kB 1*32kB 0*64kB 1*128kB
0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2308kB
[  437.500032] Node 1 Normal: 39*4kB 13*8kB 10*16kB 3*32kB 1*64kB
1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 708kB
[  437.500032] 989289 total pagecache pages
[  437.500032] 25398 pages in swap cache
[  437.500032] Swap cache stats: add 859204, delete 833806, find 28/39
[  437.500032] Free swap  = 9865628kB
[  437.500032] Total swap = 10000316kB
[  437.500032] 1048575 pages RAM
[  437.500032] 33809 pages reserved
[  437.500032] 7996 pages shared
[  437.500032] 1008521 pages non-shared


>  Let's retry it a bit harder.
>
> --- linux-next.orig/mm/page_alloc.c     2010-10-20 13:44:50.000000000 +0800
> +++ linux-next/mm/page_alloc.c  2010-10-20 13:50:54.000000000 +0800
> @@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
>                                unsigned long pages_reclaimed)
>  {
>        /* Do not loop if specifically requested */
> -       if (gfp_mask & __GFP_NORETRY)
> +       if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
>                return 0;
>
>        /*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-20  7:25                       ` Torsten Kaiser
@ 2010-10-20  9:01                         ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-20  9:01 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Neil Brown, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Wed, Oct 20, 2010 at 03:25:49PM +0800, Torsten Kaiser wrote:
> On Wed, Oct 20, 2010 at 7:57 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
> >> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
> >> bio. That bio is the submitted, but the submit path seems to get into
> >> make_request from raid1.c and that allocates a second bio from
> >> bio_alloc() via bio_clone().
> >>
> >> I am seeing this pattern (swap_writepage calling
> >> md_make_request/make_request and then getting stuck in mempool_alloc)
> >> more than 5 times in the SysRq+T output...
> >
> > I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
> > inside mempool_alloc(), which can be fixed by this patch.
> 
> No. I tested the patch (ontop of Neils fix and your patch regarding
> too_many_isolated()), but the system got stuck the same way on the
> first try to fill the tmpfs.
> I think the basic problem is, that the mempool that should guarantee
> progress is exhausted because the raid1 device is stacked between the
> pageout code and the disks and so the "use only 1 bio"-rule gets
> violated.

The mempool gets exhausted because pool->alloc() failed at least 2
times. But there is no such high memory pressure, apart from some
parallel reclaimers. It seems the patch below does not completely
stop the page allocation failures, hence does not stop the deadlock.

As you and KOSAKI said, the root cause is BIO_POOL_SIZE being smaller
than the total possible allocations in the IO stack. Then why not bump
BIO_POOL_SIZE up to something like 64? That would be large enough to
allow multiple stacked IO layers.

And the larger value will allow more IOs in flight concurrently, for
better IO throughput in this situation. Commit 5972511b7 lowered it
from 256 to 2 in the belief that pool->alloc() would only fail in a
real OOM situation. However, the truth is that __GFP_NORETRY
allocations fail much more easily in _normal_ operation (whenever
there are multiple concurrent page reclaimers). We have to be able to
perform better in that situation.  The __GFP_NORETRY patch to reduce
failures is one option; increasing BIO_POOL_SIZE is another.

So would you try this fix?

--- linux-next.orig/include/linux/bio.h	2010-10-20 16:55:57.000000000 +0800
+++ linux-next/include/linux/bio.h	2010-10-20 16:56:54.000000000 +0800
@@ -286,7 +286,7 @@ static inline void bio_set_completion_cp
  * These memory pools in turn all allocate from the bio_slab
  * and the bvec_slabs[].
  */
-#define BIO_POOL_SIZE 2
+#define BIO_POOL_SIZE	64
 #define BIOVEC_NR_POOLS 6
 #define BIOVEC_MAX_IDX	(BIOVEC_NR_POOLS - 1)
 
Thanks,
Fengguang

> > ---
> >
> > concurrent direct page reclaim problem
> >
> >  __GFP_NORETRY page allocations may fail when there are many concurrent page
> >  allocating tasks, but not necessary in real short of memory. The root cause
> >  is, tasks will first run direct page reclaim to free some pages from the LRU
> >  lists and put them to the per-cpu page lists and the buddy system, and then
> >  try to get a free page from there.  However the free pages reclaimed by this
> >  task may be consumed by other tasks when the direct reclaim task is able to
> >  get the free page for itself.
> 
> I believe the facts disagree with that assumtion. My bad for not
> posting this before, but I also used SysRq+M to see whats going on,
> but each time there still was some free memory.
> Here is the SysRq+M output from the run with only Neils patch applied,
> but on each other run the same ~14Mb stayed free
> 
> [  437.481365] SysRq : Show Memory
> [  437.490003] Mem-Info:
> [  437.491357] Node 0 DMA per-cpu:
> [  437.500032] CPU    0: hi:    0, btch:   1 usd:   0
> [  437.500032] CPU    1: hi:    0, btch:   1 usd:   0
> [  437.500032] CPU    2: hi:    0, btch:   1 usd:   0
> [  437.500032] CPU    3: hi:    0, btch:   1 usd:   0
> [  437.500032] Node 0 DMA32 per-cpu:
> [  437.500032] CPU    0: hi:  186, btch:  31 usd: 138
> [  437.500032] CPU    1: hi:  186, btch:  31 usd:  30
> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
> [  437.500032] Node 1 DMA32 per-cpu:
> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
> [  437.500032] Node 1 Normal per-cpu:
> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    2: hi:  186, btch:  31 usd:  25
> [  437.500032] CPU    3: hi:  186, btch:  31 usd:  30
> [  437.500032] active_anon:2039 inactive_anon:985233 isolated_anon:682
> [  437.500032]  active_file:1667 inactive_file:1723 isolated_file:0
> [  437.500032]  unevictable:0 dirty:0 writeback:25387 unstable:0
> [  437.500032]  free:3471 slab_reclaimable:2840 slab_unreclaimable:6337
> [  437.500032]  mapped:1284 shmem:960501 pagetables:523 bounce:0
> [  437.500032] Node 0 DMA free:8008kB min:28kB low:32kB high:40kB
> active_anon:0kB inact
> ive_anon:7596kB active_file:12kB inactive_file:0kB unevictable:0kB
> isolated(anon):0kB i
> solated(file):0kB present:15768kB mlocked:0kB dirty:0kB
> writeback:404kB mapped:0kB shme
> m:7192kB slab_reclaimable:32kB slab_unreclaimable:304kB
> kernel_stack:0kB pagetables:0kB
>  unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:118
> all_unreclaimable? no
> [  437.500032] lowmem_reserve[]: 0 2004 2004 2004
> [  437.500032] Node 0 DMA32 free:2980kB min:4036kB low:5044kB
> high:6052kB active_anon:2
> 844kB inactive_anon:1918424kB active_file:3428kB inactive_file:3780kB
> unevictable:0kB isolated(anon):1232kB isolated(file):0kB
> present:2052320kB mlocked:0kB dirty:0kB writeback:72016kB
> mapped:2232kB shmem:1847640kB slab_reclaimable:5444kB
> slab_unreclaimable:13508kB kernel_stack:744kB pagetables:864kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? no
> [  437.500032] lowmem_reserve[]: 0 0 0 0
> [  437.500032] Node 1 DMA32 free:2188kB min:3036kB low:3792kB
> high:4552kB active_anon:0kB inactive_anon:1555368kB active_file:0kB
> inactive_file:28kB unevictable:0kB isolated(anon):768kB
> isolated(file):0kB present:1544000kB mlocked:0kB dirty:0kB
> writeback:21160kB mapped:0kB shmem:1534960kB slab_reclaimable:3728kB
> slab_unreclaimable:7076kB kernel_stack:8kB pagetables:0kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [  437.500032] lowmem_reserve[]: 0 0 505 505
> [  437.500032] Node 1 Normal free:708kB min:1016kB low:1268kB
> high:1524kB active_anon:5312kB inactive_anon:459544kB
> active_file:3228kB inactive_file:3084kB unevictable:0kB
> isolated(anon):728kB isolated(file):0kB present:517120kB mlocked:0kB
> dirty:0kB writeback:7968kB mapped:2904kB shmem:452212kB
> slab_reclaimable:2156kB slab_unreclaimable:4460kB kernel_stack:200kB
> pagetables:1228kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:9678 all_unreclaimable? no
> [  437.500032] lowmem_reserve[]: 0 0 0 0
> [  437.500032] Node 0 DMA: 2*4kB 2*8kB 1*16kB 3*32kB 3*64kB 4*128kB
> 4*256kB 2*512kB 1*1024kB 2*2048kB 0*4096kB = 8008kB
> [  437.500032] Node 0 DMA32: 27*4kB 15*8kB 8*16kB 8*32kB 7*64kB
> 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2980kB
> [  437.500032] Node 1 DMA32: 1*4kB 6*8kB 3*16kB 1*32kB 0*64kB 1*128kB
> 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2308kB
> [  437.500032] Node 1 Normal: 39*4kB 13*8kB 10*16kB 3*32kB 1*64kB
> 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 708kB
> [  437.500032] 989289 total pagecache pages
> [  437.500032] 25398 pages in swap cache
> [  437.500032] Swap cache stats: add 859204, delete 833806, find 28/39
> [  437.500032] Free swap  = 9865628kB
> [  437.500032] Total swap = 10000316kB
> [  437.500032] 1048575 pages RAM
> [  437.500032] 33809 pages reserved
> [  437.500032] 7996 pages shared
> [  437.500032] 1008521 pages non-shared
> 
> 
> >  Let's retry it a bit harder.
> >
> > --- linux-next.orig/mm/page_alloc.c     2010-10-20 13:44:50.000000000 +0800
> > +++ linux-next/mm/page_alloc.c  2010-10-20 13:50:54.000000000 +0800
> > @@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
> >                                unsigned long pages_reclaimed)
> >  {
> >        /* Do not loop if specifically requested */
> > -       if (gfp_mask & __GFP_NORETRY)
> > +       if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
> >                return 0;
> >
> >        /*

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-20  7:05                       ` KOSAKI Motohiro
@ 2010-10-20  9:27                         ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-20  9:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Torsten Kaiser, Neil Brown, Rik van Riel, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua,
	Jens Axboe

On Wed, Oct 20, 2010 at 03:05:56PM +0800, KOSAKI Motohiro wrote:
> > On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
> > > On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
> > > <just.for.lkml@googlemail.com> wrote:
> > > > On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
> > > >> Yes, thanks for the report.
> > > >> This is a real bug exactly as you describe.
> > > >>
> > > >> This is how I think I will fix it, though it needs a bit of review and
> > > >> testing before I can be certain.
> > > >> Also I need to check raid10 etc to see if they can suffer too.
> > > >>
> > > >> If you can test it I would really appreciate it.
> > > >
> > > > I did test it, but while it seemed to fix the deadlock, the system
> > > > still became unusable.
> > > > The still running "vmstat 1" showed that the swapout was still
> > > > progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
> > > >
> > > > I also tried to additionally add Wu's patch:
> > > > --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
> > > > +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
> > > > @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
> > > >               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> > > >       }
> > > >
> > > > +       /*
> > > > +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> > > > +        * they won't get blocked by normal ones and form circular deadlock.
> > > > +        */
> > > > +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> > > > +               inactive >>= 3;
> > > > +
> > > >       return isolated > inactive;
> > > >
> > > > Either it did help somewhat, or I was more lucky on my second try, but
> > > > this time I needed ~5 tries instead of only 2 to get the system mostly
> > > > stuck again. On the testrun with Wu's patch the writeout pattern was
> > > > more stable, a burst of ~80kb each 20 seconds. But I would suspect
> > > > that the size of the burst is rather random.
> > > >
> > > > I do have a complete SysRq+T dump from the first run, I can send that
> > > > to anyone who wants it.
> > > > (It's 190k so I don't want to spam it to the list)
> > > 
> > > Is this call trace from the SysRq+T violating the rule to only
> > > allocate one bio from bio_alloc() until it's submitted?
> > > 
> > > [  549.700038] Call Trace:
> > > [  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
> > > [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> > > [  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
> > > [  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
> > > [  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
> > > [  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
> > > [  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
> > > [  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
> > > [  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
> > > [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> > > [  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
> > > [  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
> > > [  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
> > > [  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
> > > [  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
> > > [  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
> > > [  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
> > > [  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
> > > [  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
> > > [  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
> > > [  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
> > > [  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
> > > [  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
> > > [  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
> > > [  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
> > > [  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
> > > [  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
> > > [  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
> > > [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> > > [  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
> > > [  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
> > > [  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
> > > [  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
> > > [  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
> > > [  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
> > > [  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
> > > ffffffff81073c59
> > > [  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
> > > ffff88011e125fd8
> > > [  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
> > > ffff88011e125fd8
> > > 
> > > swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
> > > bio. That bio is then submitted, but the submit path seems to get into
> > > make_request from raid1.c and that allocates a second bio from
> > > bio_alloc() via bio_clone().
> > > 
> > > I am seeing this pattern (swap_writepage calling
> > > md_make_request/make_request and then getting stuck in mempool_alloc)
> > > more than 5 times in the SysRq+T output...
> > 
> > I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
> > inside mempool_alloc(), which can be fixed by this patch.
> > 
> > Thanks,
> > Fengguang
> > ---
> > 
> > concurrent direct page reclaim problem
> > 
> >   __GFP_NORETRY page allocations may fail when there are many concurrent
> >   page-allocating tasks, but not necessarily when memory is really short.
> >   The root cause is that tasks first run direct page reclaim to free some
> >   pages from the LRU lists into the per-cpu page lists and the buddy system,
> >   and then try to get a free page from there.  However, the free pages
> >   reclaimed by this task may be consumed by other tasks before the direct
> >   reclaim task is able to get a free page for itself.
> > 
> >   Let's retry it a bit harder.
> > 
> > --- linux-next.orig/mm/page_alloc.c	2010-10-20 13:44:50.000000000 +0800
> > +++ linux-next/mm/page_alloc.c	2010-10-20 13:50:54.000000000 +0800
> > @@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
> >  				unsigned long pages_reclaimed)
> >  {
> >  	/* Do not loop if specifically requested */
> > -	if (gfp_mask & __GFP_NORETRY)
> > +	if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
> >  		return 0;
> >  
> >  	/*
> 
> SLUB usually tries a high-order allocation with __GFP_NORETRY first. In
> other words, it strongly depends on __GFP_NORETRY not retrying at all. I'm
> worried about this...

Right. I noticed that too. Hopefully the "limited" retry won't impact
it too much. That said, we do need a better solution than such hacks.

> And, in this case, the stuck tasks have PF_MEMALLOC. An allocation failure
> with PF_MEMALLOC means this zone genuinely has zero free memory. So retrying
> doesn't solve anything.

The zone has no free (buddy) memory, but has plenty of reclaimable pages.
The concurrent page reclaimers may steal pages reclaimed by this task
from time to time, but not always. So retry reclaiming will help.
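
Schematically, the race looks like this (an illustration only; both tasks
are in the allocator slow path):

    task A (direct reclaim)              task B (another allocator)
    -----------------------              --------------------------
    reclaims, say, 32 pages into
    the buddy / per-cpu lists
                                         allocates those 32 pages first
    get_page_from_freelist() -> NULL
    __GFP_NORETRY -> allocation fails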

> And I think the root cause lies elsewhere.
> 
> bio_clone() uses fs_bio_set internally.
> 
> 	struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> 	{
> 	        struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
> 	...
> 
> and fs_bio_set is initialized with a very small pool size.
> 
> 	#define BIO_POOL_SIZE 2
> 	static int __init init_bio(void)
> 	{
> 		..
> 	        fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);

Agreed. BIO_POOL_SIZE=2 is too small to be deadlock free.

> So I think raid1.c needs to use its own bioset instead of fs_bio_set;
> otherwise, bio pool exhaustion can happen very easily.

That would fix the deadlock, but it is not enough for good IO throughput
when multiple CPUs are trying to submit IO. Increasing BIO_POOL_SIZE
to a larger value should help fix both the deadlock and the IO throughput.

> But I'm not sure. I'm not an IO expert.

[add CC to Jens]

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-20  9:01                         ` Wu Fengguang
@ 2010-10-20 10:07                           ` Torsten Kaiser
  -1 siblings, 0 replies; 116+ messages in thread
From: Torsten Kaiser @ 2010-10-20 10:07 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Neil Brown, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li, Shaohua

On Wed, Oct 20, 2010 at 11:01 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Wed, Oct 20, 2010 at 03:25:49PM +0800, Torsten Kaiser wrote:
>> On Wed, Oct 20, 2010 at 7:57 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
>> >> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
>> >> bio. That bio is then submitted, but the submit path seems to get into
>> >> make_request from raid1.c and that allocates a second bio from
>> >> bio_alloc() via bio_clone().
>> >>
>> >> I am seeing this pattern (swap_writepage calling
>> >> md_make_request/make_request and then getting stuck in mempool_alloc)
>> >> more than 5 times in the SysRq+T output...
>> >
>> > I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
>> > inside mempool_alloc(), which can be fixed by this patch.
>>
>> No. I tested the patch (on top of Neil's fix and your patch regarding
>> too_many_isolated()), but the system got stuck the same way on the
>> first try to fill the tmpfs.
>> I think the basic problem is that the mempool that should guarantee
>> progress is exhausted because the raid1 device is stacked between the
>> pageout code and the disks and so the "use only 1 bio"-rule gets
>> violated.
>
> The mempool gets exhausted because pool->alloc() failed at least 2
> times. But there is no such high memory pressure, apart from some
> parallel reclaimers. It seems the below patch does not completely
> stop the page allocation failures, hence does not stop the deadlock.
>
> As you and KOSAKI said, the root cause is BIO_POOL_SIZE being smaller
> than the total possible allocations in the IO stack. Then why not
> bump up BIO_POOL_SIZE to something like 64? It will be large enough
> to allow multiple stacked IO layers.
>
> And the larger value will allow more concurrent in-flight IOs for better
> IO throughput in such a situation. Commit 5972511b7 lowered it from 256
> to 2 because it believed that pool->alloc() would only fail in a genuine
> OOM situation. However, the truth is that __GFP_NORETRY allocations fail
> much more easily in _normal_ operation (whenever there are multiple
> concurrent page reclaimers). We have to be able to perform better in
> such a situation.  The __GFP_NORETRY patch to reduce failures is one
> option; increasing BIO_POOL_SIZE is another.
>
> So would you try this fix?

While it seems to fix the hang (vanilla -rc8 with just this change, even
without Neil's fix to raid1.c, did not hang during multiple runs of my
test script), I believe this is not a fix for the real problem.

To quote the comment above bio_alloc() from fs/bio.c:
 *      If %__GFP_WAIT is set, then bio_alloc will always be able to allocate
 *      a bio. This is due to the mempool guarantees. To make this work, callers
 *      must never allocate more than 1 bio at a time from this pool. Callers
 *      that need to allocate more than 1 bio must always submit the previously
 *      allocated bio for IO before attempting to allocate a new one. Failure to
 *      do so can cause livelocks under memory pressure.

So it seems that limiting fs_bio_set to only 2 entries was intended.
And the commit comment from the change that reduced this from 256 to 2
even said that this pool is only intended as a last resort to sustain
progress.

And I think in my testcase the important thing is not good
performance, but only to make sure the system does not hang.
Both of the situations that caused the hang for me were more cases of
"what not to do" than anything that should perform well. In the
original case the system started too many gcc's because I'm too lazy to
figure out a better way to organize the parallelization of independent
singlethreaded compiles and parallel makes. The Gentoo package manager
tries to use the load average for that, but this is foiled if a
compile first has a singlethreaded part (like configure) and only
later switches to parallel compiles. So portage started 4 package
compilations, because during configure the load was low, and then the
system had to deal with 20 gcc's (make -j5) eating all of its memory.
And even that seemed to happen during only one part of the compiles,
since in the non-hanging cases the swapping soon stopped. So it would
just have to survive the initial overallocation with the small mempool
and everything would be fine.
My reduced testcase is even more useless as an example for a real
load: I'm just using multiple dd's to fill a tmpfs as fast as I can to
see if raid1.c::make_request() breaks under memory pressure. And here
too the only goal should be that the kernel survives this abuse.

Sorry, but increasing BIO_POOL_SIZE just looks like papering over the
real problem...


Torsten

> --- linux-next.orig/include/linux/bio.h 2010-10-20 16:55:57.000000000 +0800
> +++ linux-next/include/linux/bio.h      2010-10-20 16:56:54.000000000 +0800
> @@ -286,7 +286,7 @@ static inline void bio_set_completion_cp
>  * These memory pools in turn all allocate from the bio_slab
>  * and the bvec_slabs[].
>  */
> -#define BIO_POOL_SIZE 2
> +#define BIO_POOL_SIZE  64
>  #define BIOVEC_NR_POOLS 6
>  #define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-20  9:27                         ` Wu Fengguang
@ 2010-10-20 13:03                           ` Jens Axboe
  -1 siblings, 0 replies; 116+ messages in thread
From: Jens Axboe @ 2010-10-20 13:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Torsten Kaiser, Neil Brown, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

On 2010-10-20 11:27, Wu Fengguang wrote:
> On Wed, Oct 20, 2010 at 03:05:56PM +0800, KOSAKI Motohiro wrote:
>>> On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
>>>> On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
>>>> <just.for.lkml@googlemail.com> wrote:
>>>>> On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
>>>>>> Yes, thanks for the report.
>>>>>> This is a real bug exactly as you describe.
>>>>>>
>>>>>> This is how I think I will fix it, though it needs a bit of review and
>>>>>> testing before I can be certain.
>>>>>> Also I need to check raid10 etc to see if they can suffer too.
>>>>>>
>>>>>> If you can test it I would really appreciate it.
>>>>>
>>>>> I did test it, but while it seemed to fix the deadlock, the system
>>>>> still became unusable.
>>>>> The still running "vmstat 1" showed that the swapout was still
>>>>> progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
>>>>>
>>>>> I also tried to additionally add Wu's patch:
>>>>> --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
>>>>> +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
>>>>> @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
>>>>>               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
>>>>>       }
>>>>>
>>>>> +       /*
>>>>> +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
>>>>> +        * they won't get blocked by normal ones and form circular deadlock.
>>>>> +        */
>>>>> +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
>>>>> +               inactive >>= 3;
>>>>> +
>>>>>       return isolated > inactive;
>>>>>
>>>>> Either it did help somewhat, or I was more lucky on my second try, but
>>>>> this time I needed ~5 tries instead of only 2 to get the system mostly
>>>>> stuck again. On the testrun with Wu's patch the writeout pattern was
>>>>> more stable, a burst of ~80kb each 20 seconds. But I would suspect
>>>>> that the size of the burst is rather random.
>>>>>
>>>>> I do have a complete SysRq+T dump from the first run, I can send that
>>>>> to anyone who wants it.
>>>>> (It's 190k so I don't want to spam it to the list)
>>>>
>>>> Is this call trace from the SysRq+T violating the rule to only
>>>> allocate one bio from bio_alloc() until it's submitted?
>>>>
>>>> [  549.700038] Call Trace:
>>>> [  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
>>>> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
>>>> [  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
>>>> [  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
>>>> [  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
>>>> [  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
>>>> [  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
>>>> [  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
>>>> [  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
>>>> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
>>>> [  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
>>>> [  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
>>>> [  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
>>>> [  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
>>>> [  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
>>>> [  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
>>>> [  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
>>>> [  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
>>>> [  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
>>>> [  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
>>>> [  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
>>>> [  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
>>>> [  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
>>>> [  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
>>>> [  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
>>>> [  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
>>>> [  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
>>>> [  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
>>>> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
>>>> [  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
>>>> [  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
>>>> [  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
>>>> [  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
>>>> [  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
>>>> [  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
>>>> [  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
>>>> ffffffff81073c59
>>>> [  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
>>>> ffff88011e125fd8
>>>> [  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
>>>> ffff88011e125fd8
>>>>
>>>> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
>>>> bio. That bio is then submitted, but the submit path seems to get into
>>>> make_request from raid1.c and that allocates a second bio from
>>>> bio_alloc() via bio_clone().
>>>>
>>>> I am seeing this pattern (swap_writepage calling
>>>> md_make_request/make_request and then getting stuck in mempool_alloc)
>>>> more than 5 times in the SysRq+T output...
>>>
>>> I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
>>> inside mempool_alloc(), which can be fixed by this patch.
>>>
>>> Thanks,
>>> Fengguang
>>> ---
>>>
>>> concurrent direct page reclaim problem
>>>
>>>   __GFP_NORETRY page allocations may fail when there are many concurrent
>>>   page-allocating tasks, but not necessarily when memory is really short.
>>>   The root cause is that tasks first run direct page reclaim to free some
>>>   pages from the LRU lists into the per-cpu page lists and the buddy system,
>>>   and then try to get a free page from there.  However, the free pages
>>>   reclaimed by this task may be consumed by other tasks before the direct
>>>   reclaim task is able to get a free page for itself.
>>>
>>>   Let's retry it a bit harder.
>>>
>>> --- linux-next.orig/mm/page_alloc.c	2010-10-20 13:44:50.000000000 +0800
>>> +++ linux-next/mm/page_alloc.c	2010-10-20 13:50:54.000000000 +0800
>>> @@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
>>>  				unsigned long pages_reclaimed)
>>>  {
>>>  	/* Do not loop if specifically requested */
>>> -	if (gfp_mask & __GFP_NORETRY)
>>> +	if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
>>>  		return 0;
>>>  
>>>  	/*
>>
>> SLUB usually tries a high-order allocation with __GFP_NORETRY first. In
>> other words, it strongly depends on __GFP_NORETRY not retrying at all. I'm
>> worried about this...
> 
> Right. I noticed that too. Hopefully the "limited" retry won't impact
> it too much. That said, we do need a better solution than such hacks.
> 
>> And, in this case, the stuck tasks have PF_MEMALLOC. An allocation failure
>> with PF_MEMALLOC means this zone genuinely has zero free memory. So retrying
>> doesn't solve anything.
> 
> The zone has no free (buddy) memory, but has plenty of reclaimable pages.
> The concurrent page reclaimers may steal pages reclaimed by this task
> from time to time, but not always. So retry reclaiming will help.
> 
>> And I think the root cause lies elsewhere.
>>
>> bio_clone() uses fs_bio_set internally.
>>
>> 	struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
>> 	{
>> 	        struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
>> 	...
>>
>> and fs_bio_set is initialized with a very small pool size.
>>
>> 	#define BIO_POOL_SIZE 2
>> 	static int __init init_bio(void)
>> 	{
>> 		..
>> 	        fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
> 
> Agreed. BIO_POOL_SIZE=2 is too small to be deadlock free.
> 
>> So I think raid1.c needs to use its own bioset instead of fs_bio_set;
>> otherwise, bio pool exhaustion can happen very easily.
> 
> That would fix the deadlock, but it is not enough for good IO throughput
> when multiple CPUs are trying to submit IO. Increasing BIO_POOL_SIZE
> to a larger value should help fix both the deadlock and the IO throughput.
> 
>> But I'm not sure. I'm not an IO expert.
> 
> [add CC to Jens]

We surely need 1 set aside for each level of that stack that will
potentially consume one. 1 should be enough for the generic pool, and
then clones will use a separate pool. So md and friends should really
have a pool per device, so that stacking will always work properly.

There should be no throughput concerns, it should purely be a safeguard
measure to prevent us deadlocking when doing IO for reclaim.
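
Roughly, a per-device pool could look something like this (an untested
sketch, not the actual raid1 code; the struct and function names are made
up for illustration):

	#include <linux/bio.h>

	struct r1_pool {
		struct bio_set *bs;	/* private bioset for this md device */
	};

	static int r1_pool_init(struct r1_pool *p)
	{
		/* reserve enough bios for the clones this layer may hold */
		p->bs = bioset_create(16, 0);
		return p->bs ? 0 : -ENOMEM;
	}

	static struct bio *r1_clone_bio(struct r1_pool *p, struct bio *orig)
	{
		/* clone from the private pool instead of fs_bio_set */
		struct bio *b = bio_alloc_bioset(GFP_NOIO, orig->bi_max_vecs, p->bs);

		if (b)
			__bio_clone(b, orig);	/* copy fields from the original bio */
		return b;
	}

	static void r1_pool_exit(struct r1_pool *p)
	{
		bioset_free(p->bs);
	}

That way fs_bio_set keeps its reserved bios for the top-level caller and
each stacked layer draws only from its own reserve.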

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
@ 2010-10-20 13:03                           ` Jens Axboe
  0 siblings, 0 replies; 116+ messages in thread
From: Jens Axboe @ 2010-10-20 13:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Torsten Kaiser, Neil Brown, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

On 2010-10-20 11:27, Wu Fengguang wrote:
> On Wed, Oct 20, 2010 at 03:05:56PM +0800, KOSAKI Motohiro wrote:
>>> On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
>>>> On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
>>>> <just.for.lkml@googlemail.com> wrote:
>>>>> On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
>>>>>> Yes, thanks for the report.
>>>>>> This is a real bug exactly as you describe.
>>>>>>
>>>>>> This is how I think I will fix it, though it needs a bit of review and
>>>>>> testing before I can be certain.
>>>>>> Also I need to check raid10 etc to see if they can suffer too.
>>>>>>
>>>>>> If you can test it I would really appreciate it.
>>>>>
>>>>> I did test it, but while it seemed to fix the deadlock, the system
>>>>> still got unusable.
>>>>> The still running "vmstat 1" showed that the swapout was still
>>>>> progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
>>>>>
>>>>> I also tried to additionally add Wu's patch:
>>>>> --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
>>>>> +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
>>>>> @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
>>>>>               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
>>>>>       }
>>>>>
>>>>> +       /*
>>>>> +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
>>>>> +        * they won't get blocked by normal ones and form circular deadlock.
>>>>> +        */
>>>>> +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
>>>>> +               inactive >>= 3;
>>>>> +
>>>>>       return isolated > inactive;
>>>>>
>>>>> Either it did help somewhat, or I was more lucky on my second try, but
>>>>> this time I needed ~5 tries instead of only 2 to get the system mostly
>>>>> stuck again. On the testrun with Wu's patch the writeout pattern was
>>>>> more stable, a burst of ~80kb each 20 seconds. But I would suspect
>>>>> that the size of the burst is rather random.
>>>>>
>>>>> I do have a complete SysRq+T dump from the first run, I can send that
>>>>> to anyone how wants it.
>>>>> (It's 190k so I don't want not spam it to the list)
>>>>
>>>> Is this call trace from the SysRq+T violation the rule to only
>>>> allocate one bio from bio_alloc() until its submitted?
>>>>
>>>> [  549.700038] Call Trace:
>>>> [  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
>>>> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
>>>> [  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
>>>> [  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
>>>> [  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
>>>> [  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
>>>> [  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
>>>> [  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
>>>> [  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
>>>> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
>>>> [  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
>>>> [  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
>>>> [  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
>>>> [  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
>>>> [  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
>>>> [  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
>>>> [  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
>>>> [  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
>>>> [  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
>>>> [  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
>>>> [  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
>>>> [  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
>>>> [  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
>>>> [  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
>>>> [  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
>>>> [  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
>>>> [  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
>>>> [  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
>>>> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
>>>> [  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
>>>> [  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
>>>> [  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
>>>> [  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
>>>> [  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
>>>> [  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
>>>> [  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
>>>> ffffffff81073c59
>>>> [  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
>>>> ffff88011e125fd8
>>>> [  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
>>>> ffff88011e125fd8
>>>>
>>>> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
>>>> bio. That bio is the submitted, but the submit path seems to get into
>>>> make_request from raid1.c and that allocates a second bio from
>>>> bio_alloc() via bio_clone().
>>>>
>>>> I am seeing this pattern (swap_writepage calling
>>>> md_make_request/make_request and then getting stuck in mempool_alloc)
>>>> more than 5 times in the SysRq+T output...
>>>
>>> I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
>>> inside mempool_alloc(), which can be fixed by this patch.
>>>
>>> Thanks,
>>> Fengguang
>>> ---
>>>
>>> concurrent direct page reclaim problem
>>>
>>>   __GFP_NORETRY page allocations may fail when there are many concurrent page
>>>   allocating tasks, but not necessarily when memory is really short. The root cause
>>>   is that tasks first run direct page reclaim to free some pages from the LRU
>>>   lists and put them on the per-cpu page lists and the buddy system, and then
>>>   try to get a free page from there.  However, the free pages reclaimed by this
>>>   task may be consumed by other tasks before the direct reclaim task is able to
>>>   get a free page for itself.
>>>
>>>   Let's retry it a bit harder.
>>>
>>> --- linux-next.orig/mm/page_alloc.c	2010-10-20 13:44:50.000000000 +0800
>>> +++ linux-next/mm/page_alloc.c	2010-10-20 13:50:54.000000000 +0800
>>> @@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
>>>  				unsigned long pages_reclaimed)
>>>  {
>>>  	/* Do not loop if specifically requested */
>>> -	if (gfp_mask & __GFP_NORETRY)
>>> +	if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
>>>  		return 0;
>>>  
>>>  	/*
>>
>> SLUB usually tries a high-order allocation with __GFP_NORETRY first. In
>> other words, it strongly depends on __GFP_NORETRY not retrying at all. I'm
>> worried about this...
> 
> Right. I noticed that too. Hopefully the "limited" retry won't impact
> it too much. That said, we do need a better solution than such hacks.
> 
>> And, in this case, the stuck tasks have PF_MEMALLOC. An allocation failure with
>> PF_MEMALLOC means this zone simply has no free memory at all. So retrying doesn't solve anything.
> 
> The zone has no free (buddy) memory, but has plenty of reclaimable pages.
> The concurrent page reclaimers may steal pages reclaimed by this task
> from time to time, but not always. So retrying reclaim will help.
> 
>> And I think the root cause lies elsewhere.
>>
>> bio_clone() use fs_bio_set internally.
>>
>> 	struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
>> 	{
>> 	        struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
>> 	...
>>
>> and fs_bio_set is initialized very small pool size.
>>
>> 	#define BIO_POOL_SIZE 2
>> 	static int __init init_bio(void)
>> 	{
>> 		..
>> 	        fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
> 
> Agreed. BIO_POOL_SIZE=2 is too small to be deadlock free.
> 
>> So, I think raid1.c needs to use its own bioset instead of fs_bio_set.
>> Otherwise, bio pool exhaustion can happen very easily.
> 
> That would fix the deadlock, but is not enough for good IO throughput
> when multiple CPUs are trying to submit IO. Increasing BIO_POOL_SIZE
> to a large value should help fix both the deadlock and IO throughput.
> 
>> But I'm not sure. I'm not IO expert.
> 
> [add CC to Jens]

We surely need 1 set aside for each level of that stack that will
potentially consume one. 1 should be enough for the generic pool, and
then clones will use a separate pool. So md and friends should really
have a pool per device, so that stacking will always work properly.

There should be no throughput concerns; it should purely be a safeguard
measure to prevent us from deadlocking when doing IO for reclaim.
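
As a rough sketch of what such a per-device pool could look like for
raid1 (hypothetical field and function names, not actual md code, and
the pool size would want some thought), the array would create its own
bio_set at setup time and clone from that instead of from fs_bio_set:

	/* Hypothetical sketch only -- names are made up, this is not
	 * actual md/raid1 code. The idea: give each array its own
	 * reserve pool and clone from it instead of from fs_bio_set. */
	struct r1_private_data {
		/* ... */
		struct bio_set	*bs;	/* per-device bio reserve */
	};

	static int r1_create_bio_pool(struct r1_private_data *conf)
	{
		conf->bs = bioset_create(BIO_POOL_SIZE, 0);
		return conf->bs ? 0 : -ENOMEM;
	}

	static struct bio *r1_clone_bio(struct r1_private_data *conf,
					struct bio *bio, gfp_t gfp_mask)
	{
		/* like bio_clone(), but allocated from the private pool */
		struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs,
						 conf->bs);
		if (b)
			__bio_clone(b, bio);
		return b;
	}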

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-20  7:25                       ` Torsten Kaiser
@ 2010-10-20 14:23                         ` Minchan Kim
  -1 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-20 14:23 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Wu Fengguang, Neil Brown, Rik van Riel, Andrew Morton,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

Hello

On Wed, Oct 20, 2010 at 09:25:49AM +0200, Torsten Kaiser wrote:
> On Wed, Oct 20, 2010 at 7:57 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
> >> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
> >> bio. That bio is the submitted, but the submit path seems to get into
> >> make_request from raid1.c and that allocates a second bio from
> >> bio_alloc() via bio_clone().
> >>
> >> I am seeing this pattern (swap_writepage calling
> >> md_make_request/make_request and then getting stuck in mempool_alloc)
> >> more than 5 times in the SysRq+T output...
> >
> > I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
> > inside mempool_alloc(), which can be fixed by this patch.
> 
> No. I tested the patch (on top of Neil's fix and your patch regarding
> too_many_isolated()), but the system got stuck the same way on the
> first try to fill the tmpfs.
> I think the basic problem is that the mempool that should guarantee
> progress is exhausted, because the raid1 device is stacked between the
> pageout code and the disks and so the "use only 1 bio" rule gets
> violated.
> 
> > Thanks,
> > Fengguang
> > ---
> >
> > concurrent direct page reclaim problem
> >
> > __GFP_NORETRY page allocations may fail when there are many concurrent page
> > allocating tasks, but not necessarily when memory is really short. The root cause
> > is that tasks first run direct page reclaim to free some pages from the LRU
> > lists and put them on the per-cpu page lists and the buddy system, and then
> > try to get a free page from there.  However, the free pages reclaimed by this
> > task may be consumed by other tasks before the direct reclaim task is able to
> > get a free page for itself.
> 
> I believe the facts disagree with that assumption. My bad for not
> posting this before, but I also used SysRq+M to see what's going on,
> but each time there still was some free memory.
> Here is the SysRq+M output from the run with only Neil's patch applied,
> but on each other run the same ~14MB stayed free.


What is your problem? (Sorry if you have explained it several times.)
I read the thread.
It seems Wu's patch solved the deadlock problem caused by FS lock holding and too_many_isolated.
What problem remains in your case? An unusable system due to the swapstorm?
If so, I think that is expected behavior. Please see the comments below.
(If I didn't catch your point, please explain your problem.)

> 
> [  437.481365] SysRq : Show Memory
> [  437.490003] Mem-Info:
> [  437.491357] Node 0 DMA per-cpu:
> [  437.500032] CPU    0: hi:    0, btch:   1 usd:   0
> [  437.500032] CPU    1: hi:    0, btch:   1 usd:   0
> [  437.500032] CPU    2: hi:    0, btch:   1 usd:   0
> [  437.500032] CPU    3: hi:    0, btch:   1 usd:   0
> [  437.500032] Node 0 DMA32 per-cpu:
> [  437.500032] CPU    0: hi:  186, btch:  31 usd: 138
> [  437.500032] CPU    1: hi:  186, btch:  31 usd:  30
> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
> [  437.500032] Node 1 DMA32 per-cpu:
> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
> [  437.500032] Node 1 Normal per-cpu:
> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
> [  437.500032] CPU    2: hi:  186, btch:  31 usd:  25
> [  437.500032] CPU    3: hi:  186, btch:  31 usd:  30
> [  437.500032] active_anon:2039 inactive_anon:985233 isolated_anon:682
> [  437.500032]  active_file:1667 inactive_file:1723 isolated_file:0
> [  437.500032]  unevictable:0 dirty:0 writeback:25387 unstable:0
> [  437.500032]  free:3471 slab_reclaimable:2840 slab_unreclaimable:6337
> [  437.500032]  mapped:1284 shmem:960501 pagetables:523 bounce:0
> [  437.500032] Node 0 DMA free:8008kB min:28kB low:32kB high:40kB
> active_anon:0kB inact
> ive_anon:7596kB active_file:12kB inactive_file:0kB unevictable:0kB
> isolated(anon):0kB i
> solated(file):0kB present:15768kB mlocked:0kB dirty:0kB
> writeback:404kB mapped:0kB shme
> m:7192kB slab_reclaimable:32kB slab_unreclaimable:304kB
> kernel_stack:0kB pagetables:0kB
>  unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:118
> all_unreclaimable? no
> [  437.500032] lowmem_reserve[]: 0 2004 2004 2004

Node 0 DMA: free 8008K but lowmem_reserve 8012K (2004 pages),
so the page allocator can't allocate from this zone unless the preferred zone is DMA.

> [  437.500032] Node 0 DMA32 free:2980kB min:4036kB low:5044kB
> high:6052kB active_anon:2
> 844kB inactive_anon:1918424kB active_file:3428kB inactive_file:3780kB
> unevictable:0kB isolated(anon):1232kB isolated(file):0kB
> present:2052320kB mlocked:0kB dirty:0kB writeback:72016kB
> mapped:2232kB shmem:1847640kB slab_reclaimable:5444kB
> slab_unreclaimable:13508kB kernel_stack:744kB pagetables:864kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? no
> [  437.500032] lowmem_reserve[]: 0 0 0 0

Node 0 DMA32: free 2980K but min 4036K.
Few file LRU pages compared to anon LRU.

Normally, it could fail to allocate a page.
'Normally' means the caller doesn't request alloc_pages with __GFP_HIGH or drop __GFP_WAIT;
generally, many call sites don't pass a gfp_mask with __GFP_HIGH or !__GFP_WAIT.
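
Roughly, the per-zone check the allocator applies looks like this (a
simplified sketch of the zone_watermark_ok() logic; higher-order checks
and other details omitted):

	/* simplified: can this zone satisfy a normal allocation? */
	long min = zone->watermark[WMARK_MIN];

	if (alloc_flags & ALLOC_HIGH)		/* __GFP_HIGH callers */
		min -= min / 2;			/* may dig deeper */
	if (free_pages <= min + zone->lowmem_reserve[classzone_idx])
		return 0;			/* zone is skipped */

So with free 2980K against min 4036K, an ordinary GFP_KERNEL/GFP_NOIO
allocation fails the watermark here and has to reclaim or fall back.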

> [  437.500032] Node 1 DMA32 free:2188kB min:3036kB low:3792kB
> high:4552kB active_anon:0kB inactive_anon:1555368kB active_file:0kB
> inactive_file:28kB unevictable:0kB isolated(anon):768kB
> isolated(file):0kB present:1544000kB mlocked:0kB dirty:0kB
> writeback:21160kB mapped:0kB shmem:1534960kB slab_reclaimable:3728kB
> slab_unreclaimable:7076kB kernel_stack:8kB pagetables:0kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [  437.500032] lowmem_reserve[]: 0 0 505 505

Node 1 DMA32: free 2188K, min 3036K.
It's the same situation as Node 0 DMA32.
Normally, it could fail to allocate a page.
Few file LRU pages compared to anon LRU.


> [  437.500032] Node 1 Normal free:708kB min:1016kB low:1268kB
> high:1524kB active_anon:5312kB inactive_anon:459544kB
> active_file:3228kB inactive_file:3084kB unevictable:0kB
> isolated(anon):728kB isolated(file):0kB present:517120kB mlocked:0kB
> dirty:0kB writeback:7968kB mapped:2904kB shmem:452212kB
> slab_reclaimable:2156kB slab_unreclaimable:4460kB kernel_stack:200kB
> pagetables:1228kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:9678 all_unreclaimable? no
> [  437.500032] lowmem_reserve[]: 0 0 0 0

Node 1 Normal: free 708K, min 1016K.
Normally, it could fail to allocate a page.
Few file LRU pages compared to anon LRU.

> [  437.500032] Node 0 DMA: 2*4kB 2*8kB 1*16kB 3*32kB 3*64kB 4*128kB
> 4*256kB 2*512kB 1*1024kB 2*2048kB 0*4096kB = 8008kB
> [  437.500032] Node 0 DMA32: 27*4kB 15*8kB 8*16kB 8*32kB 7*64kB
> 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2980kB
> [  437.500032] Node 1 DMA32: 1*4kB 6*8kB 3*16kB 1*32kB 0*64kB 1*128kB
> 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2308kB
> [  437.500032] Node 1 Normal: 39*4kB 13*8kB 10*16kB 3*32kB 1*64kB
> 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 708kB
> [  437.500032] 989289 total pagecache pages
> [  437.500032] 25398 pages in swap cache
> [  437.500032] Swap cache stats: add 859204, delete 833806, find 28/39
> [  437.500032] Free swap  = 9865628kB
> [  437.500032] Total swap = 10000316kB
> [  437.500032] 1048575 pages RAM
> [  437.500032] 33809 pages reserved
> [  437.500032] 7996 pages shared
> [  437.500032] 1008521 pages non-shared
> 
None of the zones has enough free pages, and none has many file LRU pages.
So swapout is expected behavior, I think.
It means your workload exceeds your system's available DRAM size.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-20 14:23                         ` Minchan Kim
@ 2010-10-20 15:35                           ` Torsten Kaiser
  -1 siblings, 0 replies; 116+ messages in thread
From: Torsten Kaiser @ 2010-10-20 15:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Wu Fengguang, Neil Brown, Rik van Riel, Andrew Morton,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua, Jens Axboe

On Wed, Oct 20, 2010 at 4:23 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> Hello
>
> On Wed, Oct 20, 2010 at 09:25:49AM +0200, Torsten Kaiser wrote:
>> On Wed, Oct 20, 2010 at 7:57 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
>> >> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
>> >> bio. That bio is the submitted, but the submit path seems to get into
>> >> make_request from raid1.c and that allocates a second bio from
>> >> bio_alloc() via bio_clone().
>> >>
>> >> I am seeing this pattern (swap_writepage calling
>> >> md_make_request/make_request and then getting stuck in mempool_alloc)
>> >> more than 5 times in the SysRq+T output...
>> >
>> > I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
>> > inside mempool_alloc(), which can be fixed by this patch.
>>
>> No. I tested the patch (ontop of Neils fix and your patch regarding
>> too_many_isolated()), but the system got stuck the same way on the
>> first try to fill the tmpfs.
>> I think the basic problem is, that the mempool that should guarantee
>> progress is exhausted because the raid1 device is stacked between the
>> pageout code and the disks and so the "use only 1 bio"-rule gets
>> violated.
>>
>> > Thanks,
>> > Fengguang
>> > ---
>> >
>> > concurrent direct page reclaim problem
>> >
>> > __GFP_NORETRY page allocations may fail when there are many concurrent page
>> > allocating tasks, but not necessarily when memory is really short. The root cause
>> > is that tasks first run direct page reclaim to free some pages from the LRU
>> > lists and put them on the per-cpu page lists and the buddy system, and then
>> > try to get a free page from there.  However, the free pages reclaimed by this
>> > task may be consumed by other tasks before the direct reclaim task is able to
>> > get a free page for itself.
>>
>> I believe the facts disagree with that assumtion. My bad for not
>> posting this before, but I also used SysRq+M to see whats going on,
>> but each time there still was some free memory.
>> Here is the SysRq+M output from the run with only Neils patch applied,
>> but on each other run the same ~14Mb stayed free
>
>
> What is your problem?(Sorry if you explained it several time).

The original problem was that using too many gcc's caused a swapstorm
that completely hung my system.
I first blamed it on the workqueue changes in 2.6.36-rc1 and/or their
interaction with XFS (because in -rc5 a workqueue-related problem in
XFS got fixed), but Tejun Heo found out that a) it was exhausted
mempools and not the workqueues, and b) the problem itself existed
at least in 2.6.35 already. In
http://marc.info/?l=linux-raid&m=128699402805191&w=2 I have described a
simpler testcase that I found after looking more closely into the
mempools.

Short story: swapping over RAID1 (drivers/md/raid1.c) can cause a
system hang, because it is using too much of the fs_bio_set mempool
from fs/bio.c.

> I read the thread.
> It seems Wu's patch solved deadlock problem by FS lock holding and too_many_isolated.
> What is the problem remained in your case? unusable system by swapstorm?
> If it is, I think it's expected behavior. Please see the below comment.
> (If I don't catch your point, Please explain your problem.)

I do not have a problem if the system becomes unusable *during* a
swapstorm, but it should recover. That is not the case on my system.
With Wu's too_many_isolated patch and Neil's patch against raid1.c the
system no longer seems to be completely stuck (a swapout rate of
~80kB every 20 seconds still happens), but I would still expect a
better recovery time. (At that rate the recovery would probably take a
few days...)
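(Rough arithmetic behind "a few days": ~80kB per 20 seconds is about
4kB/s, so even a single GB of outstanding writeback at that rate would
need roughly 1GB / 4kB/s, i.e. on the order of 260,000 seconds, or
about three days.)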

>> [  437.481365] SysRq : Show Memory
>> [  437.490003] Mem-Info:
>> [  437.491357] Node 0 DMA per-cpu:
>> [  437.500032] CPU    0: hi:    0, btch:   1 usd:   0
>> [  437.500032] CPU    1: hi:    0, btch:   1 usd:   0
>> [  437.500032] CPU    2: hi:    0, btch:   1 usd:   0
>> [  437.500032] CPU    3: hi:    0, btch:   1 usd:   0
>> [  437.500032] Node 0 DMA32 per-cpu:
>> [  437.500032] CPU    0: hi:  186, btch:  31 usd: 138
>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:  30
>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
>> [  437.500032] Node 1 DMA32 per-cpu:
>> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
>> [  437.500032] Node 1 Normal per-cpu:
>> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:  25
>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:  30
>> [  437.500032] active_anon:2039 inactive_anon:985233 isolated_anon:682
>> [  437.500032]  active_file:1667 inactive_file:1723 isolated_file:0
>> [  437.500032]  unevictable:0 dirty:0 writeback:25387 unstable:0
>> [  437.500032]  free:3471 slab_reclaimable:2840 slab_unreclaimable:6337
>> [  437.500032]  mapped:1284 shmem:960501 pagetables:523 bounce:0
>> [  437.500032] Node 0 DMA free:8008kB min:28kB low:32kB high:40kB
>> active_anon:0kB inact
>> ive_anon:7596kB active_file:12kB inactive_file:0kB unevictable:0kB
>> isolated(anon):0kB i
>> solated(file):0kB present:15768kB mlocked:0kB dirty:0kB
>> writeback:404kB mapped:0kB shme
>> m:7192kB slab_reclaimable:32kB slab_unreclaimable:304kB
>> kernel_stack:0kB pagetables:0kB
>>  unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:118
>> all_unreclaimable? no
>> [  437.500032] lowmem_reserve[]: 0 2004 2004 2004
>
> Node 0 DMA : free 8008K but lowmem_reserve 8012K(2004 pages)
> So page allocator can't allocate the page unless preferred zone is DMA
>
>> [  437.500032] Node 0 DMA32 free:2980kB min:4036kB low:5044kB
>> high:6052kB active_anon:2
>> 844kB inactive_anon:1918424kB active_file:3428kB inactive_file:3780kB
>> unevictable:0kB isolated(anon):1232kB isolated(file):0kB
>> present:2052320kB mlocked:0kB dirty:0kB writeback:72016kB
>> mapped:2232kB shmem:1847640kB slab_reclaimable:5444kB
>> slab_unreclaimable:13508kB kernel_stack:744kB pagetables:864kB
>> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
>> all_unreclaimable? no
>> [  437.500032] lowmem_reserve[]: 0 0 0 0
>
> Node 0 DMA32 : free 2980K but min 4036K.
> Few file LRU compare to anon LRU

In the testcase I fill a tmpfs as fast as I can with data from
/dev/zero. So nearly everything gets swapped out and only the last
written data from the tmpfs fills all RAM. (I have 4GB RAM, the tmpfs
is limited to 6GB, 16 dd's are writing into it)

> Normally, it could fail to allocate the page.
> 'Normal' means caller doesn't request alloc_pages with __GFP_HIGH or !__GFP_WAIT
> Generally many call sites don't pass gfp_flag with __GFP_HIGH|!__GFP_WAIT.
>
>> [  437.500032] Node 1 DMA32 free:2188kB min:3036kB low:3792kB
>> high:4552kB active_anon:0kB inactive_anon:1555368kB active_file:0kB
>> inactive_file:28kB unevictable:0kB isolated(anon):768kB
>> isolated(file):0kB present:1544000kB mlocked:0kB dirty:0kB
>> writeback:21160kB mapped:0kB shmem:1534960kB slab_reclaimable:3728kB
>> slab_unreclaimable:7076kB kernel_stack:8kB pagetables:0kB unstable:0kB
>> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>> [  437.500032] lowmem_reserve[]: 0 0 505 505
>
> Node 1 DMA32 free : 2188K min 3036K
> It's a same situation with Node 0 DMA32.
> Normally, it could fail to allocate the page.
> Few file LRU compare to anon LRU
>
>
>> [  437.500032] Node 1 Normal free:708kB min:1016kB low:1268kB
>> high:1524kB active_anon:5312kB inactive_anon:459544kB
>> active_file:3228kB inactive_file:3084kB unevictable:0kB
>> isolated(anon):728kB isolated(file):0kB present:517120kB mlocked:0kB
>> dirty:0kB writeback:7968kB mapped:2904kB shmem:452212kB
>> slab_reclaimable:2156kB slab_unreclaimable:4460kB kernel_stack:200kB
>> pagetables:1228kB unstable:0kB bounce:0kB writeback_tmp:0kB
>> pages_scanned:9678 all_unreclaimable? no
>> [  437.500032] lowmem_reserve[]: 0 0 0 0
>
> Node 1 Normal : free 708K min 1016K
> Normally, it could fail to allocate the page.
> Few file LRU compare to anon LRU
>
>> [  437.500032] Node 0 DMA: 2*4kB 2*8kB 1*16kB 3*32kB 3*64kB 4*128kB
>> 4*256kB 2*512kB 1*1024kB 2*2048kB 0*4096kB = 8008kB
>> [  437.500032] Node 0 DMA32: 27*4kB 15*8kB 8*16kB 8*32kB 7*64kB
>> 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2980kB
>> [  437.500032] Node 1 DMA32: 1*4kB 6*8kB 3*16kB 1*32kB 0*64kB 1*128kB
>> 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2308kB
>> [  437.500032] Node 1 Normal: 39*4kB 13*8kB 10*16kB 3*32kB 1*64kB
>> 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 708kB
>> [  437.500032] 989289 total pagecache pages
>> [  437.500032] 25398 pages in swap cache
>> [  437.500032] Swap cache stats: add 859204, delete 833806, find 28/39
>> [  437.500032] Free swap  = 9865628kB
>> [  437.500032] Total swap = 10000316kB
>> [  437.500032] 1048575 pages RAM
>> [  437.500032] 33809 pages reserved
>> [  437.500032] 7996 pages shared
>> [  437.500032] 1008521 pages non-shared
>>
> All zones don't have enough pages and don't have enough file lru pages.
> So swapout is expected behavior, I think.
> It means your workload exceeds your system available DRAM size.

Yes, as intended. I wanted to create many writes to a RAID1 device
under memory pressure to show/verify that the current use of mempools
in raid1.c is buggered.

That is not really any sane workload; it is literally just there to
create a swapstorm and then see if the system survives it.

The problem is that the system is not surviving it: bio allocations
fail in raid1.c and it falls back to the fs_bio_set mempool. But that
mempool is only 2 entries big, because you should only ever use one of
its entries at a time. But the current mainline code in raid1.c
allocates one bio per drive before submitting any of them -> that bug
is fixed by Neil's patch and I would have expected that to fix my hang.
But it seems that there is an additional problem, so the mempool still
gets emptied. And that means that no writeback happens any longer, and
so the kernel can't swap out and gets stuck.
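
Roughly, the pattern I mean looks like this (a simplified sketch, not
the literal raid1.c code; names shortened):

	/*
	 * Sketch: every clone is taken from the shared fs_bio_set before
	 * anything is submitted, so with BIO_POOL_SIZE == 2 a single
	 * writer on a two-disk mirror can already pin both reserve entries.
	 */
	for (i = 0; i < conf->raid_disks; i++)
		r1_bio->bios[i] = bio_clone(master_bio, GFP_NOIO);

	/* only now do the clones actually get submitted */
	for (i = 0; i < conf->raid_disks; i++)
		generic_make_request(r1_bio->bios[i]);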

I think the last mail from Jens Axboe is the correct answer, not
increasing the fs_bio_set mempool size via BIO_POOL_SIZE.

But should that go even further: Forbid any use of bio_alloc() and
bio_clone() in any device drivers? Or at the very least in all device
drivers that could be used for swapspace?

Torsten

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-20 15:35                           ` Torsten Kaiser
@ 2010-10-20 23:31                             ` Minchan Kim
  -1 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-20 23:31 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Wu Fengguang, Neil Brown, Rik van Riel, Andrew Morton,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua, Jens Axboe

On Thu, Oct 21, 2010 at 12:35 AM, Torsten Kaiser
<just.for.lkml@googlemail.com> wrote:
> On Wed, Oct 20, 2010 at 4:23 PM, Minchan Kim <minchan.kim@gmail.com> wrote:

< SNIP>

>> What is your problem?(Sorry if you explained it several time).
>
> The original problem was that using too many gcc's caused a swapstorm
> that completely hung my system.
> I first blame it one the workqueue changes in 2.6.36-rc1 and/or its
> interaction with XFS (because in -rc5 a workqueue related problem in
> XFS got fixed), but Tejun Heo found out that a) it were exhausted
> mempools and not the workqueues and that the problems itself existed
> at least in 2.6.35 already. In
> http://marc.info/?l=linux-raid&m=128699402805191&w=2 I have describe a
> simpler testcase that I found after looking more closely into the
> mempools.
>
> Short story: swaping over RAID1 (drivers/md/raid1.c) can cause a
> system hang, because it is using too much of the fs_bio_set mempool
> from fs/bio.c.
>
>> I read the thread.
>> It seems Wu's patch solved deadlock problem by FS lock holding and too_many_isolated.
>> What is the problem remained in your case? unusable system by swapstorm?
>> If it is, I think it's expected behavior. Please see the below comment.
>> (If I don't catch your point, Please explain your problem.)
>
> I do not have a problem, if the system becomes unusable *during* a
> swapstorm, but it should recover. That is not the case in my system.
> With Wu's too_many_isolated-patch and Neil's patch agains raid1.c the
> system does no longer seem to be completely stuck (a swapoutrate of
> ~80kb every 20 seconds still happens), but I would still expect a
> better recovery time. (At that rate the recovery would probably take a
> few days...)


Now I understand your problem.
BTW, Wu's too_many_isolated patch should be merged regardless of this problem.
It's another story.

>
>>> [  437.481365] SysRq : Show Memory
>>> [  437.490003] Mem-Info:
>>> [  437.491357] Node 0 DMA per-cpu:
>>> [  437.500032] CPU    0: hi:    0, btch:   1 usd:   0
>>> [  437.500032] CPU    1: hi:    0, btch:   1 usd:   0
>>> [  437.500032] CPU    2: hi:    0, btch:   1 usd:   0
>>> [  437.500032] CPU    3: hi:    0, btch:   1 usd:   0
>>> [  437.500032] Node 0 DMA32 per-cpu:
>>> [  437.500032] CPU    0: hi:  186, btch:  31 usd: 138
>>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:  30
>>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
>>> [  437.500032] Node 1 DMA32 per-cpu:
>>> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
>>> [  437.500032] Node 1 Normal per-cpu:
>>> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:  25
>>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:  30
>>> [  437.500032] active_anon:2039 inactive_anon:985233 isolated_anon:682
>>> [  437.500032]  active_file:1667 inactive_file:1723 isolated_file:0
>>> [  437.500032]  unevictable:0 dirty:0 writeback:25387 unstable:0
>>> [  437.500032]  free:3471 slab_reclaimable:2840 slab_unreclaimable:6337
>>> [  437.500032]  mapped:1284 shmem:960501 pagetables:523 bounce:0
>>> [  437.500032] Node 0 DMA free:8008kB min:28kB low:32kB high:40kB
>>> active_anon:0kB inact
>>> ive_anon:7596kB active_file:12kB inactive_file:0kB unevictable:0kB
>>> isolated(anon):0kB i
>>> solated(file):0kB present:15768kB mlocked:0kB dirty:0kB
>>> writeback:404kB mapped:0kB shme
>>> m:7192kB slab_reclaimable:32kB slab_unreclaimable:304kB
>>> kernel_stack:0kB pagetables:0kB
>>>  unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:118
>>> all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 2004 2004 2004
>>
>> Node 0 DMA : free 8008K but lowmem_reserve 8012K(2004 pages)
>> So page allocator can't allocate the page unless preferred zone is DMA
>>
>>> [  437.500032] Node 0 DMA32 free:2980kB min:4036kB low:5044kB
>>> high:6052kB active_anon:2
>>> 844kB inactive_anon:1918424kB active_file:3428kB inactive_file:3780kB
>>> unevictable:0kB isolated(anon):1232kB isolated(file):0kB
>>> present:2052320kB mlocked:0kB dirty:0kB writeback:72016kB
>>> mapped:2232kB shmem:1847640kB slab_reclaimable:5444kB
>>> slab_unreclaimable:13508kB kernel_stack:744kB pagetables:864kB
>>> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
>>> all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 0 0 0
>>
>> Node 0 DMA32 : free 2980K but min 4036K.
>> Few file LRU compare to anon LRU
>
> In the testcase I fill a tmpfs as fast as I can with data from
> /dev/zero. So nearly everything gets swapped out and only the last
> written data from the tmpfs fills all RAM. (I have 4GB RAM, the tmpfs
> is limited to 6GB, 16 dd's are writing into it)
>
>> Normally, it could fail to allocate the page.
>> 'Normal' means caller doesn't request alloc_pages with __GFP_HIGH or !__GFP_WAIT
>> Generally many call sites don't pass gfp_flag with __GFP_HIGH|!__GFP_WAIT.
>>
>>> [  437.500032] Node 1 DMA32 free:2188kB min:3036kB low:3792kB
>>> high:4552kB active_anon:0kB inactive_anon:1555368kB active_file:0kB
>>> inactive_file:28kB unevictable:0kB isolated(anon):768kB
>>> isolated(file):0kB present:1544000kB mlocked:0kB dirty:0kB
>>> writeback:21160kB mapped:0kB shmem:1534960kB slab_reclaimable:3728kB
>>> slab_unreclaimable:7076kB kernel_stack:8kB pagetables:0kB unstable:0kB
>>> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 0 505 505
>>
>> Node 1 DMA32 free : 2188K min 3036K
>> It's a same situation with Node 0 DMA32.
>> Normally, it could fail to allocate the page.
>> Few file LRU compare to anon LRU
>>
>>
>>> [  437.500032] Node 1 Normal free:708kB min:1016kB low:1268kB
>>> high:1524kB active_anon:5312kB inactive_anon:459544kB
>>> active_file:3228kB inactive_file:3084kB unevictable:0kB
>>> isolated(anon):728kB isolated(file):0kB present:517120kB mlocked:0kB
>>> dirty:0kB writeback:7968kB mapped:2904kB shmem:452212kB
>>> slab_reclaimable:2156kB slab_unreclaimable:4460kB kernel_stack:200kB
>>> pagetables:1228kB unstable:0kB bounce:0kB writeback_tmp:0kB
>>> pages_scanned:9678 all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 0 0 0
>>
>> Node 1 Normal : free 708K min 1016K
>> Normally, it could fail to allocate the page.
>> Few file LRU compare to anon LRU
>>
>>> [  437.500032] Node 0 DMA: 2*4kB 2*8kB 1*16kB 3*32kB 3*64kB 4*128kB
>>> 4*256kB 2*512kB 1*1024kB 2*2048kB 0*4096kB = 8008kB
>>> [  437.500032] Node 0 DMA32: 27*4kB 15*8kB 8*16kB 8*32kB 7*64kB
>>> 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2980kB
>>> [  437.500032] Node 1 DMA32: 1*4kB 6*8kB 3*16kB 1*32kB 0*64kB 1*128kB
>>> 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2308kB
>>> [  437.500032] Node 1 Normal: 39*4kB 13*8kB 10*16kB 3*32kB 1*64kB
>>> 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 708kB
>>> [  437.500032] 989289 total pagecache pages
>>> [  437.500032] 25398 pages in swap cache
>>> [  437.500032] Swap cache stats: add 859204, delete 833806, find 28/39
>>> [  437.500032] Free swap  = 9865628kB
>>> [  437.500032] Total swap = 10000316kB
>>> [  437.500032] 1048575 pages RAM
>>> [  437.500032] 33809 pages reserved
>>> [  437.500032] 7996 pages shared
>>> [  437.500032] 1008521 pages non-shared
>>>
>> All zones don't have enough pages and don't have enough file lru pages.
>> So swapout is expected behavior, I think.
>> It means your workload exceeds your system available DRAM size.
>
> Yes, as intended. I wanted to create many writes to a RAID1 device
> under memory pressure to show/verify that the current use of mempools
> in raid1.c is buggered.
>
> That is not really any sane workload, that literally is just there to
> create a swapstorm and then see if the system survives it.
>
> The problem is, that the system is not surviving it: bio allocations
> fail in raid1.c and it falls back to the fs_bio_set mempool. But that
> mempool is only 2 entries big, because you should ever only use one of
> its entries at a time. But the current mainline code from raid1.c
> allocates one bio per drive before submitting it -> That bug is fixed
> my Neil's patch and I would have expected that to fix my hang. But it
> seems that there is an additional problem so that mempool still get
> emptied. And that means that no writeback happens any longer and so
> the kernel can't swapout and gets stuck.

That's what I missed; that explains why there are lots of writeback pages in the log.
Thanks for the kind explanation.

>
> I think the last mail from Jens Axboe is the correct answer, not
> increasing the fs_bio_set mempool size via BIO_POOL_SIZE.
>
> But should that go even further: Forbid any use of bio_alloc() and
> bio_clone() in any device drivers? Or at the very least in all device
> drivers that could be used for swapspace?

But as Jens pointed out, "So md and friends should really have a
pool per device, so that stacking will always work properly."
Shouldn't raid1.c have its own bio_set and clone from that, like
setup_clone in dm.c does?
Maybe I am wrong, since I don't have much knowledge about RAID.

>
> Torsten
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
@ 2010-10-20 23:31                             ` Minchan Kim
  0 siblings, 0 replies; 116+ messages in thread
From: Minchan Kim @ 2010-10-20 23:31 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Wu Fengguang, Neil Brown, Rik van Riel, Andrew Morton,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua, Jens Axboe

On Thu, Oct 21, 2010 at 12:35 AM, Torsten Kaiser
<just.for.lkml@googlemail.com> wrote:
> On Wed, Oct 20, 2010 at 4:23 PM, Minchan Kim <minchan.kim@gmail.com> wrote:

< SNIP>

>> What is your problem?(Sorry if you explained it several time).
>
> The original problem was that using too many gcc's caused a swapstorm
> that completely hung my system.
> I first blame it one the workqueue changes in 2.6.36-rc1 and/or its
> interaction with XFS (because in -rc5 a workqueue related problem in
> XFS got fixed), but Tejun Heo found out that a) it were exhausted
> mempools and not the workqueues and that the problems itself existed
> at least in 2.6.35 already. In
> http://marc.info/?l=linux-raid&m=128699402805191&w=2 I have describe a
> simpler testcase that I found after looking more closely into the
> mempools.
>
> Short story: swaping over RAID1 (drivers/md/raid1.c) can cause a
> system hang, because it is using too much of the fs_bio_set mempool
> from fs/bio.c.
>
>> I read the thread.
>> It seems Wu's patch solved deadlock problem by FS lock holding and too_many_isolated.
>> What is the problem remained in your case? unusable system by swapstorm?
>> If it is, I think it's expected behavior. Please see the below comment.
>> (If I don't catch your point, Please explain your problem.)
>
> I do not have a problem, if the system becomes unusable *during* a
> swapstorm, but it should recover. That is not the case in my system.
> With Wu's too_many_isolated-patch and Neil's patch agains raid1.c the
> system does no longer seem to be completely stuck (a swapoutrate of
> ~80kb every 20 seconds still happens), but I would still expect a
> better recovery time. (At that rate the recovery would probably take a
> few days...)


I got understand your problem.
BTW, Wu's too_many_isolated patch should merge regardless of this problem.
It's another story.

>
>>> [  437.481365] SysRq : Show Memory
>>> [  437.490003] Mem-Info:
>>> [  437.491357] Node 0 DMA per-cpu:
>>> [  437.500032] CPU    0: hi:    0, btch:   1 usd:   0
>>> [  437.500032] CPU    1: hi:    0, btch:   1 usd:   0
>>> [  437.500032] CPU    2: hi:    0, btch:   1 usd:   0
>>> [  437.500032] CPU    3: hi:    0, btch:   1 usd:   0
>>> [  437.500032] Node 0 DMA32 per-cpu:
>>> [  437.500032] CPU    0: hi:  186, btch:  31 usd: 138
>>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:  30
>>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
>>> [  437.500032] Node 1 DMA32 per-cpu:
>>> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
>>> [  437.500032] Node 1 Normal per-cpu:
>>> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:  25
>>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:  30
>>> [  437.500032] active_anon:2039 inactive_anon:985233 isolated_anon:682
>>> [  437.500032]  active_file:1667 inactive_file:1723 isolated_file:0
>>> [  437.500032]  unevictable:0 dirty:0 writeback:25387 unstable:0
>>> [  437.500032]  free:3471 slab_reclaimable:2840 slab_unreclaimable:6337
>>> [  437.500032]  mapped:1284 shmem:960501 pagetables:523 bounce:0
>>> [  437.500032] Node 0 DMA free:8008kB min:28kB low:32kB high:40kB
>>> active_anon:0kB inact
>>> ive_anon:7596kB active_file:12kB inactive_file:0kB unevictable:0kB
>>> isolated(anon):0kB i
>>> solated(file):0kB present:15768kB mlocked:0kB dirty:0kB
>>> writeback:404kB mapped:0kB shme
>>> m:7192kB slab_reclaimable:32kB slab_unreclaimable:304kB
>>> kernel_stack:0kB pagetables:0kB
>>>  unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:118
>>> all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 2004 2004 2004
>>
>> Node 0 DMA : free 8008K but lowmem_reserve 8012K(2004 pages)
>> So page allocator can't allocate the page unless preferred zone is DMA
>>
>>> [  437.500032] Node 0 DMA32 free:2980kB min:4036kB low:5044kB
>>> high:6052kB active_anon:2
>>> 844kB inactive_anon:1918424kB active_file:3428kB inactive_file:3780kB
>>> unevictable:0kB isolated(anon):1232kB isolated(file):0kB
>>> present:2052320kB mlocked:0kB dirty:0kB writeback:72016kB
>>> mapped:2232kB shmem:1847640kB slab_reclaimable:5444kB
>>> slab_unreclaimable:13508kB kernel_stack:744kB pagetables:864kB
>>> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
>>> all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 0 0 0
>>
>> Node 0 DMA32 : free 2980K but min 4036K.
>> Few file LRU compare to anon LRU
>
> In the testcase I fill a tmpfs as fast as I can with data from
> /dev/zero. So nearly everything gets swapped out and only the last
> written data from the tmpfs fills all RAM. (I have 4GB RAM, the tmpfs
> is limited to 6GB, 16 dd's are writing into it)
>
>> Normally, it could fail to allocate the page.
>> 'Normal' means caller doesn't request alloc_pages with __GFP_HIGH or !__GFP_WAIT
>> Generally many call sites don't pass gfp_flag with __GFP_HIGH|!__GFP_WAIT.
>>
>>> [  437.500032] Node 1 DMA32 free:2188kB min:3036kB low:3792kB
>>> high:4552kB active_anon:0kB inactive_anon:1555368kB active_file:0kB
>>> inactive_file:28kB unevictable:0kB isolated(anon):768kB
>>> isolated(file):0kB present:1544000kB mlocked:0kB dirty:0kB
>>> writeback:21160kB mapped:0kB shmem:1534960kB slab_reclaimable:3728kB
>>> slab_unreclaimable:7076kB kernel_stack:8kB pagetables:0kB unstable:0kB
>>> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 0 505 505
>>
>> Node 1 DMA32 free : 2188K min 3036K
>> It's a same situation with Node 0 DMA32.
>> Normally, it could fail to allocate the page.
>> Few file LRU compare to anon LRU
>>
>>
>>> [  437.500032] Node 1 Normal free:708kB min:1016kB low:1268kB
>>> high:1524kB active_anon:5312kB inactive_anon:459544kB
>>> active_file:3228kB inactive_file:3084kB unevictable:0kB
>>> isolated(anon):728kB isolated(file):0kB present:517120kB mlocked:0kB
>>> dirty:0kB writeback:7968kB mapped:2904kB shmem:452212kB
>>> slab_reclaimable:2156kB slab_unreclaimable:4460kB kernel_stack:200kB
>>> pagetables:1228kB unstable:0kB bounce:0kB writeback_tmp:0kB
>>> pages_scanned:9678 all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 0 0 0
>>
>> Node 1 Normal : free 708K min 1016K
>> Normally, it could fail to allocate the page.
>> Few file LRU compare to anon LRU
>>
>>> [  437.500032] Node 0 DMA: 2*4kB 2*8kB 1*16kB 3*32kB 3*64kB 4*128kB
>>> 4*256kB 2*512kB 1*1024kB 2*2048kB 0*4096kB = 8008kB
>>> [  437.500032] Node 0 DMA32: 27*4kB 15*8kB 8*16kB 8*32kB 7*64kB
>>> 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2980kB
>>> [  437.500032] Node 1 DMA32: 1*4kB 6*8kB 3*16kB 1*32kB 0*64kB 1*128kB
>>> 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2308kB
>>> [  437.500032] Node 1 Normal: 39*4kB 13*8kB 10*16kB 3*32kB 1*64kB
>>> 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 708kB
>>> [  437.500032] 989289 total pagecache pages
>>> [  437.500032] 25398 pages in swap cache
>>> [  437.500032] Swap cache stats: add 859204, delete 833806, find 28/39
>>> [  437.500032] Free swap  = 9865628kB
>>> [  437.500032] Total swap = 10000316kB
>>> [  437.500032] 1048575 pages RAM
>>> [  437.500032] 33809 pages reserved
>>> [  437.500032] 7996 pages shared
>>> [  437.500032] 1008521 pages non-shared
>>>
>> None of the zones has enough free pages, and none has enough file LRU pages.
>> So swapout is the expected behavior, I think.
>> It means your workload exceeds your system's available DRAM size.
>
> Yes, as intended. I wanted to create many writes to a RAID1 device
> under memory pressure to show/verify that the current use of mempools
> in raid1.c is buggered.
>
> That is not really a sane workload; it is literally just there to
> create a swapstorm and then see if the system survives it.
>
> The problem is that the system is not surviving it: bio allocations
> fail in raid1.c and it falls back to the fs_bio_set mempool. But that
> mempool is only 2 entries big, because you should only ever use one of
> its entries at a time. But the current mainline code in raid1.c
> allocates one bio per drive before submitting any -> That bug is fixed
> by Neil's patch and I would have expected that to fix my hang. But it
> seems that there is an additional problem, so the mempool still gets
> emptied. And that means that no writeback happens any longer, so the
> kernel can't swap out and gets stuck.

That's what I missed; it explains why there are lots of writeback pages
in the log. Thanks for the kind explanation.

>
> I think the last mail from Jens Axboe is the correct answer, not
> increasing the fs_bio_set mempool size via BIO_POOL_SIZE.
>
> But should that go even further: Forbid any use of bio_alloc() and
> bio_clone() in any device drivers? Or at the very least in all device
> drivers that could be used for swapspace?

But as Jens pointed out, "So md and friends should really have a
pool per device, so that stacking will always work properly."
Shouldn't raid1.c have its own bio_set and allocate its clones from it,
the way setup_clone() does in dm.c?
Maybe I am wrong since I don't have much knowledge about RAID.
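
Just to illustrate the shape of what I mean (an untested sketch; the
conf->bio_set field and the pool size are made up here, it only mirrors
how dm keeps a per-device bio_set for its clones):

	/* at array setup time, e.g. in setup_conf() */
	conf->bio_set = bioset_create(2, 0);	/* small private reserve */
	if (!conf->bio_set)
		goto abort;

	/* in make_request(), instead of bio_clone(bio, GFP_NOIO) */
	mbio = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, conf->bio_set);
	__bio_clone(mbio, bio);

Then raid1 would never compete with other users of the global
fs_bio_set reserve.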

>
> Torsten
>



-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-20 13:03                           ` Jens Axboe
@ 2010-10-22  5:37                             ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-22  5:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: KOSAKI Motohiro, Torsten Kaiser, Neil Brown, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

On Wed, Oct 20, 2010 at 09:03:04PM +0800, Jens Axboe wrote:
> On 2010-10-20 11:27, Wu Fengguang wrote:
> > On Wed, Oct 20, 2010 at 03:05:56PM +0800, KOSAKI Motohiro wrote:
> >>> On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
> >>>> On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
> >>>> <just.for.lkml@googlemail.com> wrote:
> >>>>> On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@suse.de> wrote:
> >>>>>> Yes, thanks for the report.
> >>>>>> This is a real bug exactly as you describe.
> >>>>>>
> >>>>>> This is how I think I will fix it, though it needs a bit of review and
> >>>>>> testing before I can be certain.
> >>>>>> Also I need to check raid10 etc to see if they can suffer too.
> >>>>>>
> >>>>>> If you can test it I would really appreciate it.
> >>>>>
> >>>>> I did test it, but while it seemed to fix the deadlock, the system
> >>>>> still got unusable.
> >>>>> The still running "vmstat 1" showed that the swapout was still
> >>>>> progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
> >>>>>
> >>>>> I also tried to additionally add Wu's patch:
> >>>>> --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
> >>>>> +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
> >>>>> @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
> >>>>>               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> >>>>>       }
> >>>>>
> >>>>> +       /*
> >>>>> +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> >>>>> +        * they won't get blocked by normal ones and form circular deadlock.
> >>>>> +        */
> >>>>> +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> >>>>> +               inactive >>= 3;
> >>>>> +
> >>>>>       return isolated > inactive;
> >>>>>
> >>>>> Either it did help somewhat, or I was more lucky on my second try, but
> >>>>> this time I needed ~5 tries instead of only 2 to get the system mostly
> >>>>> stuck again. On the testrun with Wu's patch the writeout pattern was
> >>>>> more stable, a burst of ~80kb each 20 seconds. But I would suspect
> >>>>> that the size of the burst is rather random.
> >>>>>
> >>>>> I do have a complete SysRq+T dump from the first run, I can send that
> >>>>> to anyone how wants it.
> >>>>> (It's 190k so I don't want not spam it to the list)
> >>>>
> >>>> Is this call trace from the SysRq+T violation the rule to only
> >>>> allocate one bio from bio_alloc() until its submitted?
> >>>>
> >>>> [  549.700038] Call Trace:
> >>>> [  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
> >>>> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> >>>> [  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
> >>>> [  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
> >>>> [  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
> >>>> [  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
> >>>> [  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
> >>>> [  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
> >>>> [  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
> >>>> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> >>>> [  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
> >>>> [  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
> >>>> [  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
> >>>> [  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
> >>>> [  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
> >>>> [  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
> >>>> [  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
> >>>> [  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
> >>>> [  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
> >>>> [  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
> >>>> [  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
> >>>> [  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
> >>>> [  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
> >>>> [  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
> >>>> [  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
> >>>> [  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
> >>>> [  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
> >>>> [  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
> >>>> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> >>>> [  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
> >>>> [  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
> >>>> [  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
> >>>> [  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
> >>>> [  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
> >>>> [  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
> >>>> [  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
> >>>> ffffffff81073c59
> >>>> [  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
> >>>> ffff88011e125fd8
> >>>> [  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
> >>>> ffff88011e125fd8
> >>>>
> >>>> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
> >>>> bio. That bio is the submitted, but the submit path seems to get into
> >>>> make_request from raid1.c and that allocates a second bio from
> >>>> bio_alloc() via bio_clone().
> >>>>
> >>>> I am seeing this pattern (swap_writepage calling
> >>>> md_make_request/make_request and then getting stuck in mempool_alloc)
> >>>> more than 5 times in the SysRq+T output...
> >>>
> >>> I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
> >>> inside mempool_alloc(), which can be fixed by this patch.
> >>>
> >>> Thanks,
> >>> Fengguang
> >>> ---
> >>>
> >>> concurrent direct page reclaim problem
> >>>
> >>>   __GFP_NORETRY page allocations may fail when there are many concurrent page
> >>>   allocating tasks, but not necessary in real short of memory. The root cause
> >>>   is, tasks will first run direct page reclaim to free some pages from the LRU
> >>>   lists and put them to the per-cpu page lists and the buddy system, and then
> >>>   try to get a free page from there.  However the free pages reclaimed by this
> >>>   task may be consumed by other tasks when the direct reclaim task is able to
> >>>   get the free page for itself.
> >>>
> >>>   Let's retry it a bit harder.
> >>>
> >>> --- linux-next.orig/mm/page_alloc.c	2010-10-20 13:44:50.000000000 +0800
> >>> +++ linux-next/mm/page_alloc.c	2010-10-20 13:50:54.000000000 +0800
> >>> @@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
> >>>  				unsigned long pages_reclaimed)
> >>>  {
> >>>  	/* Do not loop if specifically requested */
> >>> -	if (gfp_mask & __GFP_NORETRY)
> >>> +	if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
> >>>  		return 0;
> >>>  
> >>>  	/*
> >>
> >> SLUB usually try high order allocation with __GFP_NORETRY at first. In
> >> other words, It strongly depend on __GFP_NORETRY don't any retry. I'm
> >> worry this...
> > 
> > Right. I noticed that too. Hopefully the "limited" retry won't impact
> > it too much. That said, we do need a better solution than such hacks.
> > 
> >> And, in this case, stucked tasks have PF_MEMALLOC. allocation with PF_MEMALLOC
> >> failure mean this zone have zero memory purely. So, retrying don't solve anything.
> > 
> > The zone has no free (buddy) memory, but has plenty of reclaimable pages.
> > The concurrent page reclaimers may steal pages reclaimed by this task
> > from time to time, but not always. So retry reclaiming will help.
> > 
> >> And I think the root cause is in another.
> >>
> >> bio_clone() use fs_bio_set internally.
> >>
> >> 	struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> >> 	{
> >> 	        struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
> >> 	...
> >>
> >> and fs_bio_set is initialized very small pool size.
> >>
> >> 	#define BIO_POOL_SIZE 2
> >> 	static int __init init_bio(void)
> >> 	{
> >> 		..
> >> 	        fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
> > 
> > Agreed. BIO_POOL_SIZE=2 is too small to be deadlock free.
> > 
> >> So, I think raid1.c need to use their own bioset instead fs_bio_set.
> >> otherwise, bio pool exshost can happen very easily.
> > 
> > That would fix the deadlock, but not enough for good IO throughput
> > when multiple CPUs are trying to submit IO. Increasing BIO_POOL_SIZE
> > to a large value should help fix both the deadlock and IO throughput.
> > 
> >> But I'm not sure. I'm not IO expert.
> > 
> > [add CC to Jens]
> 
> We surely need 1 set aside for each level of that stack that will
> potentially consume one. 1 should be enough for the generic pool, and
> then clones will use a separate pool. So md and friends should really
> have a pool per device, so that stacking will always work properly.

Agreed for the deadlock problem.

> There should be no throughput concerns, it should purely be a safe guard
> measure to prevent us deadlocking when doing IO for reclaim.

It's easy to verify whether the minimal size will have negative
impacts on IO throughput. In Torsten's case, increase BIO_POOL_SIZE
by one and check how it performs.

My worry is that the pool->alloc(__GFP_NORETRY) call in mempool_alloc()
may fail _frequently_ when there are concurrent reclaimers, so the
mempool will be exhausted from time to time. Although the mempool
quickly gets out of the exhausted state whenever some in-flight bio
is freed, any unlucky task that hits an allocation failure in that
small window will go into the 5-second sleep in mempool_alloc().
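
For reference, the slow path at the end of mempool_alloc() looks
roughly like this (paraphrased from memory, not the exact source):

	element = pool->alloc(gfp_temp, pool->pool_data); /* __GFP_NORETRY */
	if (element)
		return element;

	/* then try the private reserve; if pool->curr_nr > 0, take one */
	...

	/* reserve is empty too: wait for someone to return an element */
	init_wait(&wait);
	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
	smp_mb();
	if (!pool->curr_nr)
		io_schedule_timeout(5 * HZ);	/* the 5-second sleep */
	finish_wait(&pool->wait, &wait);
	goto repeat_alloc;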

Oops, the sleep doesn't need to be UNINTERRUPTIBLE.

---
Subject: mempool: the mempool_alloc() sleep can be interrupted

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/mempool.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/mempool.c	2010-10-22 13:35:17.000000000 +0800
+++ linux-next/mm/mempool.c	2010-10-22 13:35:22.000000000 +0800
@@ -235,7 +235,7 @@ repeat_alloc:
 	/* Now start performing page reclaim */
 	gfp_temp = gfp_mask;
 	init_wait(&wait);
-	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
+	prepare_to_wait(&pool->wait, &wait, TASK_INTERRUPTIBLE);
 	smp_mb();
 	if (!pool->curr_nr) {
 		/*

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-22  5:37                             ` Wu Fengguang
@ 2010-10-22  8:07                               ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-22  8:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: KOSAKI Motohiro, Torsten Kaiser, Neil Brown, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

> > We surely need 1 set aside for each level of that stack that will
> > potentially consume one. 1 should be enough for the generic pool, and
> > then clones will use a separate pool. So md and friends should really
> > have a pool per device, so that stacking will always work properly.
> 
> Agreed for the deadlock problem.
> 
> > There should be no throughput concerns, it should purely be a safe guard
> > measure to prevent us deadlocking when doing IO for reclaim.
> 
> It's easy to verify whether the minimal size will have negative
> impacts on IO throughput. In Torsten's case, increase BIO_POOL_SIZE
> by one and check how it performs.

Sorry, it seems that simply increasing BIO_POOL_SIZE is not enough to
fix the possible deadlocks; we need to add new mempool(s). With
BIO_POOL_SIZE=2, if two concurrent reclaimers each take one
reservation, they will deadlock each other when trying to take the
next bio at the raid1 level.
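
To spell out the interleaving I have in mind (assuming both tasks'
__GFP_NORETRY slab allocations fail, as in Torsten's traces):

	/*
	 *  task A: swap_writepage() -> bio_alloc()       takes reserve #1
	 *  task B: swap_writepage() -> bio_alloc()       takes reserve #2
	 *  task A: raid1 make_request() -> bio_clone()   pool empty, sleeps
	 *  task B: raid1 make_request() -> bio_clone()   pool empty, sleeps
	 *
	 *  Neither of the first two bios is ever submitted or freed, so
	 *  nothing goes back into fs_bio_set and nobody is woken up.
	 */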

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-22  8:07                               ` Wu Fengguang
@ 2010-10-22  8:09                                 ` Jens Axboe
  -1 siblings, 0 replies; 116+ messages in thread
From: Jens Axboe @ 2010-10-22  8:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KOSAKI Motohiro, Torsten Kaiser, Neil Brown, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

On 2010-10-22 10:07, Wu Fengguang wrote:
>>> We surely need 1 set aside for each level of that stack that will
>>> potentially consume one. 1 should be enough for the generic pool, and
>>> then clones will use a separate pool. So md and friends should really
>>> have a pool per device, so that stacking will always work properly.
>>
>> Agreed for the deadlock problem.
>>
>>> There should be no throughput concerns, it should purely be a safe guard
>>> measure to prevent us deadlocking when doing IO for reclaim.
>>
>> It's easy to verify whether the minimal size will have negative
>> impacts on IO throughput. In Torsten's case, increase BIO_POOL_SIZE
>> by one and check how it performs.
> 
> Sorry it seems simply increasing BIO_POOL_SIZE is not enough to fix
> possible deadlocks. We need adding new mempool(s). Because when there
> BIO_POOL_SIZE=2 and there are two concurrent reclaimers each take 1
> reservation, they will deadlock each other when trying to take the
> next bio at the raid1 level.

Yes, plus it's not a practical solution since you don't know how deep
the stack is. As I wrote in the initial email, each consumer needs its
own private mempool (and just 1 entry should suffice).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-22  8:09                                 ` Jens Axboe
@ 2010-10-24 16:52                                   ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-24 16:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: KOSAKI Motohiro, Torsten Kaiser, Neil Brown, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

On Fri, Oct 22, 2010 at 04:09:21PM +0800, Jens Axboe wrote:
> On 2010-10-22 10:07, Wu Fengguang wrote:
> >>> We surely need 1 set aside for each level of that stack that will
> >>> potentially consume one. 1 should be enough for the generic pool, and
> >>> then clones will use a separate pool. So md and friends should really
> >>> have a pool per device, so that stacking will always work properly.
> >>
> >> Agreed for the deadlock problem.
> >>
> >>> There should be no throughput concerns, it should purely be a safe guard
> >>> measure to prevent us deadlocking when doing IO for reclaim.
> >>
> >> It's easy to verify whether the minimal size will have negative
> >> impacts on IO throughput. In Torsten's case, increase BIO_POOL_SIZE
> >> by one and check how it performs.
> > 
> > Sorry it seems simply increasing BIO_POOL_SIZE is not enough to fix
> > possible deadlocks. We need adding new mempool(s). Because when there
> > BIO_POOL_SIZE=2 and there are two concurrent reclaimers each take 1
> > reservation, they will deadlock each other when trying to take the
> > next bio at the raid1 level.
> 
> Yes, plus it's not a practical solution since you don't know how deep
> the stack is. As I wrote in the initial email, each consumer needs it's
> own private mempool (and just 1 entry should suffice).

You are right. The scratch patch below adds minimal mempool code for raid1.
It passed a simple stress test of resync + 3 dd writers. Although write
throughput is rather slow in my qemu guest, I don't observe any
temporary or permanent stalls.

 drivers/md/raid1.c  |   32 ++++++++++++++++++++++++++++----
 drivers/md/raid1.h  |    2 ++
 fs/bio.c            |   31 +++++++++++++++++++++----------
 include/linux/bio.h |    2 ++
 4 files changed, 53 insertions(+), 14 deletions(-)

--- linux-next.orig/drivers/md/raid1.c	2010-10-25 00:02:40.000000000 +0800
+++ linux-next/drivers/md/raid1.c	2010-10-25 00:28:16.000000000 +0800
@@ -76,6 +76,14 @@ static void r1bio_pool_free(void *r1_bio
 	kfree(r1_bio);
 }
 
+static void r1_bio_destructor(struct bio *bio)
+{
+	r1bio_t *r1_bio = bio->bi_private;
+	conf_t *conf = r1_bio->mddev->private;
+
+	bio_free(bio, conf->r1_bio_set);
+}
+
 #define RESYNC_BLOCK_SIZE (64*1024)
 //#define RESYNC_BLOCK_SIZE PAGE_SIZE
 #define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
@@ -85,6 +93,7 @@ static void r1bio_pool_free(void *r1_bio
 static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	struct pool_info *pi = data;
+	conf_t *conf = pi->mddev->private;
 	struct page *page;
 	r1bio_t *r1_bio;
 	struct bio *bio;
@@ -100,7 +109,8 @@ static void * r1buf_pool_alloc(gfp_t gfp
 	 * Allocate bios : 1 for reading, n-1 for writing
 	 */
 	for (j = pi->raid_disks ; j-- ; ) {
-		bio = bio_alloc(gfp_flags, RESYNC_PAGES);
+		bio = bio_alloc_bioset(gfp_flags, RESYNC_PAGES,
+				       conf->r1_bio_set);
 		if (!bio)
 			goto out_free_bio;
 		r1_bio->bios[j] = bio;
@@ -386,6 +396,10 @@ static void raid1_end_write_request(stru
 				!test_bit(R1BIO_Degraded, &r1_bio->state),
 				behind);
 		md_write_end(r1_bio->mddev);
+		if (to_put) {
+			bio_put(to_put);
+			to_put = NULL;
+		}
 		raid_end_bio_io(r1_bio);
 	}
 
@@ -851,7 +865,7 @@ static int make_request(mddev_t *mddev, 
 		}
 		r1_bio->read_disk = rdisk;
 
-		read_bio = bio_clone(bio, GFP_NOIO);
+		read_bio = bio_clone_bioset(bio, GFP_NOIO, conf->r1_bio_set);
 
 		r1_bio->bios[rdisk] = read_bio;
 
@@ -946,7 +960,7 @@ static int make_request(mddev_t *mddev, 
 		if (!r1_bio->bios[i])
 			continue;
 
-		mbio = bio_clone(bio, GFP_NOIO);
+		mbio = bio_clone_bioset(bio, GFP_NOIO, conf->r1_bio_set);
 		r1_bio->bios[i] = mbio;
 
 		mbio->bi_sector	= r1_bio->sector + conf->mirrors[i].rdev->data_offset;
@@ -1646,7 +1660,9 @@ static void raid1d(mddev_t *mddev)
 					mddev->ro ? IO_BLOCKED : NULL;
 				r1_bio->read_disk = disk;
 				bio_put(bio);
-				bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
+				bio = bio_clone_bioset(r1_bio->master_bio,
+						       GFP_NOIO,
+						       conf->r1_bio_set);
 				r1_bio->bios[r1_bio->read_disk] = bio;
 				rdev = conf->mirrors[disk].rdev;
 				if (printk_ratelimit())
@@ -1948,6 +1964,10 @@ static conf_t *setup_conf(mddev_t *mddev
 					  conf->poolinfo);
 	if (!conf->r1bio_pool)
 		goto abort;
+	conf->r1_bio_set = bioset_create(mddev->raid_disks * 2, 0);
+	if (!conf->r1_bio_set)
+		goto abort;
+	conf->r1_bio_set->bio_destructor = r1_bio_destructor;
 
 	conf->poolinfo->mddev = mddev;
 
@@ -2012,6 +2032,8 @@ static conf_t *setup_conf(mddev_t *mddev
 	if (conf) {
 		if (conf->r1bio_pool)
 			mempool_destroy(conf->r1bio_pool);
+		if (conf->r1_bio_set)
+			bioset_free(conf->r1_bio_set);
 		kfree(conf->mirrors);
 		safe_put_page(conf->tmppage);
 		kfree(conf->poolinfo);
@@ -2121,6 +2143,8 @@ static int stop(mddev_t *mddev)
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
 	if (conf->r1bio_pool)
 		mempool_destroy(conf->r1bio_pool);
+	if (conf->r1_bio_set)
+		bioset_free(conf->r1_bio_set);
 	kfree(conf->mirrors);
 	kfree(conf->poolinfo);
 	kfree(conf);
--- linux-next.orig/fs/bio.c	2010-10-25 00:02:39.000000000 +0800
+++ linux-next/fs/bio.c	2010-10-25 00:03:37.000000000 +0800
@@ -306,6 +306,7 @@ out_set:
 	bio->bi_flags |= idx << BIO_POOL_OFFSET;
 	bio->bi_max_vecs = nr_iovecs;
 	bio->bi_io_vec = bvl;
+	bio->bi_destructor = bs->bio_destructor;
 	return bio;
 
 err_free:
@@ -340,12 +341,7 @@ static void bio_fs_destructor(struct bio
  */
 struct bio *bio_alloc(gfp_t gfp_mask, int nr_iovecs)
 {
-	struct bio *bio = bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
-
-	if (bio)
-		bio->bi_destructor = bio_fs_destructor;
-
-	return bio;
+	return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
 }
 EXPORT_SYMBOL(bio_alloc);
 
@@ -460,20 +456,21 @@ void __bio_clone(struct bio *bio, struct
 EXPORT_SYMBOL(__bio_clone);
 
 /**
- *	bio_clone	-	clone a bio
+ *	bio_clone_bioset	-	clone a bio
  *	@bio: bio to clone
  *	@gfp_mask: allocation priority
+ *	@bs: bio_set to allocate from
  *
  * 	Like __bio_clone, only also allocates the returned bio
  */
-struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
+struct bio *
+bio_clone_bioset(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 {
-	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
+	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, bs);
 
 	if (!b)
 		return NULL;
 
-	b->bi_destructor = bio_fs_destructor;
 	__bio_clone(b, bio);
 
 	if (bio_integrity(bio)) {
@@ -489,6 +486,19 @@ struct bio *bio_clone(struct bio *bio, g
 
 	return b;
 }
+EXPORT_SYMBOL(bio_clone_bioset);
+
+/**
+ *	bio_clone	-	clone a bio
+ *	@bio: bio to clone
+ *	@gfp_mask: allocation priority
+ *
+ *	Like __bio_clone, only also allocates the returned bio
+ */
+struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
+{
+	return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
+}
 EXPORT_SYMBOL(bio_clone);
 
 /**
@@ -1664,6 +1674,7 @@ static int __init init_bio(void)
 	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
 	if (!fs_bio_set)
 		panic("bio: can't allocate bios\n");
+	fs_bio_set->bio_destructor = bio_fs_destructor;
 
 	bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
 						     sizeof(struct bio_pair));
--- linux-next.orig/include/linux/bio.h	2010-10-25 00:02:40.000000000 +0800
+++ linux-next/include/linux/bio.h	2010-10-25 00:03:37.000000000 +0800
@@ -227,6 +227,7 @@ extern int bio_phys_segments(struct requ
 
 extern void __bio_clone(struct bio *, struct bio *);
 extern struct bio *bio_clone(struct bio *, gfp_t);
+extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *);
 
 extern void bio_init(struct bio *);
 
@@ -299,6 +300,7 @@ struct bio_set {
 	mempool_t *bio_integrity_pool;
 #endif
 	mempool_t *bvec_pool;
+	bio_destructor_t	*bio_destructor;
 };
 
 struct biovec_slab {
--- linux-next.orig/drivers/md/raid1.h	2010-10-25 00:02:40.000000000 +0800
+++ linux-next/drivers/md/raid1.h	2010-10-25 00:03:37.000000000 +0800
@@ -60,6 +60,8 @@ struct r1_private_data_s {
 	mempool_t *r1bio_pool;
 	mempool_t *r1buf_pool;
 
+	struct bio_set *r1_bio_set;
+
 	/* When taking over an array from a different personality, we store
 	 * the new thread here until we fully activate the array.
 	 */

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-24 16:52                                   ` Wu Fengguang
@ 2010-10-25  6:40                                     ` Neil Brown
  -1 siblings, 0 replies; 116+ messages in thread
From: Neil Brown @ 2010-10-25  6:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, KOSAKI Motohiro, Torsten Kaiser, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

On Mon, 25 Oct 2010 00:52:34 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Fri, Oct 22, 2010 at 04:09:21PM +0800, Jens Axboe wrote:
> > On 2010-10-22 10:07, Wu Fengguang wrote:
> > >>> We surely need 1 set aside for each level of that stack that will
> > >>> potentially consume one. 1 should be enough for the generic pool, and
> > >>> then clones will use a separate pool. So md and friends should really
> > >>> have a pool per device, so that stacking will always work properly.
> > >>
> > >> Agreed for the deadlock problem.
> > >>
> > >>> There should be no throughput concerns, it should purely be a safe guard
> > >>> measure to prevent us deadlocking when doing IO for reclaim.
> > >>
> > >> It's easy to verify whether the minimal size will have negative
> > >> impacts on IO throughput. In Torsten's case, increase BIO_POOL_SIZE
> > >> by one and check how it performs.
> > > 
> > > Sorry it seems simply increasing BIO_POOL_SIZE is not enough to fix
> > > possible deadlocks. We need adding new mempool(s). Because when there
> > > BIO_POOL_SIZE=2 and there are two concurrent reclaimers each take 1
> > > reservation, they will deadlock each other when trying to take the
> > > next bio at the raid1 level.
> > 
> > Yes, plus it's not a practical solution since you don't know how deep
> > the stack is. As I wrote in the initial email, each consumer needs it's
> > own private mempool (and just 1 entry should suffice).
> 
> You are right. The below scratch patch adds minimal mempool code for raid1.
> It passed simple stress test of resync + 3 dd writers. Although write
> throughput is rather slow in my qemu, I don't observe any
> temporary/permanent stuck ups.

Hi,
  thanks for the patch.  I'll make a few changes to what I finally apply -
  for example we don't really need mempools in r1buf_pool_alloc() as that isn't
  on the writeout path - so I'll tidy that up first.

  Also I'll avoid making changes to fs/bio.c at first.  It may still be a
  good idea to have a bio_clone_bioset, but that should be a separate patch -
  there are at least 3 places that would use it.

Thanks - I'll try to get this into the current merge window.

NeilBrown


> 
>  drivers/md/raid1.c  |   32 ++++++++++++++++++++++++++++----
>  drivers/md/raid1.h  |    2 ++
>  fs/bio.c            |   31 +++++++++++++++++++++----------
>  include/linux/bio.h |    2 ++
>  4 files changed, 53 insertions(+), 14 deletions(-)
> 
> --- linux-next.orig/drivers/md/raid1.c	2010-10-25 00:02:40.000000000 +0800
> +++ linux-next/drivers/md/raid1.c	2010-10-25 00:28:16.000000000 +0800
> @@ -76,6 +76,14 @@ static void r1bio_pool_free(void *r1_bio
>  	kfree(r1_bio);
>  }
>  
> +static void r1_bio_destructor(struct bio *bio)
> +{
> +	r1bio_t *r1_bio = bio->bi_private;
> +	conf_t *conf = r1_bio->mddev->private;
> +
> +	bio_free(bio, conf->r1_bio_set);
> +}
> +
>  #define RESYNC_BLOCK_SIZE (64*1024)
>  //#define RESYNC_BLOCK_SIZE PAGE_SIZE
>  #define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
> @@ -85,6 +93,7 @@ static void r1bio_pool_free(void *r1_bio
>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>  {
>  	struct pool_info *pi = data;
> +	conf_t *conf = pi->mddev->private;
>  	struct page *page;
>  	r1bio_t *r1_bio;
>  	struct bio *bio;
> @@ -100,7 +109,8 @@ static void * r1buf_pool_alloc(gfp_t gfp
>  	 * Allocate bios : 1 for reading, n-1 for writing
>  	 */
>  	for (j = pi->raid_disks ; j-- ; ) {
> -		bio = bio_alloc(gfp_flags, RESYNC_PAGES);
> +		bio = bio_alloc_bioset(gfp_flags, RESYNC_PAGES,
> +				       conf->r1_bio_set);
>  		if (!bio)
>  			goto out_free_bio;
>  		r1_bio->bios[j] = bio;
> @@ -386,6 +396,10 @@ static void raid1_end_write_request(stru
>  				!test_bit(R1BIO_Degraded, &r1_bio->state),
>  				behind);
>  		md_write_end(r1_bio->mddev);
> +		if (to_put) {
> +			bio_put(to_put);
> +			to_put = NULL;
> +		}
>  		raid_end_bio_io(r1_bio);
>  	}
>  
> @@ -851,7 +865,7 @@ static int make_request(mddev_t *mddev, 
>  		}
>  		r1_bio->read_disk = rdisk;
>  
> -		read_bio = bio_clone(bio, GFP_NOIO);
> +		read_bio = bio_clone_bioset(bio, GFP_NOIO, conf->r1_bio_set);
>  
>  		r1_bio->bios[rdisk] = read_bio;
>  
> @@ -946,7 +960,7 @@ static int make_request(mddev_t *mddev, 
>  		if (!r1_bio->bios[i])
>  			continue;
>  
> -		mbio = bio_clone(bio, GFP_NOIO);
> +		mbio = bio_clone_bioset(bio, GFP_NOIO, conf->r1_bio_set);
>  		r1_bio->bios[i] = mbio;
>  
>  		mbio->bi_sector	= r1_bio->sector + conf->mirrors[i].rdev->data_offset;
> @@ -1646,7 +1660,9 @@ static void raid1d(mddev_t *mddev)
>  					mddev->ro ? IO_BLOCKED : NULL;
>  				r1_bio->read_disk = disk;
>  				bio_put(bio);
> -				bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
> +				bio = bio_clone_bioset(r1_bio->master_bio,
> +						       GFP_NOIO,
> +						       conf->r1_bio_set);
>  				r1_bio->bios[r1_bio->read_disk] = bio;
>  				rdev = conf->mirrors[disk].rdev;
>  				if (printk_ratelimit())
> @@ -1948,6 +1964,10 @@ static conf_t *setup_conf(mddev_t *mddev
>  					  conf->poolinfo);
>  	if (!conf->r1bio_pool)
>  		goto abort;
> +	conf->r1_bio_set = bioset_create(mddev->raid_disks * 2, 0);
> +	if (!conf->r1_bio_set)
> +		goto abort;
> +	conf->r1_bio_set->bio_destructor = r1_bio_destructor;
>  
>  	conf->poolinfo->mddev = mddev;
>  
> @@ -2012,6 +2032,8 @@ static conf_t *setup_conf(mddev_t *mddev
>  	if (conf) {
>  		if (conf->r1bio_pool)
>  			mempool_destroy(conf->r1bio_pool);
> +		if (conf->r1_bio_set)
> +			bioset_free(conf->r1_bio_set);
>  		kfree(conf->mirrors);
>  		safe_put_page(conf->tmppage);
>  		kfree(conf->poolinfo);
> @@ -2121,6 +2143,8 @@ static int stop(mddev_t *mddev)
>  	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
>  	if (conf->r1bio_pool)
>  		mempool_destroy(conf->r1bio_pool);
> +	if (conf->r1_bio_set)
> +		bioset_free(conf->r1_bio_set);
>  	kfree(conf->mirrors);
>  	kfree(conf->poolinfo);
>  	kfree(conf);
> --- linux-next.orig/fs/bio.c	2010-10-25 00:02:39.000000000 +0800
> +++ linux-next/fs/bio.c	2010-10-25 00:03:37.000000000 +0800
> @@ -306,6 +306,7 @@ out_set:
>  	bio->bi_flags |= idx << BIO_POOL_OFFSET;
>  	bio->bi_max_vecs = nr_iovecs;
>  	bio->bi_io_vec = bvl;
> +	bio->bi_destructor = bs->bio_destructor;
>  	return bio;
>  
>  err_free:
> @@ -340,12 +341,7 @@ static void bio_fs_destructor(struct bio
>   */
>  struct bio *bio_alloc(gfp_t gfp_mask, int nr_iovecs)
>  {
> -	struct bio *bio = bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
> -
> -	if (bio)
> -		bio->bi_destructor = bio_fs_destructor;
> -
> -	return bio;
> +	return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
>  }
>  EXPORT_SYMBOL(bio_alloc);
>  
> @@ -460,20 +456,21 @@ void __bio_clone(struct bio *bio, struct
>  EXPORT_SYMBOL(__bio_clone);
>  
>  /**
> - *	bio_clone	-	clone a bio
> + *	bio_clone_bioset	-	clone a bio
>   *	@bio: bio to clone
>   *	@gfp_mask: allocation priority
> + *	@bs: bio_set to allocate from
>   *
>   * 	Like __bio_clone, only also allocates the returned bio
>   */
> -struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> +struct bio *
> +bio_clone_bioset(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
>  {
> -	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
> +	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, bs);
>  
>  	if (!b)
>  		return NULL;
>  
> -	b->bi_destructor = bio_fs_destructor;
>  	__bio_clone(b, bio);
>  
>  	if (bio_integrity(bio)) {
> @@ -489,6 +486,19 @@ struct bio *bio_clone(struct bio *bio, g
>  
>  	return b;
>  }
> +EXPORT_SYMBOL(bio_clone_bioset);
> +
> +/**
> + *	bio_clone	-	clone a bio
> + *	@bio: bio to clone
> + *	@gfp_mask: allocation priority
> + *
> + *	Like __bio_clone, only also allocates the returned bio
> + */
> +struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> +{
> +	return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
> +}
>  EXPORT_SYMBOL(bio_clone);
>  
>  /**
> @@ -1664,6 +1674,7 @@ static int __init init_bio(void)
>  	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
>  	if (!fs_bio_set)
>  		panic("bio: can't allocate bios\n");
> +	fs_bio_set->bio_destructor = bio_fs_destructor;
>  
>  	bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
>  						     sizeof(struct bio_pair));
> --- linux-next.orig/include/linux/bio.h	2010-10-25 00:02:40.000000000 +0800
> +++ linux-next/include/linux/bio.h	2010-10-25 00:03:37.000000000 +0800
> @@ -227,6 +227,7 @@ extern int bio_phys_segments(struct requ
>  
>  extern void __bio_clone(struct bio *, struct bio *);
>  extern struct bio *bio_clone(struct bio *, gfp_t);
> +extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *);
>  
>  extern void bio_init(struct bio *);
>  
> @@ -299,6 +300,7 @@ struct bio_set {
>  	mempool_t *bio_integrity_pool;
>  #endif
>  	mempool_t *bvec_pool;
> +	bio_destructor_t	*bio_destructor;
>  };
>  
>  struct biovec_slab {
> --- linux-next.orig/drivers/md/raid1.h	2010-10-25 00:02:40.000000000 +0800
> +++ linux-next/drivers/md/raid1.h	2010-10-25 00:03:37.000000000 +0800
> @@ -60,6 +60,8 @@ struct r1_private_data_s {
>  	mempool_t *r1bio_pool;
>  	mempool_t *r1buf_pool;
>  
> +	struct bio_set *r1_bio_set;
> +
>  	/* When taking over an array from a different personality, we store
>  	 * the new thread here until we fully activate the array.
>  	 */


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
@ 2010-10-25  6:40                                     ` Neil Brown
  0 siblings, 0 replies; 116+ messages in thread
From: Neil Brown @ 2010-10-25  6:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, KOSAKI Motohiro, Torsten Kaiser, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

On Mon, 25 Oct 2010 00:52:34 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Fri, Oct 22, 2010 at 04:09:21PM +0800, Jens Axboe wrote:
> > On 2010-10-22 10:07, Wu Fengguang wrote:
> > >>> We surely need 1 set aside for each level of that stack that will
> > >>> potentially consume one. 1 should be enough for the generic pool, and
> > >>> then clones will use a separate pool. So md and friends should really
> > >>> have a pool per device, so that stacking will always work properly.
> > >>
> > >> Agreed for the deadlock problem.
> > >>
> > >>> There should be no throughput concerns, it should purely be a safeguard
> > >>> measure to prevent us deadlocking when doing IO for reclaim.
> > >>
> > >> It's easy to verify whether the minimal size will have negative
> > >> impacts on IO throughput. In Torsten's case, increase BIO_POOL_SIZE
> > >> by one and check how it performs.
> > > 
> > > Sorry, it seems simply increasing BIO_POOL_SIZE is not enough to fix
> > > possible deadlocks. We need to add new mempool(s), because when
> > > BIO_POOL_SIZE=2 and two concurrent reclaimers each take 1
> > > reservation, they will deadlock each other when trying to take the
> > > next bio at the raid1 level.
> > 
> > Yes, plus it's not a practical solution since you don't know how deep
> > the stack is. As I wrote in the initial email, each consumer needs its
> > own private mempool (and just 1 entry should suffice).
> 
> You are right. The below scratch patch adds minimal mempool code for raid1.
> It passed simple stress test of resync + 3 dd writers. Although write
> throughput is rather slow in my qemu, I don't observe any
> temporary/permanent stuck ups.

Hi,
  thanks for the patch.  I'll make a few changes to what I finally apply -
  for example we don't really need mempools in r1buf_pool_alloc as that isn't
  on the writeout path - so I'll tidy that up first.

  Also I'll avoid making changes to fs/bio.c at first.  It may still be a
  good idea to have a bio_clone_bioset, but that should be a separate patch -
  there are at least 3 places that would use it.
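
  For illustration, a consumer of such a helper would look roughly like
  this - a minimal sketch, not code from any tree: my_dev is a made-up
  name, and only bioset_create() plus the bio_clone_bioset() from the
  patch above are assumed.

#include <linux/bio.h>
#include <linux/errno.h>
#include <linux/gfp.h>

/* One private bio_set per device, so clones taken on the writeout path
 * never compete with other levels of the stack for the shared fs_bio_set.
 */
struct my_dev {
	struct bio_set *bs;
};

static int my_dev_setup(struct my_dev *dev, int nr_reserved)
{
	dev->bs = bioset_create(nr_reserved, 0);
	return dev->bs ? 0 : -ENOMEM;
}

static struct bio *my_dev_clone_bio(struct my_dev *dev, struct bio *bio)
{
	/* GFP_NOIO: this can be reached while doing IO for reclaim */
	return bio_clone_bioset(bio, GFP_NOIO, dev->bs);
}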

Thanks - I'll try to get this into the current merge window.

NeilBrown


> 
>  drivers/md/raid1.c  |   32 ++++++++++++++++++++++++++++----
>  drivers/md/raid1.h  |    2 ++
>  fs/bio.c            |   31 +++++++++++++++++++++----------
>  include/linux/bio.h |    2 ++
>  4 files changed, 53 insertions(+), 14 deletions(-)
> 
> --- linux-next.orig/drivers/md/raid1.c	2010-10-25 00:02:40.000000000 +0800
> +++ linux-next/drivers/md/raid1.c	2010-10-25 00:28:16.000000000 +0800
> @@ -76,6 +76,14 @@ static void r1bio_pool_free(void *r1_bio
>  	kfree(r1_bio);
>  }
>  
> +static void r1_bio_destructor(struct bio *bio)
> +{
> +	r1bio_t *r1_bio = bio->bi_private;
> +	conf_t *conf = r1_bio->mddev->private;
> +
> +	bio_free(bio, conf->r1_bio_set);
> +}
> +
>  #define RESYNC_BLOCK_SIZE (64*1024)
>  //#define RESYNC_BLOCK_SIZE PAGE_SIZE
>  #define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
> @@ -85,6 +93,7 @@ static void r1bio_pool_free(void *r1_bio
>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>  {
>  	struct pool_info *pi = data;
> +	conf_t *conf = pi->mddev->private;
>  	struct page *page;
>  	r1bio_t *r1_bio;
>  	struct bio *bio;
> @@ -100,7 +109,8 @@ static void * r1buf_pool_alloc(gfp_t gfp
>  	 * Allocate bios : 1 for reading, n-1 for writing
>  	 */
>  	for (j = pi->raid_disks ; j-- ; ) {
> -		bio = bio_alloc(gfp_flags, RESYNC_PAGES);
> +		bio = bio_alloc_bioset(gfp_flags, RESYNC_PAGES,
> +				       conf->r1_bio_set);
>  		if (!bio)
>  			goto out_free_bio;
>  		r1_bio->bios[j] = bio;
> @@ -386,6 +396,10 @@ static void raid1_end_write_request(stru
>  				!test_bit(R1BIO_Degraded, &r1_bio->state),
>  				behind);
>  		md_write_end(r1_bio->mddev);
> +		if (to_put) {
> +			bio_put(to_put);
> +			to_put = NULL;
> +		}
>  		raid_end_bio_io(r1_bio);
>  	}
>  
> @@ -851,7 +865,7 @@ static int make_request(mddev_t *mddev, 
>  		}
>  		r1_bio->read_disk = rdisk;
>  
> -		read_bio = bio_clone(bio, GFP_NOIO);
> +		read_bio = bio_clone_bioset(bio, GFP_NOIO, conf->r1_bio_set);
>  
>  		r1_bio->bios[rdisk] = read_bio;
>  
> @@ -946,7 +960,7 @@ static int make_request(mddev_t *mddev, 
>  		if (!r1_bio->bios[i])
>  			continue;
>  
> -		mbio = bio_clone(bio, GFP_NOIO);
> +		mbio = bio_clone_bioset(bio, GFP_NOIO, conf->r1_bio_set);
>  		r1_bio->bios[i] = mbio;
>  
>  		mbio->bi_sector	= r1_bio->sector + conf->mirrors[i].rdev->data_offset;
> @@ -1646,7 +1660,9 @@ static void raid1d(mddev_t *mddev)
>  					mddev->ro ? IO_BLOCKED : NULL;
>  				r1_bio->read_disk = disk;
>  				bio_put(bio);
> -				bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
> +				bio = bio_clone_bioset(r1_bio->master_bio,
> +						       GFP_NOIO,
> +						       conf->r1_bio_set);
>  				r1_bio->bios[r1_bio->read_disk] = bio;
>  				rdev = conf->mirrors[disk].rdev;
>  				if (printk_ratelimit())
> @@ -1948,6 +1964,10 @@ static conf_t *setup_conf(mddev_t *mddev
>  					  conf->poolinfo);
>  	if (!conf->r1bio_pool)
>  		goto abort;
> +	conf->r1_bio_set = bioset_create(mddev->raid_disks * 2, 0);
> +	if (!conf->r1_bio_set)
> +		goto abort;
> +	conf->r1_bio_set->bio_destructor = r1_bio_destructor;
>  
>  	conf->poolinfo->mddev = mddev;
>  
> @@ -2012,6 +2032,8 @@ static conf_t *setup_conf(mddev_t *mddev
>  	if (conf) {
>  		if (conf->r1bio_pool)
>  			mempool_destroy(conf->r1bio_pool);
> +		if (conf->r1_bio_set)
> +			bioset_free(conf->r1_bio_set);
>  		kfree(conf->mirrors);
>  		safe_put_page(conf->tmppage);
>  		kfree(conf->poolinfo);
> @@ -2121,6 +2143,8 @@ static int stop(mddev_t *mddev)
>  	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
>  	if (conf->r1bio_pool)
>  		mempool_destroy(conf->r1bio_pool);
> +	if (conf->r1_bio_set)
> +		bioset_free(conf->r1_bio_set);
>  	kfree(conf->mirrors);
>  	kfree(conf->poolinfo);
>  	kfree(conf);
> --- linux-next.orig/fs/bio.c	2010-10-25 00:02:39.000000000 +0800
> +++ linux-next/fs/bio.c	2010-10-25 00:03:37.000000000 +0800
> @@ -306,6 +306,7 @@ out_set:
>  	bio->bi_flags |= idx << BIO_POOL_OFFSET;
>  	bio->bi_max_vecs = nr_iovecs;
>  	bio->bi_io_vec = bvl;
> +	bio->bi_destructor = bs->bio_destructor;
>  	return bio;
>  
>  err_free:
> @@ -340,12 +341,7 @@ static void bio_fs_destructor(struct bio
>   */
>  struct bio *bio_alloc(gfp_t gfp_mask, int nr_iovecs)
>  {
> -	struct bio *bio = bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
> -
> -	if (bio)
> -		bio->bi_destructor = bio_fs_destructor;
> -
> -	return bio;
> +	return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
>  }
>  EXPORT_SYMBOL(bio_alloc);
>  
> @@ -460,20 +456,21 @@ void __bio_clone(struct bio *bio, struct
>  EXPORT_SYMBOL(__bio_clone);
>  
>  /**
> - *	bio_clone	-	clone a bio
> + *	bio_clone_bioset	-	clone a bio
>   *	@bio: bio to clone
>   *	@gfp_mask: allocation priority
> + *	@bs: bio_set to allocate from
>   *
>   * 	Like __bio_clone, only also allocates the returned bio
>   */
> -struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> +struct bio *
> +bio_clone_bioset(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
>  {
> -	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
> +	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, bs);
>  
>  	if (!b)
>  		return NULL;
>  
> -	b->bi_destructor = bio_fs_destructor;
>  	__bio_clone(b, bio);
>  
>  	if (bio_integrity(bio)) {
> @@ -489,6 +486,19 @@ struct bio *bio_clone(struct bio *bio, g
>  
>  	return b;
>  }
> +EXPORT_SYMBOL(bio_clone_bioset);
> +
> +/**
> + *	bio_clone	-	clone a bio
> + *	@bio: bio to clone
> + *	@gfp_mask: allocation priority
> + *
> + *	Like __bio_clone, only also allocates the returned bio
> + */
> +struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> +{
> +	return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
> +}
>  EXPORT_SYMBOL(bio_clone);
>  
>  /**
> @@ -1664,6 +1674,7 @@ static int __init init_bio(void)
>  	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
>  	if (!fs_bio_set)
>  		panic("bio: can't allocate bios\n");
> +	fs_bio_set->bio_destructor = bio_fs_destructor;
>  
>  	bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
>  						     sizeof(struct bio_pair));
> --- linux-next.orig/include/linux/bio.h	2010-10-25 00:02:40.000000000 +0800
> +++ linux-next/include/linux/bio.h	2010-10-25 00:03:37.000000000 +0800
> @@ -227,6 +227,7 @@ extern int bio_phys_segments(struct requ
>  
>  extern void __bio_clone(struct bio *, struct bio *);
>  extern struct bio *bio_clone(struct bio *, gfp_t);
> +extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *);
>  
>  extern void bio_init(struct bio *);
>  
> @@ -299,6 +300,7 @@ struct bio_set {
>  	mempool_t *bio_integrity_pool;
>  #endif
>  	mempool_t *bvec_pool;
> +	bio_destructor_t	*bio_destructor;
>  };
>  
>  struct biovec_slab {
> --- linux-next.orig/drivers/md/raid1.h	2010-10-25 00:02:40.000000000 +0800
> +++ linux-next/drivers/md/raid1.h	2010-10-25 00:03:37.000000000 +0800
> @@ -60,6 +60,8 @@ struct r1_private_data_s {
>  	mempool_t *r1bio_pool;
>  	mempool_t *r1buf_pool;
>  
> +	struct bio_set *r1_bio_set;
> +
>  	/* When taking over an array from a different personality, we store
>  	 * the new thread here until we fully activate the array.
>  	 */


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Deadlock possibly caused by too_many_isolated.
  2010-10-25  6:40                                     ` Neil Brown
@ 2010-10-25  7:26                                       ` Wu Fengguang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-10-25  7:26 UTC (permalink / raw)
  To: Neil Brown
  Cc: Jens Axboe, KOSAKI Motohiro, Torsten Kaiser, Rik van Riel,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel, linux-mm, Li,
	Shaohua

On Mon, Oct 25, 2010 at 02:40:51PM +0800, Neil Brown wrote:
> On Mon, 25 Oct 2010 00:52:34 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Fri, Oct 22, 2010 at 04:09:21PM +0800, Jens Axboe wrote:
> > > On 2010-10-22 10:07, Wu Fengguang wrote:
> > > >>> We surely need 1 set aside for each level of that stack that will
> > > >>> potentially consume one. 1 should be enough for the generic pool, and
> > > >>> then clones will use a separate pool. So md and friends should really
> > > >>> have a pool per device, so that stacking will always work properly.
> > > >>
> > > >> Agreed for the deadlock problem.
> > > >>
> > > >>> There should be no throughput concerns, it should purely be a safeguard
> > > >>> measure to prevent us deadlocking when doing IO for reclaim.
> > > >>
> > > >> It's easy to verify whether the minimal size will have negative
> > > >> impacts on IO throughput. In Torsten's case, increase BIO_POOL_SIZE
> > > >> by one and check how it performs.
> > > > 
> > > > Sorry, it seems simply increasing BIO_POOL_SIZE is not enough to fix
> > > > possible deadlocks. We need to add new mempool(s), because when
> > > > BIO_POOL_SIZE=2 and two concurrent reclaimers each take 1
> > > > reservation, they will deadlock each other when trying to take the
> > > > next bio at the raid1 level.
> > > 
> > > Yes, plus it's not a practical solution since you don't know how deep
> > > the stack is. As I wrote in the initial email, each consumer needs its
> > > own private mempool (and just 1 entry should suffice).
> > 
> > You are right. The below scratch patch adds minimal mempool code for raid1.
> > It passed simple stress test of resync + 3 dd writers. Although write
> > throughput is rather slow in my qemu, I don't observe any
> > temporary/permanent stuck ups.
> 
> Hi,
>   thanks for the patch.  I'll make a few changes to what I finally apply -
>   for example we don't really need mempools in r1buf_pool_alloc as that isn't
>   on the writeout path - so I'll tidy that up first.

OK. That change is not absolutely necessary for the deadlock fix.

It's done just in the hope of improving things a bit under memory
pressure: r1buf_pool_alloc() allocates N bios at one time, which might
temporarily exhaust BIO_POOL_SIZE. Since that path is independent of
the normal write path, I simply reuse the r1_bio_set.
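
For concreteness, this is the hunk in question with the sizing rationale
spelled out (the comment is added here for illustration; the loop itself
is the one from the patch above):

	/*
	 * One call wants pi->raid_disks bios before it can make progress,
	 * while the shared fs_bio_set only reserves BIO_POOL_SIZE entries.
	 * Drawing them from conf->r1_bio_set (created with raid_disks * 2
	 * reserved entries) means this loop can no longer drain the small
	 * shared pool that other allocators depend on.
	 */
	for (j = pi->raid_disks ; j-- ; ) {
		bio = bio_alloc_bioset(gfp_flags, RESYNC_PAGES,
				       conf->r1_bio_set);
		if (!bio)
			goto out_free_bio;
		r1_bio->bios[j] = bio;
	}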

>   Also I'll avoid making changes to fs/bio.c at first.  It may still be a
>   good idea to have a bio_clone_bioset, but that should be a separate patch -
>   there are at least 3 places that would use it.

Fair enough. I did the

        fs_bio_set->bio_destructor = bio_fs_destructor;

hack for the same reason: it would be better to pass the destructor
function as a parameter to bioset_create(), but that would require
changing more places.
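
For illustration, one possible shape for that interface - hypothetical
only, nothing below exists in any tree; it relies on the bio_destructor
field the patch above adds to struct bio_set:

#include <linux/bio.h>

/*
 * Hypothetical variant: attach the destructor at creation time instead
 * of assigning the field after bioset_create() returns.
 */
struct bio_set *bioset_create_with_destructor(unsigned int pool_size,
					      unsigned int front_pad,
					      bio_destructor_t *destructor)
{
	struct bio_set *bs = bioset_create(pool_size, front_pad);

	if (!bs)
		return NULL;
	bs->bio_destructor = destructor;
	return bs;
}

raid1's setup_conf() and fs/bio.c's init_bio() could then pass
r1_bio_destructor and bio_fs_destructor respectively when creating their
bio_sets, instead of open-coding the assignment.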

> Thanks - I'll try to get this into the current merge window.

Thank you!

Thanks,
Fengguang

> > 
> >  drivers/md/raid1.c  |   32 ++++++++++++++++++++++++++++----
> >  drivers/md/raid1.h  |    2 ++
> >  fs/bio.c            |   31 +++++++++++++++++++++----------
> >  include/linux/bio.h |    2 ++
> >  4 files changed, 53 insertions(+), 14 deletions(-)
> > 
> > --- linux-next.orig/drivers/md/raid1.c	2010-10-25 00:02:40.000000000 +0800
> > +++ linux-next/drivers/md/raid1.c	2010-10-25 00:28:16.000000000 +0800
> > @@ -76,6 +76,14 @@ static void r1bio_pool_free(void *r1_bio
> >  	kfree(r1_bio);
> >  }
> >  
> > +static void r1_bio_destructor(struct bio *bio)
> > +{
> > +	r1bio_t *r1_bio = bio->bi_private;
> > +	conf_t *conf = r1_bio->mddev->private;
> > +
> > +	bio_free(bio, conf->r1_bio_set);
> > +}
> > +
> >  #define RESYNC_BLOCK_SIZE (64*1024)
> >  //#define RESYNC_BLOCK_SIZE PAGE_SIZE
> >  #define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
> > @@ -85,6 +93,7 @@ static void r1bio_pool_free(void *r1_bio
> >  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
> >  {
> >  	struct pool_info *pi = data;
> > +	conf_t *conf = pi->mddev->private;
> >  	struct page *page;
> >  	r1bio_t *r1_bio;
> >  	struct bio *bio;
> > @@ -100,7 +109,8 @@ static void * r1buf_pool_alloc(gfp_t gfp
> >  	 * Allocate bios : 1 for reading, n-1 for writing
> >  	 */
> >  	for (j = pi->raid_disks ; j-- ; ) {
> > -		bio = bio_alloc(gfp_flags, RESYNC_PAGES);
> > +		bio = bio_alloc_bioset(gfp_flags, RESYNC_PAGES,
> > +				       conf->r1_bio_set);
> >  		if (!bio)
> >  			goto out_free_bio;
> >  		r1_bio->bios[j] = bio;
> > @@ -386,6 +396,10 @@ static void raid1_end_write_request(stru
> >  				!test_bit(R1BIO_Degraded, &r1_bio->state),
> >  				behind);
> >  		md_write_end(r1_bio->mddev);
> > +		if (to_put) {
> > +			bio_put(to_put);
> > +			to_put = NULL;
> > +		}
> >  		raid_end_bio_io(r1_bio);
> >  	}
> >  
> > @@ -851,7 +865,7 @@ static int make_request(mddev_t *mddev, 
> >  		}
> >  		r1_bio->read_disk = rdisk;
> >  
> > -		read_bio = bio_clone(bio, GFP_NOIO);
> > +		read_bio = bio_clone_bioset(bio, GFP_NOIO, conf->r1_bio_set);
> >  
> >  		r1_bio->bios[rdisk] = read_bio;
> >  
> > @@ -946,7 +960,7 @@ static int make_request(mddev_t *mddev, 
> >  		if (!r1_bio->bios[i])
> >  			continue;
> >  
> > -		mbio = bio_clone(bio, GFP_NOIO);
> > +		mbio = bio_clone_bioset(bio, GFP_NOIO, conf->r1_bio_set);
> >  		r1_bio->bios[i] = mbio;
> >  
> >  		mbio->bi_sector	= r1_bio->sector + conf->mirrors[i].rdev->data_offset;
> > @@ -1646,7 +1660,9 @@ static void raid1d(mddev_t *mddev)
> >  					mddev->ro ? IO_BLOCKED : NULL;
> >  				r1_bio->read_disk = disk;
> >  				bio_put(bio);
> > -				bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
> > +				bio = bio_clone_bioset(r1_bio->master_bio,
> > +						       GFP_NOIO,
> > +						       conf->r1_bio_set);
> >  				r1_bio->bios[r1_bio->read_disk] = bio;
> >  				rdev = conf->mirrors[disk].rdev;
> >  				if (printk_ratelimit())
> > @@ -1948,6 +1964,10 @@ static conf_t *setup_conf(mddev_t *mddev
> >  					  conf->poolinfo);
> >  	if (!conf->r1bio_pool)
> >  		goto abort;
> > +	conf->r1_bio_set = bioset_create(mddev->raid_disks * 2, 0);
> > +	if (!conf->r1_bio_set)
> > +		goto abort;
> > +	conf->r1_bio_set->bio_destructor = r1_bio_destructor;
> >  
> >  	conf->poolinfo->mddev = mddev;
> >  
> > @@ -2012,6 +2032,8 @@ static conf_t *setup_conf(mddev_t *mddev
> >  	if (conf) {
> >  		if (conf->r1bio_pool)
> >  			mempool_destroy(conf->r1bio_pool);
> > +		if (conf->r1_bio_set)
> > +			bioset_free(conf->r1_bio_set);
> >  		kfree(conf->mirrors);
> >  		safe_put_page(conf->tmppage);
> >  		kfree(conf->poolinfo);
> > @@ -2121,6 +2143,8 @@ static int stop(mddev_t *mddev)
> >  	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
> >  	if (conf->r1bio_pool)
> >  		mempool_destroy(conf->r1bio_pool);
> > +	if (conf->r1_bio_set)
> > +		bioset_free(conf->r1_bio_set);
> >  	kfree(conf->mirrors);
> >  	kfree(conf->poolinfo);
> >  	kfree(conf);
> > --- linux-next.orig/fs/bio.c	2010-10-25 00:02:39.000000000 +0800
> > +++ linux-next/fs/bio.c	2010-10-25 00:03:37.000000000 +0800
> > @@ -306,6 +306,7 @@ out_set:
> >  	bio->bi_flags |= idx << BIO_POOL_OFFSET;
> >  	bio->bi_max_vecs = nr_iovecs;
> >  	bio->bi_io_vec = bvl;
> > +	bio->bi_destructor = bs->bio_destructor;
> >  	return bio;
> >  
> >  err_free:
> > @@ -340,12 +341,7 @@ static void bio_fs_destructor(struct bio
> >   */
> >  struct bio *bio_alloc(gfp_t gfp_mask, int nr_iovecs)
> >  {
> > -	struct bio *bio = bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
> > -
> > -	if (bio)
> > -		bio->bi_destructor = bio_fs_destructor;
> > -
> > -	return bio;
> > +	return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
> >  }
> >  EXPORT_SYMBOL(bio_alloc);
> >  
> > @@ -460,20 +456,21 @@ void __bio_clone(struct bio *bio, struct
> >  EXPORT_SYMBOL(__bio_clone);
> >  
> >  /**
> > - *	bio_clone	-	clone a bio
> > + *	bio_clone_bioset	-	clone a bio
> >   *	@bio: bio to clone
> >   *	@gfp_mask: allocation priority
> > + *	@bs: bio_set to allocate from
> >   *
> >   * 	Like __bio_clone, only also allocates the returned bio
> >   */
> > -struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> > +struct bio *
> > +bio_clone_bioset(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
> >  {
> > -	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
> > +	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, bs);
> >  
> >  	if (!b)
> >  		return NULL;
> >  
> > -	b->bi_destructor = bio_fs_destructor;
> >  	__bio_clone(b, bio);
> >  
> >  	if (bio_integrity(bio)) {
> > @@ -489,6 +486,19 @@ struct bio *bio_clone(struct bio *bio, g
> >  
> >  	return b;
> >  }
> > +EXPORT_SYMBOL(bio_clone_bioset);
> > +
> > +/**
> > + *	bio_clone	-	clone a bio
> > + *	@bio: bio to clone
> > + *	@gfp_mask: allocation priority
> > + *
> > + *	Like __bio_clone, only also allocates the returned bio
> > + */
> > +struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> > +{
> > +	return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
> > +}
> >  EXPORT_SYMBOL(bio_clone);
> >  
> >  /**
> > @@ -1664,6 +1674,7 @@ static int __init init_bio(void)
> >  	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
> >  	if (!fs_bio_set)
> >  		panic("bio: can't allocate bios\n");
> > +	fs_bio_set->bio_destructor = bio_fs_destructor;
> >  
> >  	bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
> >  						     sizeof(struct bio_pair));
> > --- linux-next.orig/include/linux/bio.h	2010-10-25 00:02:40.000000000 +0800
> > +++ linux-next/include/linux/bio.h	2010-10-25 00:03:37.000000000 +0800
> > @@ -227,6 +227,7 @@ extern int bio_phys_segments(struct requ
> >  
> >  extern void __bio_clone(struct bio *, struct bio *);
> >  extern struct bio *bio_clone(struct bio *, gfp_t);
> > +extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *);
> >  
> >  extern void bio_init(struct bio *);
> >  
> > @@ -299,6 +300,7 @@ struct bio_set {
> >  	mempool_t *bio_integrity_pool;
> >  #endif
> >  	mempool_t *bvec_pool;
> > +	bio_destructor_t	*bio_destructor;
> >  };
> >  
> >  struct biovec_slab {
> > --- linux-next.orig/drivers/md/raid1.h	2010-10-25 00:02:40.000000000 +0800
> > +++ linux-next/drivers/md/raid1.h	2010-10-25 00:03:37.000000000 +0800
> > @@ -60,6 +60,8 @@ struct r1_private_data_s {
> >  	mempool_t *r1bio_pool;
> >  	mempool_t *r1buf_pool;
> >  
> > +	struct bio_set *r1_bio_set;
> > +
> >  	/* When taking over an array from a different personality, we store
> >  	 * the new thread here until we fully activate the array.
> >  	 */

^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2010-10-25  7:26 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-14 23:11 Deadlock possibly caused by too_many_isolated Neil Brown
2010-09-14 23:11 ` Neil Brown
2010-09-15  0:30 ` Rik van Riel
2010-09-15  0:30   ` Rik van Riel
2010-09-15  2:23   ` Neil Brown
2010-09-15  2:23     ` Neil Brown
2010-09-15  2:37     ` Wu Fengguang
2010-09-15  2:37       ` Wu Fengguang
2010-09-15  2:54       ` Wu Fengguang
2010-09-15  2:54         ` Wu Fengguang
2010-09-15  3:06         ` Wu Fengguang
2010-09-15  3:06           ` Wu Fengguang
2010-09-15  3:13           ` Wu Fengguang
2010-09-15  3:13             ` Wu Fengguang
2010-09-15  3:18             ` Shaohua Li
2010-09-15  3:18               ` Shaohua Li
2010-09-15  3:31               ` Wu Fengguang
2010-09-15  3:31                 ` Wu Fengguang
2010-09-15  3:17           ` Neil Brown
2010-09-15  3:17             ` Neil Brown
2010-09-15  3:47             ` Wu Fengguang
2010-09-15  3:47               ` Wu Fengguang
2010-09-15  8:28     ` Wu Fengguang
2010-09-15  8:28       ` Wu Fengguang
2010-09-15  8:44       ` Neil Brown
2010-09-15  8:44         ` Neil Brown
2010-10-18  4:14         ` Neil Brown
2010-10-18  4:14           ` Neil Brown
2010-10-18  5:04           ` KOSAKI Motohiro
2010-10-18  5:04             ` KOSAKI Motohiro
2010-10-18 10:58           ` Torsten Kaiser
2010-10-18 10:58             ` Torsten Kaiser
2010-10-18 23:11             ` Neil Brown
2010-10-18 23:11               ` Neil Brown
2010-10-19  8:43               ` Torsten Kaiser
2010-10-19  8:43                 ` Torsten Kaiser
2010-10-19 10:06                 ` Torsten Kaiser
2010-10-19 10:06                   ` Torsten Kaiser
2010-10-20  5:57                   ` Wu Fengguang
2010-10-20  5:57                     ` Wu Fengguang
2010-10-20  7:05                     ` KOSAKI Motohiro
2010-10-20  7:05                       ` KOSAKI Motohiro
2010-10-20  9:27                       ` Wu Fengguang
2010-10-20  9:27                         ` Wu Fengguang
2010-10-20 13:03                         ` Jens Axboe
2010-10-20 13:03                           ` Jens Axboe
2010-10-22  5:37                           ` Wu Fengguang
2010-10-22  5:37                             ` Wu Fengguang
2010-10-22  8:07                             ` Wu Fengguang
2010-10-22  8:07                               ` Wu Fengguang
2010-10-22  8:09                               ` Jens Axboe
2010-10-22  8:09                                 ` Jens Axboe
2010-10-24 16:52                                 ` Wu Fengguang
2010-10-24 16:52                                   ` Wu Fengguang
2010-10-25  6:40                                   ` Neil Brown
2010-10-25  6:40                                     ` Neil Brown
2010-10-25  7:26                                     ` Wu Fengguang
2010-10-25  7:26                                       ` Wu Fengguang
2010-10-20  7:25                     ` Torsten Kaiser
2010-10-20  7:25                       ` Torsten Kaiser
2010-10-20  9:01                       ` Wu Fengguang
2010-10-20  9:01                         ` Wu Fengguang
2010-10-20 10:07                         ` Torsten Kaiser
2010-10-20 10:07                           ` Torsten Kaiser
2010-10-20 14:23                       ` Minchan Kim
2010-10-20 14:23                         ` Minchan Kim
2010-10-20 15:35                         ` Torsten Kaiser
2010-10-20 15:35                           ` Torsten Kaiser
2010-10-20 23:31                           ` Minchan Kim
2010-10-20 23:31                             ` Minchan Kim
2010-10-18 16:15           ` Wu Fengguang
2010-10-18 16:15             ` Wu Fengguang
2010-10-18 21:58             ` Andrew Morton
2010-10-18 21:58               ` Andrew Morton
2010-10-18 22:31               ` Neil Brown
2010-10-18 22:31                 ` Neil Brown
2010-10-18 22:41                 ` Andrew Morton
2010-10-18 22:41                   ` Andrew Morton
2010-10-19  0:57                   ` KOSAKI Motohiro
2010-10-19  0:57                     ` KOSAKI Motohiro
2010-10-19  1:15                     ` Minchan Kim
2010-10-19  1:15                       ` Minchan Kim
2010-10-19  1:21                       ` KOSAKI Motohiro
2010-10-19  1:21                         ` KOSAKI Motohiro
2010-10-19  1:32                         ` Minchan Kim
2010-10-19  1:32                           ` Minchan Kim
2010-10-19  2:03                           ` KOSAKI Motohiro
2010-10-19  2:03                             ` KOSAKI Motohiro
2010-10-19  2:16                             ` Minchan Kim
2010-10-19  2:16                               ` Minchan Kim
2010-10-19  2:54                               ` KOSAKI Motohiro
2010-10-19  2:54                                 ` KOSAKI Motohiro
2010-10-19  2:35                       ` Wu Fengguang
2010-10-19  2:35                         ` Wu Fengguang
2010-10-19  2:52                         ` Minchan Kim
2010-10-19  2:52                           ` Minchan Kim
2010-10-19  3:05                           ` Wu Fengguang
2010-10-19  3:05                             ` Wu Fengguang
2010-10-19  3:09                             ` Minchan Kim
2010-10-19  3:09                               ` Minchan Kim
2010-10-19  3:13                               ` KOSAKI Motohiro
2010-10-19  3:13                                 ` KOSAKI Motohiro
2010-10-19  5:11                                 ` Minchan Kim
2010-10-19  5:11                                   ` Minchan Kim
2010-10-19  3:21                               ` Shaohua Li
2010-10-19  3:21                                 ` Shaohua Li
2010-10-19  7:15                                 ` Shaohua Li
2010-10-19  7:15                                   ` Shaohua Li
2010-10-19  7:34                                   ` Minchan Kim
2010-10-19  7:34                                     ` Minchan Kim
2010-10-19  2:24                   ` Wu Fengguang
2010-10-19  2:24                     ` Wu Fengguang
2010-10-19  2:37                     ` KOSAKI Motohiro
2010-10-19  2:37                       ` KOSAKI Motohiro
2010-10-19  2:37                     ` Minchan Kim
2010-10-19  2:37                       ` Minchan Kim
