linux-kernel.vger.kernel.org archive mirror
* [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
@ 2016-11-17 19:24 Douglas Anderson
  2016-11-17 20:28 ` Mike Snitzer
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Douglas Anderson @ 2016-11-17 19:24 UTC (permalink / raw)
  To: Alasdair Kergon, Mike Snitzer
  Cc: David Rientjes, Dmitry Torokhov, Sonny Rao, linux,
	Douglas Anderson, dm-devel, shli, linux-raid, linux-kernel

We've seen in-field reports showing _lots_ (18 in one case, 41 in
another) of tasks all sitting there blocked on:

  mutex_lock+0x4c/0x68
  dm_bufio_shrink_count+0x38/0x78
  shrink_slab.part.54.constprop.65+0x100/0x464
  shrink_zone+0xa8/0x198

In the two cases analyzed, we see one task that looks like this:

  Workqueue: kverityd verity_prefetch_io

  __switch_to+0x9c/0xa8
  __schedule+0x440/0x6d8
  schedule+0x94/0xb4
  schedule_timeout+0x204/0x27c
  schedule_timeout_uninterruptible+0x44/0x50
  wait_iff_congested+0x9c/0x1f0
  shrink_inactive_list+0x3a0/0x4cc
  shrink_lruvec+0x418/0x5cc
  shrink_zone+0x88/0x198
  try_to_free_pages+0x51c/0x588
  __alloc_pages_nodemask+0x648/0xa88
  __get_free_pages+0x34/0x7c
  alloc_buffer+0xa4/0x144
  __bufio_new+0x84/0x278
  dm_bufio_prefetch+0x9c/0x154
  verity_prefetch_io+0xe8/0x10c
  process_one_work+0x240/0x424
  worker_thread+0x2fc/0x424
  kthread+0x10c/0x114

...and that looks to be the one holding the mutex.

The problem can be reproduced fairly easily:
0. Be running Chrome OS w/ verity enabled on the root filesystem
1. Pick test patch: http://crosreview.com/412360
2. Install launchBalloons.sh and balloon.arm from
     http://crbug.com/468342
   ...that's just a memory stress test app.
3. On a 4GB rk3399 machine, run
     nice ./launchBalloons.sh 4 900 100000
   ...that tries to eat 4 * 900 MB of memory and keeps accessing it.
4. Login to the Chrome web browser and restore many tabs

With that, I've seen printouts like:
  DOUG: long bufio 90758 ms
...and the stack trace always shows we're in dm_bufio_prefetch().

The problem is that we try to allocate memory with GFP_NOIO while
we're holding the dm_bufio lock.  Instead we should be using
GFP_NOWAIT.  Using GFP_NOIO can cause us to sleep while holding the
lock and that causes the above problems.

The current behavior is explained by David Rientjes:

  It will still try reclaim initially because __GFP_WAIT (or
  __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO.  This is the cause of
  contention on dm_bufio_lock() that the thread holds.  You want to
  pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a
  mutex that can be contended by a concurrent slab shrinker (if
  count_objects didn't use a trylock, this pattern would trivially
  deadlock).

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
Note that this change was developed and tested against the Chrome OS
4.4 kernel tree, not mainline.  Due to slight differences in verity
between mainline and Chrome OS it became too difficult to reproduce my
testing setup on mainline.  This patch still seems correct and
relevant to upstream, so I'm posting it.  If this is not acceptable to
you then please ignore this patch.

Also note that when I tested the Chrome OS 3.14 kernel tree I couldn't
reproduce the long delays described in the patch.  Presumably
something changed in either the kernel config or the memory management
code between the two kernel versions that made this crop up.  In a
similar vein, it is possible that problems described in this patch are
no longer reproducible upstream.  However, the arguments made in this
patch (that we don't want to block while holding the mutex) still
apply so I think the patch may still have merit.

 drivers/md/dm-bufio.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index b3ba142e59a4..3c767399cc59 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -827,7 +827,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
 	 * dm-bufio is resistant to allocation failures (it just keeps
 	 * one buffer reserved in cases all the allocations fail).
 	 * So set flags to not try too hard:
-	 *	GFP_NOIO: don't recurse into the I/O layer
+	 *	GFP_NOWAIT: don't wait; if we need to sleep we'll release our
+	 *		    mutex and wait ourselves.
 	 *	__GFP_NORETRY: don't retry and rather return failure
 	 *	__GFP_NOMEMALLOC: don't use emergency reserves
 	 *	__GFP_NOWARN: don't print a warning in case of failure
@@ -837,7 +838,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
 	 */
 	while (1) {
 		if (dm_bufio_cache_size_latch != 1) {
-			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY |
+					 __GFP_NOMEMALLOC | __GFP_NOWARN);
 			if (b)
 				return b;
 		}
-- 
2.8.0.rc3.226.g39d4020
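
To make the failure mode concrete, here is a minimal sketch of the pattern
the commit message describes (hypothetical names, not the dm-bufio source):
a shrinker whose count_objects callback takes a mutex, combined with an
allocation that may enter direct reclaim while that same mutex is held.
With a reclaiming mask such as GFP_NOIO the allocation can sleep for a long
time with the lock held; with GFP_NOWAIT it fails immediately and the
existing "release the mutex and wait ourselves" fallback takes over.

  #include <linux/gfp.h>
  #include <linux/mutex.h>
  #include <linux/shrinker.h>

  static DEFINE_MUTEX(cache_lock);     /* stand-in for the dm_bufio client mutex */
  static unsigned long cached_objects;

  /* Every task that reaches shrink_slab() ends up sleeping on cache_lock here. */
  static unsigned long cache_shrink_count(struct shrinker *s,
                                          struct shrink_control *sc)
  {
          unsigned long count;

          mutex_lock(&cache_lock);
          count = cached_objects;
          mutex_unlock(&cache_lock);
          return count;
  }

  /*
   * The problematic pattern: a GFP_NOIO allocation may enter direct reclaim
   * and sleep (e.g. in wait_iff_congested()) while cache_lock is held,
   * stalling everyone queued up in cache_shrink_count().  GFP_NOWAIT, as
   * used in the patch above, never enters direct reclaim, so the lock is
   * only held briefly and the caller falls back to its own wait-and-retry.
   */
  static void *cache_alloc_page(gfp_t gfp)
  {
          void *p;

          mutex_lock(&cache_lock);
          p = (void *)__get_free_pages(gfp | __GFP_NORETRY |
                                       __GFP_NOMEMALLOC | __GFP_NOWARN, 0);
          mutex_unlock(&cache_lock);
          return p;
  }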


* Re: dm: Avoid sleeping while holding the dm_bufio lock
  2016-11-17 19:24 [PATCH] dm: Avoid sleeping while holding the dm_bufio lock Douglas Anderson
@ 2016-11-17 20:28 ` Mike Snitzer
  2016-11-17 20:44   ` Doug Anderson
  2016-11-17 21:56 ` [PATCH] " Guenter Roeck
  2016-11-23 20:57 ` Mikulas Patocka
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Snitzer @ 2016-11-17 20:28 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Alasdair Kergon, David Rientjes, Dmitry Torokhov, Sonny Rao,
	linux, dm-devel, shli, linux-raid, linux-kernel

On Thu, Nov 17 2016 at  2:24pm -0500,
Douglas Anderson <dianders@chromium.org> wrote:

> We've seen in-field reports showing _lots_ (18 in one case, 41 in
> another) of tasks all sitting there blocked on:
> 
>   mutex_lock+0x4c/0x68
>   dm_bufio_shrink_count+0x38/0x78
>   shrink_slab.part.54.constprop.65+0x100/0x464
>   shrink_zone+0xa8/0x198

I have one report of a very low-memory system hitting issues with bufio
(in the context of DM-thinp, due to bufio shrinker) but nothing
implicating alloc_buffer().

In any case, I'm fine with your patch given that we'll just retry.  BUT
spinning in __alloc_buffer_wait_no_callback() doesn't really change the
fact that you're starved for memory.  It just makes this less visible
right?  Meaning that you won't see hung task timeouts?  Or were you
seeing these tasks manifest this back-pressure through other means?


* Re: dm: Avoid sleeping while holding the dm_bufio lock
  2016-11-17 20:28 ` Mike Snitzer
@ 2016-11-17 20:44   ` Doug Anderson
  2016-11-17 20:48     ` Mike Snitzer
  0 siblings, 1 reply; 12+ messages in thread
From: Doug Anderson @ 2016-11-17 20:44 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Alasdair Kergon, David Rientjes, Dmitry Torokhov, Sonny Rao,
	Guenter Roeck, dm-devel, Shaohua Li, linux-raid, linux-kernel

Hi,

On Thu, Nov 17, 2016 at 12:28 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> I have one report of a very low-memory system hitting issues with bufio
> (in the context of DM-thinp, due to bufio shrinker) but nothing
> implicating alloc_buffer().
>
> In any case, I'm fine with your patch given that we'll just retry.  BUT
> spinning in __alloc_buffer_wait_no_callback() doesn't really change the
> fact that you're starved for memory.  It just makes this less visible
> right?  Meaning that you won't see hung task timeouts?  Or were you
> seeing these tasks manifest this back-pressure through other means?

It actually significantly increases responsiveness of the system while
in this state, so it makes a real difference.  I believe it actually
changes behavior because it (at least) unblocks kswapd.  In the bug
report I analyzed, I saw:

   kswapd0         D ffffffc000204fd8     0    72      2 0x00000000
   Call trace:
   [<ffffffc000204fd8>] __switch_to+0x9c/0xa8
   [<ffffffc00090b794>] __schedule+0x440/0x6d8
   [<ffffffc00090bac0>] schedule+0x94/0xb4
   [<ffffffc00090be44>] schedule_preempt_disabled+0x28/0x44
   [<ffffffc00090d900>] __mutex_lock_slowpath+0x120/0x1ac
   [<ffffffc00090d9d8>] mutex_lock+0x4c/0x68
   [<ffffffc000708e7c>] dm_bufio_shrink_count+0x38/0x78
   [<ffffffc00030b268>] shrink_slab.part.54.constprop.65+0x100/0x464
   [<ffffffc00030dbd8>] shrink_zone+0xa8/0x198
   [<ffffffc00030e578>] balance_pgdat+0x328/0x508
   [<ffffffc00030eb7c>] kswapd+0x424/0x51c
   [<ffffffc00023f06c>] kthread+0x10c/0x114
   [<ffffffc000203dd0>] ret_from_fork+0x10/0x40

I'm not an expert, but I believe that blocking kswapd isn't a super
great idea and that if we unblock it (as my patch does) then that
can help alleviate memory pressure.


-Doug


* Re: dm: Avoid sleeping while holding the dm_bufio lock
  2016-11-17 20:44   ` Doug Anderson
@ 2016-11-17 20:48     ` Mike Snitzer
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2016-11-17 20:48 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Shaohua Li, Dmitry Torokhov, linux-kernel, linux-raid, dm-devel,
	Alasdair Kergon, David Rientjes, Sonny Rao, Guenter Roeck

On Thu, Nov 17 2016 at  3:44pm -0500,
Doug Anderson <dianders@chromium.org> wrote:

> It actually significantly increases responsiveness of the system while
> in this state, so it makes a real difference.  I believe it actually
> changes behavior because it (at least) unblocks kswapd.  In the bug
> report I analyzed, I saw:
> 
>    kswapd0         D ffffffc000204fd8     0    72      2 0x00000000
>    Call trace:
>    [<ffffffc000204fd8>] __switch_to+0x9c/0xa8
>    [<ffffffc00090b794>] __schedule+0x440/0x6d8
>    [<ffffffc00090bac0>] schedule+0x94/0xb4
>    [<ffffffc00090be44>] schedule_preempt_disabled+0x28/0x44
>    [<ffffffc00090d900>] __mutex_lock_slowpath+0x120/0x1ac
>    [<ffffffc00090d9d8>] mutex_lock+0x4c/0x68
>    [<ffffffc000708e7c>] dm_bufio_shrink_count+0x38/0x78
>    [<ffffffc00030b268>] shrink_slab.part.54.constprop.65+0x100/0x464
>    [<ffffffc00030dbd8>] shrink_zone+0xa8/0x198
>    [<ffffffc00030e578>] balance_pgdat+0x328/0x508
>    [<ffffffc00030eb7c>] kswapd+0x424/0x51c
>    [<ffffffc00023f06c>] kthread+0x10c/0x114
>    [<ffffffc000203dd0>] ret_from_fork+0x10/0x40
> 
> I'm not an expert, but I believe that blocking kswapd isn't a super
> great idea and that if we unblock it (as my patch does) then that
> can help alleviate memory pressure.

OK, thanks for clarifying.  I'll get it staged for 4.10 this week.


* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
  2016-11-17 19:24 [PATCH] dm: Avoid sleeping while holding the dm_bufio lock Douglas Anderson
  2016-11-17 20:28 ` Mike Snitzer
@ 2016-11-17 21:56 ` Guenter Roeck
  2016-11-23 20:57 ` Mikulas Patocka
  2 siblings, 0 replies; 12+ messages in thread
From: Guenter Roeck @ 2016-11-17 21:56 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Alasdair Kergon, Mike Snitzer, David Rientjes, Dmitry Torokhov,
	Sonny Rao, dm-devel, shli, linux-raid, linux-kernel

On Thu, Nov 17, 2016 at 11:24:20AM -0800, Douglas Anderson wrote:

Reviewed-by: Guenter Roeck <linux@roeck-us.net>



* [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
  2016-11-17 19:24 [PATCH] dm: Avoid sleeping while holding the dm_bufio lock Douglas Anderson
  2016-11-17 20:28 ` Mike Snitzer
  2016-11-17 21:56 ` [PATCH] " Guenter Roeck
@ 2016-11-23 20:57 ` Mikulas Patocka
  2016-12-08  0:54   ` Doug Anderson
  2016-12-15  0:53   ` Doug Anderson
  2 siblings, 2 replies; 12+ messages in thread
From: Mikulas Patocka @ 2016-11-23 20:57 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Alasdair Kergon, Mike Snitzer, shli, Dmitry Torokhov,
	linux-kernel, linux-raid, dm-devel, David Rientjes, Sonny Rao,
	linux

Hi

A GFP_NOIO allocation can reclaim clean cached pages; a GFP_NOWAIT
allocation can't. Your patch would incorrectly reuse buffers in a
situation where memory is filled with clean cached pages.

Here I'm proposing an alternate patch that first tries a GFP_NOWAIT
allocation, then drops the lock and tries a GFP_NOIO allocation.

Note that the root cause of the stack trace you are seeing is that your
block device is congested - i.e. there are too many requests in the
device's queue - and fixing this wait won't fix that root cause (the
congested device).

The congestion limits are set in blk_queue_congestion_threshold() to
between 7/8 and 13/16 of the nr_requests value.

If you don't want your device to report congested status, you can
increase /sys/block/<device>/queue/nr_requests - you should test whether
your chromebook is faster or slower with this setting increased. But note
that this setting won't increase the IOPS of the device.
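
As a rough worked example of those thresholds (a sketch of the arithmetic
described above, not the exact blk_queue_congestion_threshold() code, which
also applies small +/-1 adjustments): with the default nr_requests of 128
reported later in this thread, the queue is flagged congested at roughly
112 in-flight requests and uncongested again at roughly 104.

  #include <stdio.h>

  int main(void)
  {
          unsigned int nr_requests = 128;             /* default reported below */
          unsigned int on  = nr_requests * 7 / 8;     /* congested "on"  at ~112 */
          unsigned int off = nr_requests * 13 / 16;   /* congested "off" at ~104 */

          printf("nr_requests=%u: congested on at ~%u, off again at ~%u\n",
                 nr_requests, on, off);
          return 0;
  }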

Mikulas


From: Mikulas Patocka <mpatocka@redhat.com>

Subject: dm-bufio: drop the lock when doing GFP_NOIO allocation

Drop the lock when doing the GFP_NOIO allocation because the allocation
can take some time.

Note that we won't do GFP_NOIO allocation when we loop for the second
time, because the lock shouldn't be dropped between __wait_for_free_buffer
and __get_unclaimed_buffer.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-bufio.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/md/dm-bufio.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-bufio.c
+++ linux-2.6/drivers/md/dm-bufio.c
@@ -822,11 +822,13 @@ enum new_flag {
 static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
 {
 	struct dm_buffer *b;
+	bool tried_noio_alloc = false;
 
 	/*
 	 * dm-bufio is resistant to allocation failures (it just keeps
 	 * one buffer reserved in cases all the allocations fail).
 	 * So set flags to not try too hard:
+	 *	GFP_NOWAIT: don't sleep and don't release cache
 	 *	GFP_NOIO: don't recurse into the I/O layer
 	 *	__GFP_NORETRY: don't retry and rather return failure
 	 *	__GFP_NOMEMALLOC: don't use emergency reserves
@@ -837,7 +839,7 @@ static struct dm_buffer *__alloc_buffer_
 	 */
 	while (1) {
 		if (dm_bufio_cache_size_latch != 1) {
-			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
 			if (b)
 				return b;
 		}
@@ -845,6 +847,15 @@ static struct dm_buffer *__alloc_buffer_
 		if (nf == NF_PREFETCH)
 			return NULL;
 
+		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
+			dm_bufio_unlock(c);
+			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			dm_bufio_lock(c);
+			if (b)
+				return b;
+			tried_noio_alloc = true;
+		}
+
 		if (!list_empty(&c->reserved_buffers)) {
 			b = list_entry(c->reserved_buffers.next,
 				       struct dm_buffer, lru_list);


* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
  2016-11-23 20:57 ` Mikulas Patocka
@ 2016-12-08  0:54   ` Doug Anderson
  2016-12-08 23:20     ` Mikulas Patocka
  2016-12-15  0:53   ` Doug Anderson
  1 sibling, 1 reply; 12+ messages in thread
From: Doug Anderson @ 2016-12-08  0:54 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alasdair Kergon, Mike Snitzer, Shaohua Li, Dmitry Torokhov,
	linux-kernel, linux-raid, dm-devel, David Rientjes, Sonny Rao,
	Guenter Roeck

Hi,

On Wed, Nov 23, 2016 at 12:57 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> Hi
>
> A GFP_NOIO allocation can reclaim clean cached pages; a GFP_NOWAIT
> allocation can't. Your patch would incorrectly reuse buffers in a
> situation where memory is filled with clean cached pages.
>
> Here I'm proposing an alternate patch that first tries a GFP_NOWAIT
> allocation, then drops the lock and tries a GFP_NOIO allocation.
>
> Note that the root cause of the stack trace you are seeing is that your
> block device is congested - i.e. there are too many requests in the
> device's queue - and fixing this wait won't fix that root cause (the
> congested device).
>
> The congestion limits are set in blk_queue_congestion_threshold() to
> between 7/8 and 13/16 of the nr_requests value.
>
> If you don't want your device to report congested status, you can
> increase /sys/block/<device>/queue/nr_requests - you should test whether
> your chromebook is faster or slower with this setting increased. But note
> that this setting won't increase the IOPS of the device.

Cool, thanks for the insight!

Can you clarify which block device is relevant here?  Is this the DM
block device, the underlying block device, or the swap block device?
I'm not at all an expert on DM, but I think we have:

1. /dev/mmcblk0 - the underlying storage device.
2. /dev/dm-0 - The verity device that's run atop /dev/mmcblk0p3
3. /dev/zram0 - Our swap device

As stated in the original email, I'm running on a downstream kernel
(kernel-4.4) with bunches of local patches, so it's plausible that
things have changed in the meantime, but:

* At boot time the "nr_requests" for all block devices was 128
* I was unable to set the "nr_requests" for dm-0 and zram0 (it just
gives an error in sysfs).
* When I set "nr_requests" to 4096 for /dev/mmcblk0 it didn't seem to
affect the problem.


I agree: I believe it is safe to drop and re-grab the dm_bufio lock in
this function, and I also believe it is safe to call alloc_buffer()
without holding the dm_bufio lock.

That means that this looks fine to me.  It also fixes the test case
that I have.  ...so for what it's worth:

Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>

-Doug


* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
  2016-12-08  0:54   ` Doug Anderson
@ 2016-12-08 23:20     ` Mikulas Patocka
  2016-12-13  0:08       ` Doug Anderson
  0 siblings, 1 reply; 12+ messages in thread
From: Mikulas Patocka @ 2016-12-08 23:20 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Alasdair Kergon, Mike Snitzer, Shaohua Li, Dmitry Torokhov,
	linux-kernel, linux-raid, dm-devel, David Rientjes, Sonny Rao,
	Guenter Roeck



On Wed, 7 Dec 2016, Doug Anderson wrote:

> Cool, thanks for the insight!
> 
> Can you clarify which block device is relevant here?  Is this the DM
> block device, the underlying block device, or the swap block device?
> I'm not at all an expert on DM, but I think we have:
> 
> 1. /dev/mmcblk0 - the underlying storage device.
> 2. /dev/dm-0 - The verity device that's run atop /dev/mmcblk0p3
> 3. /dev/zram0 - Our swap device

The /dev/mmcblk0 device is congested. You can see the number of requests 
in /sys/block/mmcblk0/inflight

> As stated in the original email, I'm running on a downstream kernel
> (kernel-4.4) with bunches of local patches, so it's plausible that
> things have changed in the meantime, but:
> 
> * At boot time the "nr_requests" for all block devices was 128
> * I was unable to set the "nr_requests" for dm-0 and zram0 (it just
> gives an error in sysfs).
> * When I set "nr_requests" to 4096 for /dev/mmcblk0 it didn't seem to
> affect the problem.

The eMMC has a limited number of IOPS and that can't be improved. Use a
faster block device - but it will cost more money.

If you want to handle a situation where you run 4 tasks each eating
900MB, just use more memory; don't expect this to work smoothly on a
4GB machine.

If you want to protect the chromebook from runaway memory allocations, you
can detect this situation in some watchdog process and either kill the
process that consumes the most memory with the kill syscall or trigger the
kernel OOM killer by writing 'f' to /proc/sysrq-trigger.
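
A minimal sketch of that watchdog idea (hypothetical code, not an existing
Chrome OS component; it assumes the process has the privileges to write to
/proc/sysrq-trigger and that sysrq is enabled): poll MemAvailable and poke
the kernel OOM killer when it drops below some threshold.

  #include <stdio.h>
  #include <unistd.h>

  /* Read MemAvailable (in kB) from /proc/meminfo; returns -1 on failure. */
  static long mem_available_kb(void)
  {
          char line[256];
          long kb = -1;
          FILE *f = fopen("/proc/meminfo", "r");

          if (!f)
                  return -1;
          while (fgets(line, sizeof(line), f))
                  if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1)
                          break;
          fclose(f);
          return kb;
  }

  int main(void)
  {
          const long threshold_kb = 50 * 1024;    /* arbitrary example threshold */

          for (;;) {
                  long avail = mem_available_kb();

                  if (avail >= 0 && avail < threshold_kb) {
                          /* Trigger the kernel OOM killer, as suggested above. */
                          FILE *t = fopen("/proc/sysrq-trigger", "w");

                          if (t) {
                                  fputc('f', t);
                                  fclose(t);
                          }
                  }
                  sleep(1);
          }
  }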

The question is what you really want - to handle this situation smoothly
(then you must add more memory) or to protect Chrome OS from applications
allocating too much memory?

Mikulas


* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
  2016-12-08 23:20     ` Mikulas Patocka
@ 2016-12-13  0:08       ` Doug Anderson
  2016-12-13 22:01         ` Doug Anderson
  0 siblings, 1 reply; 12+ messages in thread
From: Doug Anderson @ 2016-12-13  0:08 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alasdair Kergon, Mike Snitzer, Shaohua Li, Dmitry Torokhov,
	linux-kernel, linux-raid, dm-devel, David Rientjes, Sonny Rao,
	Guenter Roeck

Hi,

On Thu, Dec 8, 2016 at 3:20 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> The /dev/mmcblk0 device is congested. You can see the number of requests
> in /sys/block/mmcblk0/inflight

Hrm.  Doing some tests doesn't show that to be true.

I ran this command while doing my test (the "balloon" test described
in my patch, not running the computer as a real user):

  while true;
    do grep -v '0 *0' /sys/block/*/inflight;
    sleep .1;
  done

The largest number I ever saw was:

  /sys/block/mmcblk0/inflight:       6        0

...and in one case when I saw a "long bufio" message I was seeing:

/sys/block/zram0/inflight:       1        0
/sys/block/zram0/inflight:       0        1
/sys/block/zram0/inflight:       0        1
/sys/block/zram0/inflight:       1        1
/sys/block/zram0/inflight:       0        1


>> As stated in the original email, I'm running on a downstream kernel
>> (kernel-4.4) with bunches of local patches, so it's plausible that
>> things have changed in the meantime, but:
>>
>> * At boot time the "nr_requests" for all block devices was 128
>> * I was unable to set the "nr_requests" for dm-0 and zram0 (it just
>> gives an error in sysfs).
>> * When I set "nr_requests" to 4096 for /dev/mmcblk0 it didn't seem to
>> affect the problem.
>
> The eMMC has a limited IOPS capacity and that can't be improved. Use a
> faster block device - but it will cost more money.

It's so strange, since when I see this failure I'm not even doing lots
of eMMC traffic.  As per the above, my swap goes to zram and there's no
disk backing, so there should be very minimal disk usage.

If I run iostat now, after having been running my test for a while (and
having seen a bunch of "long bufio" messages), you can see that there
hasn't exactly been lots of activity.

localhost debug # iostat
Linux 4.4.21 (localhost)        12/09/16        _aarch64_       (6 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.11   33.87   36.63    1.78    0.00   21.61

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
loop0             2.74        39.01        68.95      33529      59264
loop1             0.07         1.23         0.00       1055          0
loop2             0.04         0.13         0.00        112          0
loop3             0.00         0.00         0.00          4          0
loop4             0.00         0.01         0.00          8          0
loop5             0.00         0.01         0.00          8          0
mmcblk1           0.56        44.45         0.00      38202          0
mmcblk0          14.37       636.29        38.74     546888      33300
mmcblk0rpmb       0.01         0.03         0.00         28          0
mmcblk0boot1      0.03         0.15         0.00        128          0
mmcblk0boot0      0.03         0.15         0.00        128          0
dm-0             11.50       419.25         0.00     360344          0
dm-1              2.15        36.53        68.95      31401      59264
zram0         42103.26     84111.64     84301.42   72293956   72457068

---

OK, so I just put a printk in wait_iff_congested() and it didn't show
me waiting for the timeout (!).  I know that I saw
wait_iff_congested() in the original reproduction of this problem,
but it appears that in my little "balloon" reproduction it's not
actually involved...


...I dug further and it appears that __alloc_pages_direct_reclaim() is
actually what's slow.  Specifically it looks as if shrink_zone() can
actually take quite a while.  As I've said, I'm not an expert on the
memory manager but I'm not convinced that it's wrong for the direct
reclaim path to be pretty slow at times, especially when I'm putting
an abnormally high amount of stress on it.
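
For concreteness, the kind of crude check I'm describing is just
jiffies-based timing wrapped around the suspect call - a sketch with a
made-up helper name, not the exact diff I tested, and with jiffies the
resolution is one tick, so it depends on HZ:

  #include <linux/jiffies.h>
  #include <linux/printk.h>

  /* Complain if the wrapped call takes more than a second. */
  #define TIME_RECLAIM_STEP(call)                                     \
  do {                                                                \
          unsigned long __start = jiffies;                            \
                                                                      \
          call;                                                       \
          if (time_after(jiffies, __start + HZ))                      \
                  pr_info("DOUG: %s took %u ms\n", #call,             \
                          jiffies_to_msecs(jiffies - __start));       \
  } while (0)

...used by wrapping the call into shrink_zone() in mm/vmscan.c (or
whatever step is under suspicion) with it.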

I'm going to take this as further evidence that the patch being
discussed in this thread is a good one (AKA don't hold the dm bufio
lock while allocating memory).  :)  If it's unexpected that
shrink_zone() might take several seconds when under extreme memory
pressure then I can do some additional digging.  Do note that I am
running with "zram" and remember that I'm on an ancient 4.4-based
kernel, so perhaps one of those two factors causes problems.


> If you want to handle such a situation where you run 4 tasks each eating
> 900MB, just use more memory; don't expect that this will work smoothly on
> a 4GB machine.
>
> If you want to protect the chromebook from runaway memory allocations, you
> can detect this situation in some watchdog process and either kill the
> process that consumes most memory with the kill syscall or trigger the
> kernel OOM killer by writing 'f' to /proc/sysrq-trigger.
>
> The question is what you really want - handle this situation smoothly
> (then, you must add more memory) or protect chromeOS from applications
> allocating too much memory?

These are definitely important things to think about.  In the end user
bug report, though, we were seeing this super slow / janky behavior
_without_ a runaway process--they were just using memory (and swap)
normally.  I've simulated it with the "balloon" process, but users
were seeing it without anything so crazy.


-Doug

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
  2016-12-13  0:08       ` Doug Anderson
@ 2016-12-13 22:01         ` Doug Anderson
  0 siblings, 0 replies; 12+ messages in thread
From: Doug Anderson @ 2016-12-13 22:01 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alasdair Kergon, Mike Snitzer, Shaohua Li, Dmitry Torokhov,
	linux-kernel, linux-raid, dm-devel, David Rientjes, Sonny Rao,
	Guenter Roeck

Hi,

On Mon, Dec 12, 2016 at 4:08 PM, Doug Anderson <dianders@chromium.org> wrote:
> OK, so I just put a printk in wait_iff_congested() and it didn't show
> me waiting for the timeout (!).  I know that I saw
> wait_iff_congested() in the original reproduction of this problem,
> but it appears that in my little "balloon" reproduction it's not
> actually involved...
>
>
> ...I dug further and it appears that __alloc_pages_direct_reclaim() is
> actually what's slow.  Specifically it looks as if shrink_zone() can
> actually take quite a while.  As I've said, I'm not an expert on the
> memory manager but I'm not convinced that it's wrong for the direct
> reclaim path to be pretty slow at times, especially when I'm putting
> an abnormally high amount of stress on it.
>
> I'm going to take this as further evidence that the patch being
> discussed in this thread is a good one (AKA don't hold the dm bufio
> lock while allocating memory).  :)  If it's unexpected that
> shrink_zone() might take several seconds when under extreme memory
> pressure then I can do some additional digging.  Do note that I am
> running with "zram" and remember that I'm on an ancient 4.4-based
> kernel, so perhaps one of those two factors causes problems.

Sadly, I couldn't let this go as just "the way things were" in case
there was some major speedup to be had here.  :-P

I tracked this down to shrink_list() taking 1 ms per call (perhaps
because I have HZ=1000?) and to the outer loop in shrink_lruvec()
running many thousands of times.  Thus the total time taken by
shrink_lruvec() could easily be many seconds.

Wow, interesting: when I change HZ to 100 instead of 1000, the
behavior changes quite a bit.  I can still get my bufio lock warning
easily, but all of a sudden shrink_lruvec() isn't slow.  :-P

OK, really truly going to stop digging further now...  ;)  Presumably
reporting weird behaviors with old kernels doesn't help anyone in
mainline, and I can buy the whole "memory accesses are slow when you
start thrashing the system" argument.

-Doug

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
  2016-11-23 20:57 ` Mikulas Patocka
  2016-12-08  0:54   ` Doug Anderson
@ 2016-12-15  0:53   ` Doug Anderson
  2016-12-15  0:55     ` Doug Anderson
  1 sibling, 1 reply; 12+ messages in thread
From: Doug Anderson @ 2016-12-15  0:53 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer
  Cc: Alasdair Kergon, Shaohua Li, Dmitry Torokhov, linux-kernel,
	linux-raid, dm-devel, David Rientjes, Sonny Rao, Guenter Roeck

Hi,

On Wed, Nov 23, 2016 at 12:57 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> Hi
>
> The GFP_NOIO allocation frees clean cached pages. The GFP_NOWAIT
> allocation doesn't. Your patch would incorrectly reuse buffers in a
> situation when the memory is filled with clean cached pages.
>
> Here I'm proposing an alternate patch that first tries GFP_NOWAIT
> allocation, then drops the lock and tries GFP_NOIO allocation.
>
> Note that the root cause of the stacktrace you are seeing is that your
> block device is congested - i.e. there are too many requests in the
> device's queue - and note that fixing this wait won't fix the root cause
> (congested device).
>
> The congestion limits are set in blk_queue_congestion_threshold to between
> 7/8 and 13/16 of the nr_requests value.
>
> If you don't want your device to report the congested status, you can
> increase /sys/block/<device>/queue/nr_requests - you should test if your
> chromebook is faster or slower with this setting increased. But note that
> this setting won't increase the IO-per-second of the device.
>
> Mikulas
>
>
> On Thu, 17 Nov 2016, Douglas Anderson wrote:
>
>> We've seen in-field reports showing _lots_ (18 in one case, 41 in
>> another) of tasks all sitting there blocked on:
>>
>>   mutex_lock+0x4c/0x68
>>   dm_bufio_shrink_count+0x38/0x78
>>   shrink_slab.part.54.constprop.65+0x100/0x464
>>   shrink_zone+0xa8/0x198
>>
>> In the two cases analyzed, we see one task that looks like this:
>>
>>   Workqueue: kverityd verity_prefetch_io
>>
>>   __switch_to+0x9c/0xa8
>>   __schedule+0x440/0x6d8
>>   schedule+0x94/0xb4
>>   schedule_timeout+0x204/0x27c
>>   schedule_timeout_uninterruptible+0x44/0x50
>>   wait_iff_congested+0x9c/0x1f0
>>   shrink_inactive_list+0x3a0/0x4cc
>>   shrink_lruvec+0x418/0x5cc
>>   shrink_zone+0x88/0x198
>>   try_to_free_pages+0x51c/0x588
>>   __alloc_pages_nodemask+0x648/0xa88
>>   __get_free_pages+0x34/0x7c
>>   alloc_buffer+0xa4/0x144
>>   __bufio_new+0x84/0x278
>>   dm_bufio_prefetch+0x9c/0x154
>>   verity_prefetch_io+0xe8/0x10c
>>   process_one_work+0x240/0x424
>>   worker_thread+0x2fc/0x424
>>   kthread+0x10c/0x114
>>
>> ...and that looks to be the one holding the mutex.
>>
>> The problem has been reproduced fairly easily:
>> 0. Be running Chrome OS w/ verity enabled on the root filesystem
>> 1. Pick test patch: http://crosreview.com/412360
>> 2. Install launchBalloons.sh and balloon.arm from
>>      http://crbug.com/468342
>>    ...that's just a memory stress test app.
>> 3. On a 4GB rk3399 machine, run
>>      nice ./launchBalloons.sh 4 900 100000
>>    ...that tries to eat 4 * 900 MB of memory and keep accessing.
>> 4. Login to the Chrome web browser and restore many tabs
>>
>> With that, I've seen printouts like:
>>   DOUG: long bufio 90758 ms
>> ...and the stack trace always shows we're in dm_bufio_prefetch().
>>
>> The problem is that we try to allocate memory with GFP_NOIO while
>> we're holding the dm_bufio lock.  Instead we should be using
>> GFP_NOWAIT.  Using GFP_NOIO can cause us to sleep while holding the
>> lock and that causes the above problems.
>>
>> The current behavior explained by David Rientjes:
>>
>>   It will still try reclaim initially because __GFP_WAIT (or
>>   __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO.  This is the cause of
>>   contention on dm_bufio_lock() that the thread holds.  You want to
>>   pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a
>>   mutex that can be contended by a concurrent slab shrinker (if
>>   count_objects didn't use a trylock, this pattern would trivially
>>   deadlock).
>>
>> Suggested-by: David Rientjes <rientjes@google.com>
>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>> ---
>> Note that this change was developed and tested against the Chrome OS
>> 4.4 kernel tree, not mainline.  Due to slight differences in verity
>> between mainline and Chrome OS it became too difficult to reproduce my
>> testing setup on mainline.  This patch still seems correct and
>> relevant to upstream, so I'm posting it.  If this is not acceptable to
>> you then please ignore this patch.
>>
>> Also note that when I tested the Chrome OS 3.14 kernel tree I couldn't
>> reproduce the long delays described in the patch.  Presumably
>> something changed in either the kernel config or the memory management
>> code between the two kernel versions that made this crop up.  In a
>> similar vein, it is possible that problems described in this patch are
>> no longer reproducible upstream.  However, the arguments made in this
>> patch (that we don't want to block while holding the mutex) still
>> apply so I think the patch may still have merit.
>>
>>  drivers/md/dm-bufio.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
>> index b3ba142e59a4..3c767399cc59 100644
>> --- a/drivers/md/dm-bufio.c
>> +++ b/drivers/md/dm-bufio.c
>> @@ -827,7 +827,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>>        * dm-bufio is resistant to allocation failures (it just keeps
>>        * one buffer reserved in cases all the allocations fail).
>>        * So set flags to not try too hard:
>> -      *      GFP_NOIO: don't recurse into the I/O layer
>> +      *      GFP_NOWAIT: don't wait; if we need to sleep we'll release our
>> +      *                  mutex and wait ourselves.
>>        *      __GFP_NORETRY: don't retry and rather return failure
>>        *      __GFP_NOMEMALLOC: don't use emergency reserves
>>        *      __GFP_NOWARN: don't print a warning in case of failure
>> @@ -837,7 +838,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>>        */
>>       while (1) {
>>               if (dm_bufio_cache_size_latch != 1) {
>> -                     b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>> +                     b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY |
>> +                                      __GFP_NOMEMALLOC | __GFP_NOWARN);
>>                       if (b)
>>                               return b;
>>               }
>> --
>> 2.8.0.rc3.226.g39d4020
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel
>>
>
> From: Mikulas Patocka <mpatocka@redhat.com>
>
> Subject: dm-bufio: drop the lock when doing GFP_NOIO allocation
>
> Drop the lock when doing GFP_NOIO allocation because the allocation can
> take some time.
>
> Note that we won't do GFP_NOIO allocation when we loop for the second
> time, because the lock shouldn't be dropped between __wait_for_free_buffer
> and __get_unclaimed_buffer.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
>  drivers/md/dm-bufio.c |   13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/drivers/md/dm-bufio.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-bufio.c
> +++ linux-2.6/drivers/md/dm-bufio.c
> @@ -822,11 +822,13 @@ enum new_flag {
>  static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
>  {
>         struct dm_buffer *b;
> +       bool tried_noio_alloc = false;
>
>         /*
>          * dm-bufio is resistant to allocation failures (it just keeps
>          * one buffer reserved in cases all the allocations fail).
>          * So set flags to not try too hard:
> +        *      GFP_NOWAIT: don't sleep and don't release cache
>          *      GFP_NOIO: don't recurse into the I/O layer
>          *      __GFP_NORETRY: don't retry and rather return failure
>          *      __GFP_NOMEMALLOC: don't use emergency reserves
> @@ -837,7 +839,7 @@ static struct dm_buffer *__alloc_buffer_
>          */
>         while (1) {
>                 if (dm_bufio_cache_size_latch != 1) {
> -                       b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +                       b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>                         if (b)
>                                 return b;
>                 }
> @@ -845,6 +847,15 @@ static struct dm_buffer *__alloc_buffer_
>                 if (nf == NF_PREFETCH)
>                         return NULL;
>
> +               if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> +                       dm_bufio_unlock(c);
> +                       b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +                       dm_bufio_lock(c);
> +                       if (b)
> +                               return b;
> +                       tried_noio_alloc = true;
> +               }
> +
>                 if (!list_empty(&c->reserved_buffers)) {
>                         b = list_entry(c->reserved_buffers.next,
>                                        struct dm_buffer, lru_list);

I saw a git pull go by today from Mike Snitzer with my version of the
patch in it.  I think this is fine because I think my version of the
patch works all right, but I think Mikulas's version of the patch (see
above) is even better.

Since the "git pull" was to Linus and I believe that my version of the
patch is functional (even if it's not optimal), maybe the right thing
to do is to send a new patch with Mikulas's changes atop mine.
Mikulas: does that sound good to you?  Do you want to send it?

Thanks!

-Doug

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
  2016-12-15  0:53   ` Doug Anderson
@ 2016-12-15  0:55     ` Doug Anderson
  0 siblings, 0 replies; 12+ messages in thread
From: Doug Anderson @ 2016-12-15  0:55 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer
  Cc: Alasdair Kergon, Shaohua Li, Dmitry Torokhov, linux-kernel,
	linux-raid, dm-devel, David Rientjes, Sonny Rao, Guenter Roeck

Hi,

On Wed, Dec 14, 2016 at 4:53 PM, Doug Anderson <dianders@chromium.org> wrote:
> Hi,
>
> On Wed, Nov 23, 2016 at 12:57 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> Hi
>>
>> The GFP_NOIO allocation frees clean cached pages. The GFP_NOWAIT
>> allocation doesn't. Your patch would incorrectly reuse buffers in a
>> situation when the memory is filled with clean cached pages.
>>
>> Here I'm proposing an alternate patch that first tries GFP_NOWAIT
>> allocation, then drops the lock and tries GFP_NOIO allocation.
>>
>> Note that the root cause of the stacktrace you are seeing is that your
>> block device is congested - i.e. there are too many requests in the
>> device's queue - and note that fixing this wait won't fix the root cause
>> (congested device).
>>
>> The congestion limits are set in blk_queue_congestion_threshold to between
>> 7/8 and 13/16 of the nr_requests value.
>>
>> If you don't want your device to report the congested status, you can
>> increase /sys/block/<device>/queue/nr_requests - you should test if your
>> chromebook is faster or slower with this setting increased. But note that
>> this setting won't increase the IO-per-second of the device.
>>
>> Mikulas
>>
>>
>> On Thu, 17 Nov 2016, Douglas Anderson wrote:
>>
>>> We've seen in-field reports showing _lots_ (18 in one case, 41 in
>>> another) of tasks all sitting there blocked on:
>>>
>>>   mutex_lock+0x4c/0x68
>>>   dm_bufio_shrink_count+0x38/0x78
>>>   shrink_slab.part.54.constprop.65+0x100/0x464
>>>   shrink_zone+0xa8/0x198
>>>
>>> In the two cases analyzed, we see one task that looks like this:
>>>
>>>   Workqueue: kverityd verity_prefetch_io
>>>
>>>   __switch_to+0x9c/0xa8
>>>   __schedule+0x440/0x6d8
>>>   schedule+0x94/0xb4
>>>   schedule_timeout+0x204/0x27c
>>>   schedule_timeout_uninterruptible+0x44/0x50
>>>   wait_iff_congested+0x9c/0x1f0
>>>   shrink_inactive_list+0x3a0/0x4cc
>>>   shrink_lruvec+0x418/0x5cc
>>>   shrink_zone+0x88/0x198
>>>   try_to_free_pages+0x51c/0x588
>>>   __alloc_pages_nodemask+0x648/0xa88
>>>   __get_free_pages+0x34/0x7c
>>>   alloc_buffer+0xa4/0x144
>>>   __bufio_new+0x84/0x278
>>>   dm_bufio_prefetch+0x9c/0x154
>>>   verity_prefetch_io+0xe8/0x10c
>>>   process_one_work+0x240/0x424
>>>   worker_thread+0x2fc/0x424
>>>   kthread+0x10c/0x114
>>>
>>> ...and that looks to be the one holding the mutex.
>>>
>>> The problem has been reproduced fairly easily:
>>> 0. Be running Chrome OS w/ verity enabled on the root filesystem
>>> 1. Pick test patch: http://crosreview.com/412360
>>> 2. Install launchBalloons.sh and balloon.arm from
>>>      http://crbug.com/468342
>>>    ...that's just a memory stress test app.
>>> 3. On a 4GB rk3399 machine, run
>>>      nice ./launchBalloons.sh 4 900 100000
>>>    ...that tries to eat 4 * 900 MB of memory and keep accessing.
>>> 4. Login to the Chrome web browser and restore many tabs
>>>
>>> With that, I've seen printouts like:
>>>   DOUG: long bufio 90758 ms
>>> ...and the stack trace always shows we're in dm_bufio_prefetch().
>>>
>>> The problem is that we try to allocate memory with GFP_NOIO while
>>> we're holding the dm_bufio lock.  Instead we should be using
>>> GFP_NOWAIT.  Using GFP_NOIO can cause us to sleep while holding the
>>> lock and that causes the above problems.
>>>
>>> The current behavior explained by David Rientjes:
>>>
>>>   It will still try reclaim initially because __GFP_WAIT (or
>>>   __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO.  This is the cause of
>>>   contention on dm_bufio_lock() that the thread holds.  You want to
>>>   pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a
>>>   mutex that can be contended by a concurrent slab shrinker (if
>>>   count_objects didn't use a trylock, this pattern would trivially
>>>   deadlock).
>>>
>>> Suggested-by: David Rientjes <rientjes@google.com>
>>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>>> ---
>>> Note that this change was developed and tested against the Chrome OS
>>> 4.4 kernel tree, not mainline.  Due to slight differences in verity
>>> between mainline and Chrome OS it became too difficult to reproduce my
>>> testing setup on mainline.  This patch still seems correct and
>>> relevant to upstream, so I'm posting it.  If this is not acceptable to
>>> you then please ignore this patch.
>>>
>>> Also note that when I tested the Chrome OS 3.14 kernel tree I couldn't
>>> reproduce the long delays described in the patch.  Presumably
>>> something changed in either the kernel config or the memory management
>>> code between the two kernel versions that made this crop up.  In a
>>> similar vein, it is possible that problems described in this patch are
>>> no longer reproducible upstream.  However, the arguments made in this
>>> patch (that we don't want to block while holding the mutex) still
>>> apply so I think the patch may still have merit.
>>>
>>>  drivers/md/dm-bufio.c | 6 ++++--
>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
>>> index b3ba142e59a4..3c767399cc59 100644
>>> --- a/drivers/md/dm-bufio.c
>>> +++ b/drivers/md/dm-bufio.c
>>> @@ -827,7 +827,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>>>        * dm-bufio is resistant to allocation failures (it just keeps
>>>        * one buffer reserved in cases all the allocations fail).
>>>        * So set flags to not try too hard:
>>> -      *      GFP_NOIO: don't recurse into the I/O layer
>>> +      *      GFP_NOWAIT: don't wait; if we need to sleep we'll release our
>>> +      *                  mutex and wait ourselves.
>>>        *      __GFP_NORETRY: don't retry and rather return failure
>>>        *      __GFP_NOMEMALLOC: don't use emergency reserves
>>>        *      __GFP_NOWARN: don't print a warning in case of failure
>>> @@ -837,7 +838,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>>>        */
>>>       while (1) {
>>>               if (dm_bufio_cache_size_latch != 1) {
>>> -                     b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>>> +                     b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY |
>>> +                                      __GFP_NOMEMALLOC | __GFP_NOWARN);
>>>                       if (b)
>>>                               return b;
>>>               }
>>> --
>>> 2.8.0.rc3.226.g39d4020
>>>
>>> --
>>> dm-devel mailing list
>>> dm-devel@redhat.com
>>> https://www.redhat.com/mailman/listinfo/dm-devel
>>>
>>
>> From: Mikulas Patocka <mpatocka@redhat.com>
>>
>> Subject: dm-bufio: drop the lock when doing GFP_NOIO allocation
>>
>> Drop the lock when doing GFP_NOIO allocation because the allocation can
>> take some time.
>>
>> Note that we won't do GFP_NOIO allocation when we loop for the second
>> time, because the lock shouldn't be dropped between __wait_for_free_buffer
>> and __get_unclaimed_buffer.
>>
>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>
>> ---
>>  drivers/md/dm-bufio.c |   13 ++++++++++++-
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> Index: linux-2.6/drivers/md/dm-bufio.c
>> ===================================================================
>> --- linux-2.6.orig/drivers/md/dm-bufio.c
>> +++ linux-2.6/drivers/md/dm-bufio.c
>> @@ -822,11 +822,13 @@ enum new_flag {
>>  static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
>>  {
>>         struct dm_buffer *b;
>> +       bool tried_noio_alloc = false;
>>
>>         /*
>>          * dm-bufio is resistant to allocation failures (it just keeps
>>          * one buffer reserved in cases all the allocations fail).
>>          * So set flags to not try too hard:
>> +        *      GFP_NOWAIT: don't sleep and don't release cache
>>          *      GFP_NOIO: don't recurse into the I/O layer
>>          *      __GFP_NORETRY: don't retry and rather return failure
>>          *      __GFP_NOMEMALLOC: don't use emergency reserves
>> @@ -837,7 +839,7 @@ static struct dm_buffer *__alloc_buffer_
>>          */
>>         while (1) {
>>                 if (dm_bufio_cache_size_latch != 1) {
>> -                       b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>> +                       b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>>                         if (b)
>>                                 return b;
>>                 }
>> @@ -845,6 +847,15 @@ static struct dm_buffer *__alloc_buffer_
>>                 if (nf == NF_PREFETCH)
>>                         return NULL;
>>
>> +               if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
>> +                       dm_bufio_unlock(c);
>> +                       b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>> +                       dm_bufio_lock(c);
>> +                       if (b)
>> +                               return b;
>> +                       tried_noio_alloc = true;
>> +               }
>> +
>>                 if (!list_empty(&c->reserved_buffers)) {
>>                         b = list_entry(c->reserved_buffers.next,
>>                                        struct dm_buffer, lru_list);
>
> I saw a git pull go by today from Mike Snitzer with my version of the
> patch in it.  I think this is fine because I think my version of the
> patch works all right, but I think Mikulas's version of the patch (see
> above) is even better.
>
> Since the "git pull" was to Linus and I believe that my version of the
> patch is functional (even if it's not optimal), maybe the right thing
> to do is to send a new patch with Mikulas's changes atop mine.
> Mikulas: does that sound good to you?  Do you want to send it?

Dang, or I could read the full list of patches in the pull and realize
you guys already did that.  Sigh.  Sorry for the noise.

-Doug

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-12-15  0:55 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-17 19:24 [PATCH] dm: Avoid sleeping while holding the dm_bufio lock Douglas Anderson
2016-11-17 20:28 ` Mike Snitzer
2016-11-17 20:44   ` Doug Anderson
2016-11-17 20:48     ` Mike Snitzer
2016-11-17 21:56 ` [PATCH] " Guenter Roeck
2016-11-23 20:57 ` Mikulas Patocka
2016-12-08  0:54   ` Doug Anderson
2016-12-08 23:20     ` Mikulas Patocka
2016-12-13  0:08       ` Doug Anderson
2016-12-13 22:01         ` Doug Anderson
2016-12-15  0:53   ` Doug Anderson
2016-12-15  0:55     ` Doug Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).