* [PATCH] blk: improve order of bio handling in generic_make_request()
@ 2017-03-03  5:14 NeilBrown
  2017-03-03  9:28 ` Jack Wang
  0 siblings, 1 reply; 37+ messages in thread
From: NeilBrown @ 2017-03-03  5:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: LKML, Jinpu Wang, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka



[ Hi Jens,
  you might have seen assorted email threads recently about
  deadlocks, particularly in dm-snap or md/raid1/10.  Also about
  the excess of rescuer threads.
  I think a big part of the problem is my ancient improvement
  to generic_make_request to queue bios and handle them in
  a strict FIFO order.  As described below, that can cause
  problems which individual block devices cannot fix themselves
  without punting to various threads.
  This patch does not fix everything, but provides a basis that
  drivers can build on to create deadlock-free solutions without
  excess threads.
  If you accept this, I will look into improving at least md
  and bio_alloc_set() to be less dependent on rescuer threads.
  Thanks,
  NeilBrown
 ]


To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling.  They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially.  If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete.  This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device.  Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order.  That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent.  That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue.  However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn.  After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, is not enough to remove the deadlocks.  It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request.  This includes never
allocating from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return.  The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off.  If it splits again, the same process happens.  In
each case one bio will be completely handled before the next one is attempted.
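
In driver terms the pattern looks something like the sketch below.
This is only an illustration: my_make_request, my_split_sectors,
handle_bio and my_bio_set are hypothetical stand-ins, not existing
symbols.

static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
{
	sector_t max = my_split_sectors(q, bio);

	if (bio_sectors(bio) > max) {
		/* split off the first part; "bio" becomes the remainder */
		struct bio *split = bio_split(bio, max, GFP_NOIO, my_bio_set);

		bio_chain(split, bio);
		/* queue the remainder; generic_make_request() will only
		 * return to it after everything submitted while handling
		 * "split" has been processed
		 */
		generic_make_request(bio);
		bio = split;
	}
	handle_bio(q, bio);	/* handle exactly one bio, then return */
	return BLK_QC_T_NONE;
}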

With this in place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.

Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b9e857f4afe8..ef55f210dd7c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list hold;
+			struct bio_list lower, same;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
 
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
+
 			bio = bio_list_pop(current->bio_list);
 		} else {
 			struct bio *bio_next = bio_list_pop(current->bio_list);
-- 
2.11.0




* Re: [PATCH] blk: improve order of bio handling in generic_make_request()
  2017-03-03  5:14 [PATCH] blk: improve order of bio handling in generic_make_request() NeilBrown
@ 2017-03-03  9:28 ` Jack Wang
  2017-03-06  4:40   ` NeilBrown
  0 siblings, 1 reply; 37+ messages in thread
From: Jack Wang @ 2017-03-03  9:28 UTC (permalink / raw)
  To: NeilBrown, Jens Axboe
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka



On 03.03.2017 06:14, NeilBrown wrote:
> 
> [ Hi Jens,
>   you might have seen assorted email threads recently about
>   deadlocks, particularly in dm-snap or md/raid1/10.  Also about
>   the excess of rescuer threads.
>   I think a big part of the problem is my ancient improvement
>   to generic_make_request to queue bios and handle them in
>   a strict FIFO order.  As described below, that can cause
>   problems which individual block devices cannot fix themselves
>   without punting to various threads.
>   This patch does not fix everything, but provides a basis that
>   drivers can build on to create deadlock-free solutions without
>   excess threads.
>   If you accept this, I will look into improving at least md
>   and bio_alloc_set() to be less dependent on rescuer threads.
>   Thanks,
>   NeilBrown
>  ]
> 
> 
> To avoid recursion on the kernel stack when stacked block devices
> are in use, generic_make_request() will, when called recursively,
> queue new requests for later handling.  They will be handled when the
> make_request_fn for the current bio completes.
> 
> If any bios are submitted by a make_request_fn, these will ultimately
> be handled sequentially.  If the handling of one of those generates
> further requests, they will be added to the end of the queue.
> 
> This strict first-in-first-out behaviour can lead to deadlocks in
> various ways, normally because a request might need to wait for a
> previous request to the same device to complete.  This can happen when
> they share a mempool, and can happen due to interdependencies
> particular to the device.  Both md and dm have examples where this happens.
> 
> These deadlocks can be eradicated by more selective ordering of bios.
> Specifically by handling them in depth-first order.  That is: when the
> handling of one bio generates one or more further bios, they are
> handled immediately after the parent, before any siblings of the
> parent.  That way, when generic_make_request() calls make_request_fn
> for some particular device, we can be certain that all previously
> submitted requests for that device have been completely handled and are
> not waiting for anything in the queue of requests maintained in
> generic_make_request().
> 
> An easy way to achieve this would be to use a last-in-first-out stack
> instead of a queue.  However this will change the order of consecutive
> bios submitted by a make_request_fn, which could have unexpected consequences.
> Instead we take a slightly more complex approach.
> A fresh queue is created for each call to a make_request_fn.  After it completes,
> any bios for a different device are placed on the front of the main queue, followed
> by any bios for the same device, followed by all bios that were already on
> the queue before the make_request_fn was called.
> This provides the depth-first approach without reordering bios on the same level.
> 
> This, by itself, is not enough to remove the deadlocks.  It just makes
> it possible for drivers to take the extra step required themselves.
> 
> To avoid deadlocks, drivers must never risk waiting for a request
> after submitting one to generic_make_request.  This includes never
> allocating from a mempool twice in the one call to a make_request_fn.
> 
> A common pattern in drivers is to call bio_split() in a loop, handling
> the first part and then looping around to possibly split the next part.
> Instead, a driver that finds it needs to split a bio should queue
> (with generic_make_request) the second part, handle the first part,
> and then return.  The new code in generic_make_request will ensure the
> requests to underlying bios are processed first, then the second bio
> that was split off.  If it splits again, the same process happens.  In
> each case one bio will be completely handled before the next one is attempted.
> 
> With this in place, it should be possible to disable the
> punt_bios_to_recover() recovery thread for many block devices, and
> eventually it may be possible to remove it completely.
> 
> Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
> Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  block/blk-core.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b9e857f4afe8..ef55f210dd7c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>  		if (likely(blk_queue_enter(q, false) == 0)) {
> +			struct bio_list hold;
> +			struct bio_list lower, same;
> +
> +			/* Create a fresh bio_list for all subordinate requests */
> +			bio_list_init(&hold);
> +			bio_list_merge(&hold, &bio_list_on_stack);
> +			bio_list_init(&bio_list_on_stack);
>  			ret = q->make_request_fn(q, bio);
>  
>  			blk_queue_exit(q);
>  
> +			/* sort new bios into those for a lower level
> +			 * and those for the same level
> +			 */
> +			bio_list_init(&lower);
> +			bio_list_init(&same);
> +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> +				if (q == bdev_get_queue(bio->bi_bdev))
> +					bio_list_add(&same, bio);
> +				else
> +					bio_list_add(&lower, bio);
> +			/* now assemble so we handle the lowest level first */
> +			bio_list_merge(&bio_list_on_stack, &lower);
> +			bio_list_merge(&bio_list_on_stack, &same);
> +			bio_list_merge(&bio_list_on_stack, &hold);
> +
>  			bio = bio_list_pop(current->bio_list);
>  		} else {
>  			struct bio *bio_next = bio_list_pop(current->bio_list);
> 

Thanks Neil for pushing the fix.

We can optimize generic_make_request a little bit:
- assign the bio_list struct hold directly instead of init and merge
- remove duplicate code

I think it's better to squash this into your fix.
---
 block/blk-core.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3bc7202..b29b7e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 			struct bio_list lower, same, hold;
 
 			/* Create a fresh bio_list for all subordinate requests */
-			bio_list_init(&hold);
-			bio_list_merge(&hold, &bio_list_on_stack);
+			hold = bio_list_on_stack;
 			bio_list_init(&bio_list_on_stack);
 
 			ret = q->make_request_fn(q, bio);
@@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
 			bio_list_merge(&bio_list_on_stack, &lower);
 			bio_list_merge(&bio_list_on_stack, &same);
 			bio_list_merge(&bio_list_on_stack, &hold);
-
-			bio = bio_list_pop(current->bio_list);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(current->bio_list);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 
-- 

Regards,
Jack 


* Re: [PATCH] blk: improve order of bio handling in generic_make_request()
  2017-03-03  9:28 ` Jack Wang
@ 2017-03-06  4:40   ` NeilBrown
  2017-03-06  9:43     ` Jack Wang
  2017-03-06 20:18     ` Jens Axboe
  0 siblings, 2 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-06  4:40 UTC (permalink / raw)
  To: Jack Wang, Jens Axboe
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka


On Fri, Mar 03 2017, Jack Wang wrote:
>
> Thanks Neil for pushing the fix.
>
> We can optimize generic_make_request a little bit:
> - assign the bio_list struct hold directly instead of init and merge
> - remove duplicate code
>
> I think it's better to squash this into your fix.

Hi Jack,
 I don't object to your changes, but I'd like to see a response from
 Jens first.
 My preference would be to get the original patch in, then other changes
 that build on it, such as this one, can be added.  Until the core
 change lands, any other work is pointless.

 Of course if Jens wants this merged before he'll apply it, I'll
 happily do that.

Thanks,
NeilBrown



> ---
>  block/blk-core.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3bc7202..b29b7e5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>  			struct bio_list lower, same, hold;
>  
>  			/* Create a fresh bio_list for all subordinate requests */
> -			bio_list_init(&hold);
> -			bio_list_merge(&hold, &bio_list_on_stack);
> +			hold = bio_list_on_stack;
>  			bio_list_init(&bio_list_on_stack);
>  
>  			ret = q->make_request_fn(q, bio);
> @@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>  			bio_list_merge(&bio_list_on_stack, &lower);
>  			bio_list_merge(&bio_list_on_stack, &same);
>  			bio_list_merge(&bio_list_on_stack, &hold);
> -
> -			bio = bio_list_pop(current->bio_list);
>  		} else {
> -			struct bio *bio_next = bio_list_pop(current->bio_list);
> -
>  			bio_io_error(bio);
> -			bio = bio_next;
>  		}
> +		bio = bio_list_pop(current->bio_list);
>  	} while (bio);
>  	current->bio_list = NULL; /* deactivate */
>  
> -- 
>
> Regards,
> Jack 



* Re: [PATCH] blk: improve order of bio handling in generic_make_request()
  2017-03-06  4:40   ` NeilBrown
@ 2017-03-06  9:43     ` Jack Wang
  2017-03-07 15:46       ` Pavel Machek
  2017-03-06 20:18     ` Jens Axboe
  1 sibling, 1 reply; 37+ messages in thread
From: Jack Wang @ 2017-03-06  9:43 UTC (permalink / raw)
  To: NeilBrown, Jens Axboe
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka



On 06.03.2017 05:40, NeilBrown wrote:
> On Fri, Mar 03 2017, Jack Wang wrote:
>>
>> Thanks Neil for pushing the fix.
>>
>> We can optimize generic_make_request a little bit:
>> - assign the bio_list struct hold directly instead of init and merge
>> - remove duplicate code
>>
>> I think it's better to squash this into your fix.
> 
> Hi Jack,
>  I don't object to your changes, but I'd like to see a response from
>  Jens first.
>  My preference would be to get the original patch in, then other changes
>  that build on it, such as this one, can be added.  Until the core
>  change lands, any other work is pointless.
> 
>  Of course if Jens wants this merged before he'll apply it, I'll
>  happily do that.
> 
> Thanks,
> NeilBrown

Hi Neil,

Totally agree, let's wait for Jens's decision.

Hi Jens,

Please consider this fix also for stable 4.3+ 

Thanks,
Jack Wang

> 
> 
> 
>> ---
>>  block/blk-core.c | 9 ++-------
>>  1 file changed, 2 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 3bc7202..b29b7e5 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>  			struct bio_list lower, same, hold;
>>  
>>  			/* Create a fresh bio_list for all subordinate requests */
>> -			bio_list_init(&hold);
>> -			bio_list_merge(&hold, &bio_list_on_stack);
>> +			hold = bio_list_on_stack;
>>  			bio_list_init(&bio_list_on_stack);
>>  
>>  			ret = q->make_request_fn(q, bio);
>> @@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>  			bio_list_merge(&bio_list_on_stack, &lower);
>>  			bio_list_merge(&bio_list_on_stack, &same);
>>  			bio_list_merge(&bio_list_on_stack, &hold);
>> -
>> -			bio = bio_list_pop(current->bio_list);
>>  		} else {
>> -			struct bio *bio_next = bio_list_pop(current->bio_list);
>> -
>>  			bio_io_error(bio);
>> -			bio = bio_next;
>>  		}
>> +		bio = bio_list_pop(current->bio_list);
>>  	} while (bio);
>>  	current->bio_list = NULL; /* deactivate */
>>  
>> -- 
>>
>> Regards,
>> Jack 


* Re: [PATCH] blk: improve order of bio handling in generic_make_request()
  2017-03-06  4:40   ` NeilBrown
  2017-03-06  9:43     ` Jack Wang
@ 2017-03-06 20:18     ` Jens Axboe
  2017-03-07  8:49       ` Jack Wang
  2017-03-07 20:38         ` NeilBrown
  1 sibling, 2 replies; 37+ messages in thread
From: Jens Axboe @ 2017-03-06 20:18 UTC (permalink / raw)
  To: NeilBrown, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka

On 03/05/2017 09:40 PM, NeilBrown wrote:
> On Fri, Mar 03 2017, Jack Wang wrote:
>>
>> Thanks Neil for pushing the fix.
>>
>> We can optimize generic_make_request a little bit:
>> - assign the bio_list struct hold directly instead of init and merge
>> - remove duplicate code
>>
>> I think it's better to squash this into your fix.
> 
> Hi Jack,
>  I don't object to your changes, but I'd like to see a response from
>  Jens first.
>  My preference would be to get the original patch in, then other changes
>  that build on it, such as this one, can be added.  Until the core
>  change lands, any other work is pointless.
> 
>  Of course if Jens wants this merged before he'll apply it, I'll
>  happily do that.

I like the change, and thanks for tackling this. It's been a pending
issue for way too long. I do think we should squash Jack's patch
into the original, as it does clean up the code nicely.

Do we have a proper test case for this, so we can verify that it
does indeed also work in practice?

-- 
Jens Axboe


* Re: [PATCH] blk: improve order of bio handling in generic_make_request()
  2017-03-06 20:18     ` Jens Axboe
@ 2017-03-07  8:49       ` Jack Wang
  2017-03-07 16:52         ` Mike Snitzer
  2017-03-07 20:38         ` NeilBrown
  1 sibling, 1 reply; 37+ messages in thread
From: Jack Wang @ 2017-03-07  8:49 UTC (permalink / raw)
  To: Jens Axboe, NeilBrown
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka



On 06.03.2017 21:18, Jens Axboe wrote:
> On 03/05/2017 09:40 PM, NeilBrown wrote:
>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>
>>> Thanks Neil for pushing the fix.
>>>
>>> We can optimize generic_make_request a little bit:
>>> - assign the bio_list struct hold directly instead of init and merge
>>> - remove duplicate code
>>>
>>> I think it's better to squash this into your fix.
>>
>> Hi Jack,
>>  I don't object to your changes, but I'd like to see a response from
>>  Jens first.
>>  My preference would be to get the original patch in, then other changes
>>  that build on it, such as this one, can be added.  Until the core
>>  change lands, any other work is pointless.
>>
>>  Of course if Jens wants this merged before he'll apply it, I'll
>>  happily do that.
> 
> I like the change, and thanks for tackling this. It's been a pending
> issue for way too long. I do think we should squash Jack's patch
> into the original, as it does clean up the code nicely.
> 
> Do we have a proper test case for this, so we can verify that it
> does indeed also work in practice?
> 
Hi Jens,

I can trigger a deadlock in RAID1 with the test below:

I create one md with one local loop device and one remote scsi
device exported by SRP, run fio with mixed rw on top of the md, then
force_close the session on the storage side. mdx_raid1 waits on
free_array in D state, and a lot of fio processes are also in D state
in wait_barrier.

With the patch from Neil above, I can no longer trigger it.

The discussion was in the link below:
http://www.spinics.net/lists/raid/msg54680.html

Thanks,
Jack Wang


* Re: [PATCH] blk: improve order of bio handling in generic_make_request()
  2017-03-06  9:43     ` Jack Wang
@ 2017-03-07 15:46       ` Pavel Machek
  2017-03-07 15:53         ` Jack Wang
  2017-03-07 16:21         ` Jens Axboe
  0 siblings, 2 replies; 37+ messages in thread
From: Pavel Machek @ 2017-03-07 15:46 UTC (permalink / raw)
  To: Jack Wang
  Cc: NeilBrown, Jens Axboe, LKML, Lars Ellenberg, Kent Overstreet,
	Mike Snitzer, Mikulas Patocka


On Mon 2017-03-06 10:43:59, Jack Wang wrote:
> 
> 
> On 06.03.2017 05:40, NeilBrown wrote:
> > On Fri, Mar 03 2017, Jack Wang wrote:
> >>
> >> Thanks Neil for pushing the fix.
> >>
> >> We can optimize generic_make_request a little bit:
> >> - assign the bio_list struct hold directly instead of init and merge
> >> - remove duplicate code
> >>
> >> I think it's better to squash this into your fix.
> > 
> > Hi Jack,
> >  I don't object to your changes, but I'd like to see a response from
> >  Jens first.
> >  My preference would be to get the original patch in, then other changes
> >  that build on it, such as this one, can be added.  Until the core
> >  change lands, any other work is pointless.
> > 
> >  Of course if Jens wants this merged before he'll apply it, I'll
> >  happily do that.
> > 
> > Thanks,
> > NeilBrown
> 
> Hi Neil,
> 
> Totally agree, let's wait for Jens's decision.
> 
> Hi Jens,
> 
> Please consider this fix also for stable 4.3+ 

Stable? We don't put this into stable, with the exception of minimal
fixes for real bugs...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



* Re: [PATCH] blk: improve order of bio handling in generic_make_request()
  2017-03-07 15:46       ` Pavel Machek
@ 2017-03-07 15:53         ` Jack Wang
  2017-03-07 16:21         ` Jens Axboe
  1 sibling, 0 replies; 37+ messages in thread
From: Jack Wang @ 2017-03-07 15:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: NeilBrown, Jens Axboe, LKML, Lars Ellenberg, Kent Overstreet,
	Mike Snitzer, Mikulas Patocka



On 07.03.2017 16:46, Pavel Machek wrote:
> On Mon 2017-03-06 10:43:59, Jack Wang wrote:
>>
>>
>> On 06.03.2017 05:40, NeilBrown wrote:
>>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>>
>>>> Thanks Neil for pushing the fix.
>>>>
>>>> We can optimize generic_make_request a little bit:
>>>> - assign the bio_list struct hold directly instead of init and merge
>>>> - remove duplicate code
>>>>
>>>> I think it's better to squash this into your fix.
>>>
>>> Hi Jack,
>>>  I don't object to your changes, but I'd like to see a response from
>>>  Jens first.
>>>  My preference would be to get the original patch in, then other changes
>>>  that build on it, such as this one, can be added.  Until the core
>>>  change lands, any other work is pointless.
>>>
>>>  Of course if Jens wants this merged before he'll apply it, I'll
>>>  happily do that.
>>>
>>> Thanks,
>>> NeilBrown
>>
>> Hi Neil,
>>
>> Totally agree, let's wait for Jens's decision.
>>
>> Hi Jens,
>>
>> Please consider this fix also for stable 4.3+ 
> 
> Stable? We don't put this into stable, with the exception of minimal
> fixes for real bugs...
> 									Pavel
> 
It indeed fixes a deadlock in the RAID1 case.
Please follow the link in my reply to Jens regarding the test.
It did hit us in production.

Thanks,
Jack


* Re: [PATCH] blk: improve order of bio handling in generic_make_request()
  2017-03-07 15:46       ` Pavel Machek
  2017-03-07 15:53         ` Jack Wang
@ 2017-03-07 16:21         ` Jens Axboe
  1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-03-07 16:21 UTC (permalink / raw)
  To: Pavel Machek, Jack Wang
  Cc: NeilBrown, LKML, Lars Ellenberg, Kent Overstreet, Mike Snitzer,
	Mikulas Patocka

On 03/07/2017 08:46 AM, Pavel Machek wrote:
> On Mon 2017-03-06 10:43:59, Jack Wang wrote:
>>
>>
>> On 06.03.2017 05:40, NeilBrown wrote:
>>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>>
>>>> Thanks Neil for pushing the fix.
>>>>
>>>> We can optimize generic_make_request a little bit:
>>>> - assign the bio_list struct hold directly instead of init and merge
>>>> - remove duplicate code
>>>>
>>>> I think it's better to squash this into your fix.
>>>
>>> Hi Jack,
>>>  I don't object to your changes, but I'd like to see a response from
>>>  Jens first.
>>>  My preference would be to get the original patch in, then other changes
>>>  that build on it, such as this one, can be added.  Until the core
>>>  change lands, any other work is pointless.
>>>
>>>  Of course if Jens wants this merged before he'll apply it, I'll
>>>  happily do that.
>>>
>>> Thanks,
>>> NeilBrown
>>
>> Hi Neil,
>>
>> Totally agree, let's wait for Jens's decision.
>>
>> Hi Jens,
>>
>> Please consider this fix also for stable 4.3+ 
> 
> Stable? We don't put this into stable, with the exception of minimal
> fixes for real bugs...

What are you smoking? How is this not a real bug?

-- 
Jens Axboe


* Re: blk: improve order of bio handling in generic_make_request()
  2017-03-07  8:49       ` Jack Wang
@ 2017-03-07 16:52         ` Mike Snitzer
  2017-03-07 17:05           ` Jens Axboe
  2017-03-08 11:46           ` Lars Ellenberg
  0 siblings, 2 replies; 37+ messages in thread
From: Mike Snitzer @ 2017-03-07 16:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jack Wang, NeilBrown, LKML, Lars Ellenberg, Kent Overstreet,
	Pavel Machek, Mikulas Patocka

On Tue, Mar 07 2017 at  3:49am -0500,
Jack Wang <jinpu.wang@profitbricks.com> wrote:

> 
> 
> On 06.03.2017 21:18, Jens Axboe wrote:
> > On 03/05/2017 09:40 PM, NeilBrown wrote:
> >> On Fri, Mar 03 2017, Jack Wang wrote:
> >>>
> >>> Thanks Neil for pushing the fix.
> >>>
> >>> We can optimize generic_make_request a little bit:
> >>> - assign the bio_list struct hold directly instead of init and merge
> >>> - remove duplicate code
> >>>
> >>> I think it's better to squash this into your fix.
> >>
> >> Hi Jack,
> >>  I don't object to your changes, but I'd like to see a response from
> >>  Jens first.
> >>  My preference would be to get the original patch in, then other changes
> >>  that build on it, such as this one, can be added.  Until the core
> >>  change lands, any other work is pointless.
> >>
> >>  Of course if Jens wants this merged before he'll apply it, I'll
> >>  happily do that.
> > 
> > I like the change, and thanks for tackling this. It's been a pending
> > issue for way too long. I do think we should squash Jack's patch
> > into the original, as it does clean up the code nicely.
> > 
> > Do we have a proper test case for this, so we can verify that it
> > does indeed also work in practice?
> > 
> Hi Jens,
> 
> I can trigger a deadlock in RAID1 with the test below:
>
> I create one md with one local loop device and one remote scsi
> device exported by SRP, run fio with mixed rw on top of the md, then
> force_close the session on the storage side. mdx_raid1 waits on
> free_array in D state, and a lot of fio processes are also in D state
> in wait_barrier.
>
> With the patch from Neil above, I can no longer trigger it.
>
> The discussion was in the link below:
> http://www.spinics.net/lists/raid/msg54680.html

In addition to Jack's MD raid test there is a DM snapshot deadlock test,
albeit unpolished and in need of work to get running, see:
https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

But to actually test block core's ability to handle this, upstream
commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 ("dm: flush queued bios
when process blocks to avoid deadlock") would need to be reverted.

Also, I know Lars had a drbd deadlock too.  Not sure if Jack's MD test
gives sufficient coverage for drbd.  Lars?


* Re: blk: improve order of bio handling in generic_make_request()
  2017-03-07 16:52         ` Mike Snitzer
@ 2017-03-07 17:05           ` Jens Axboe
  2017-03-07 17:14             ` Mike Snitzer
  2017-03-08 11:46           ` Lars Ellenberg
  1 sibling, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-03-07 17:05 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jack Wang, NeilBrown, LKML, Lars Ellenberg, Kent Overstreet,
	Pavel Machek, Mikulas Patocka

On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> On Tue, Mar 07 2017 at  3:49am -0500,
> Jack Wang <jinpu.wang@profitbricks.com> wrote:
> 
>>
>>
>> On 06.03.2017 21:18, Jens Axboe wrote:
>>> On 03/05/2017 09:40 PM, NeilBrown wrote:
>>>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>>>
>>>>> Thanks Neil for pushing the fix.
>>>>>
>>>>> We can optimize generic_make_request a little bit:
>>>>> - assign the bio_list struct hold directly instead of init and merge
>>>>> - remove duplicate code
>>>>>
>>>>> I think it's better to squash this into your fix.
>>>>
>>>> Hi Jack,
>>>>  I don't object to your changes, but I'd like to see a response from
>>>>  Jens first.
>>>>  My preference would be to get the original patch in, then other changes
>>>>  that build on it, such as this one, can be added.  Until the core
>>>>  change lands, any other work is pointless.
>>>>
>>>>  Of course if Jens wants this merged before he'll apply it, I'll
>>>>  happily do that.
>>>
>>> I like the change, and thanks for tackling this. It's been a pending
>>> issue for way too long. I do think we should squash Jack's patch
>>> into the original, as it does clean up the code nicely.
>>>
>>> Do we have a proper test case for this, so we can verify that it
>>> does indeed also work in practice?
>>>
>> Hi Jens,
>>
>> I can trigger a deadlock in RAID1 with the test below:
>>
>> I create one md with one local loop device and one remote scsi
>> device exported by SRP, run fio with mixed rw on top of the md, then
>> force_close the session on the storage side. mdx_raid1 waits on
>> free_array in D state, and a lot of fio processes are also in D state
>> in wait_barrier.
>>
>> With the patch from Neil above, I can no longer trigger it.
>>
>> The discussion was in the link below:
>> http://www.spinics.net/lists/raid/msg54680.html
> 
> In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> albeit unpolished and in need of work to get running, see:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

Can you run this patch with that test, reverting your DM workaround?

-- 
Jens Axboe


* Re: blk: improve order of bio handling in generic_make_request()
  2017-03-07 17:05           ` Jens Axboe
@ 2017-03-07 17:14             ` Mike Snitzer
  2017-03-07 20:29               ` NeilBrown
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Snitzer @ 2017-03-07 17:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jack Wang, NeilBrown, LKML, Lars Ellenberg, Kent Overstreet,
	Pavel Machek, Mikulas Patocka

On Tue, Mar 07 2017 at 12:05pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> > On Tue, Mar 07 2017 at  3:49am -0500,
> > Jack Wang <jinpu.wang@profitbricks.com> wrote:
> > 
> >>
> >>
> >> On 06.03.2017 21:18, Jens Axboe wrote:
> >>> On 03/05/2017 09:40 PM, NeilBrown wrote:
> >>>> On Fri, Mar 03 2017, Jack Wang wrote:
> >>>>>
> >>>>> Thanks Neil for pushing the fix.
> >>>>>
> >>>>> We can optimize generic_make_request a little bit:
> >>>>> - assign the bio_list struct hold directly instead of init and merge
> >>>>> - remove duplicate code
> >>>>>
> >>>>> I think it's better to squash this into your fix.
> >>>>
> >>>> Hi Jack,
> >>>>  I don't object to your changes, but I'd like to see a response from
> >>>>  Jens first.
> >>>>  My preference would be to get the original patch in, then other changes
> >>>>  that build on it, such as this one, can be added.  Until the core
> >>>>  change lands, any other work is pointless.
> >>>>
> >>>>  Of course if Jens wants this merged before he'll apply it, I'll
> >>>>  happily do that.
> >>>
> >>> I like the change, and thanks for tackling this. It's been a pending
> >>> issue for way too long. I do think we should squash Jack's patch
> >>> into the original, as it does clean up the code nicely.
> >>>
> >>> Do we have a proper test case for this, so we can verify that it
> >>> does indeed also work in practice?
> >>>
> >> Hi Jens,
> >>
> >> I can trigger a deadlock in RAID1 with the test below:
> >>
> >> I create one md with one local loop device and one remote scsi
> >> device exported by SRP, run fio with mixed rw on top of the md, then
> >> force_close the session on the storage side. mdx_raid1 waits on
> >> free_array in D state, and a lot of fio processes are also in D state
> >> in wait_barrier.
> >>
> >> With the patch from Neil above, I can no longer trigger it.
> >>
> >> The discussion was in the link below:
> >> http://www.spinics.net/lists/raid/msg54680.html
> > 
> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> > albeit unpolished and in need of work to get running, see:
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> 
> Can you run this patch with that test, reverting your DM workaround?

Yeap, will do.  Last time Mikulas tried a similar patch it still
deadlocked.  But I'll give it a go (likely tomorrow).


* Re: blk: improve order of bio handling in generic_make_request()
  2017-03-07 17:14             ` Mike Snitzer
@ 2017-03-07 20:29               ` NeilBrown
  2017-03-07 23:01                 ` Mike Snitzer
  2017-03-08 16:40                 ` Mikulas Patocka
  0 siblings, 2 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-07 20:29 UTC (permalink / raw)
  To: Mike Snitzer, Jens Axboe
  Cc: Jack Wang, LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mikulas Patocka


On Tue, Mar 07 2017, Mike Snitzer wrote:

> On Tue, Mar 07 2017 at 12:05pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
>> > On Tue, Mar 07 2017 at  3:49am -0500,
>> > Jack Wang <jinpu.wang@profitbricks.com> wrote:
>> > 
>> >>
>> >>
>> >> On 06.03.2017 21:18, Jens Axboe wrote:
>> >>> On 03/05/2017 09:40 PM, NeilBrown wrote:
>> >>>> On Fri, Mar 03 2017, Jack Wang wrote:
>> >>>>>
>> >>>>> Thanks Neil for pushing the fix.
>> >>>>>
>> >>>>> We can optimize generic_make_request a little bit:
>> >>>>> - assign the bio_list struct hold directly instead of init and merge
>> >>>>> - remove duplicate code
>> >>>>>
>> >>>>> I think it's better to squash this into your fix.
>> >>>>
>> >>>> Hi Jack,
>> >>>>  I don't object to your changes, but I'd like to see a response from
>> >>>>  Jens first.
>> >>>>  My preference would be to get the original patch in, then other changes
>> >>>>  that build on it, such as this one, can be added.  Until the core
>> >>>>  change lands, any other work is pointless.
>> >>>>
>> >>>>  Of course if Jens wants this merged before he'll apply it, I'll
>> >>>>  happily do that.
>> >>>
>> >>> I like the change, and thanks for tackling this. It's been a pending
>> >>> issue for way too long. I do think we should squash Jack's patch
>> >>> into the original, as it does clean up the code nicely.
>> >>>
>> >>> Do we have a proper test case for this, so we can verify that it
>> >>> does indeed also work in practice?
>> >>>
>> >> Hi Jens,
>> >>
>> >> I can trigger a deadlock in RAID1 with the test below:
>> >>
>> >> I create one md with one local loop device and one remote scsi
>> >> device exported by SRP, run fio with mixed rw on top of the md, then
>> >> force_close the session on the storage side. mdx_raid1 waits on
>> >> free_array in D state, and a lot of fio processes are also in D state
>> >> in wait_barrier.
>> >>
>> >> With the patch from Neil above, I can no longer trigger it.
>> >>
>> >> The discussion was in the link below:
>> >> http://www.spinics.net/lists/raid/msg54680.html
>> > 
>> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
>> > albeit unpolished and in need of work to get running, see:
>> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>> 
>> Can you run this patch with that test, reverting your DM workaround?
>
> Yeap, will do.  Last time Mikulas tried a similar patch it still
> deadlocked.  But I'll give it a go (likely tomorrow).

I don't think this will fix the DM snapshot deadlock by itself.
Rather, it makes it possible for some internal changes to DM to fix it.
The DM change might be something vaguely like:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664f3..06ee0960e415 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)

 	len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);

+	if (len < ci->sector_count) {
+		struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+		generic_make_request(bio);
+		bio = split;
+		ci->sector_count = len;
+	}
+
 	r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
 	if (r < 0)
 		return r;

Instead of looping inside DM, this change causes the remainder to be
passed to generic_make_request() and DM only handles one region at a
time.  So there is only one loop, in the top generic_make_request().
That loop will not reliable handle bios in the "right" order.

Thanks,
NeilBrown



* [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-06 20:18     ` Jens Axboe
@ 2017-03-07 20:38         ` NeilBrown
  1 sibling, 0 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-07 20:38 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Lars Ellenberg, Kent Overstreet




To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling.  They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially.  If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete.  This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device.  Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order.  That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent.  That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue.  However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn.  After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, is not enough to remove all deadlocks.  It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request.  This includes never
allocating from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return.  The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off.  If it splits again, the same process happens.  In
each case one bio will be completely handled before the next one is attempted.

With this in place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

Changes since v1:
 - merge code improvements from Jack Wang
 - more edits to changelog comment
 - add Ref: link.
 - Add some lists to Cc, that should have been there the first time.
 


diff --git a/block/blk-core.c b/block/blk-core.c
index b9e857f4afe8..9520e82aa78c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2018,17 +2018,34 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list hold;
+			struct bio_list lower, same;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			hold = bio_list_on_stack;
+			bio_list_init(&bio_list_on_stack);
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
 
-			bio = bio_list_pop(current->bio_list);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(current->bio_list);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 
-- 
2.12.0







* Re: blk: improve order of bio handling in generic_make_request()
  2017-03-07 20:29               ` NeilBrown
@ 2017-03-07 23:01                 ` Mike Snitzer
  2017-03-08 16:40                 ` Mikulas Patocka
  1 sibling, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2017-03-07 23:01 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, Jack Wang, LKML, Lars Ellenberg, Kent Overstreet,
	Pavel Machek, Mikulas Patocka

On Tue, Mar 07 2017 at  3:29pm -0500,
NeilBrown <neilb@suse.com> wrote:

> On Tue, Mar 07 2017, Mike Snitzer wrote:
> 
> > On Tue, Mar 07 2017 at 12:05pm -0500,
> > Jens Axboe <axboe@kernel.dk> wrote:
> >
> >> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> >> > 
> >> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> >> > albeit unpolished/needy to get running, see:
> >> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> >> 
> >> Can you run this patch with that test, reverting your DM workaround?
> >
> > Yeap, will do.  Last time Mikulas tried a similar patch it still
> > deadlocked.  But I'll give it a go (likely tomorrow).
> 
> I don't think this will fix the DM snapshot deadlock by itself.
> Rather, it makes it possible for some internal changes to DM to fix it.
> The DM change might be something vaguely like:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664f3..06ee0960e415 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
> 
>  	len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
> 
> +	if (len < ci->sector_count) {
> +		struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
> +		bio_chain(split, bio);
> +		generic_make_request(bio);
> +		bio = split;
> +		ci->sector_count = len;
> +	}
> +
>  	r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
>  	if (r < 0)
>  		return r;
> 
> Instead of looping inside DM, this change causes the remainder to be
> passed to generic_make_request() and DM only handles or region at a
> time.  So there is only one loop, in the top generic_make_request().
> That loop will not reliable handle bios in the "right" order.

s/not reliable/now reliably/ ? ;)

But thanks for the suggestion, Neil.  Will dig in once I get through a
backlog of other DM target code I have queued for 4.12 review.

Mike


* Re: blk: improve order of bio handling in generic_make_request()
  2017-03-07 16:52         ` Mike Snitzer
  2017-03-07 17:05           ` Jens Axboe
@ 2017-03-08 11:46           ` Lars Ellenberg
  1 sibling, 0 replies; 37+ messages in thread
From: Lars Ellenberg @ 2017-03-08 11:46 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Jack Wang, NeilBrown, LKML, Kent Overstreet,
	Pavel Machek, Mikulas Patocka

On 7 March 2017 at 17:52, Mike Snitzer <snitzer@redhat.com> wrote:

> > On 06.03.2017 21:18, Jens Axboe wrote:
> > > I like the change, and thanks for tackling this. It's been a pending
> > > issue for way too long. I do think we should squash Jack's patch
> > > into the original, as it does clean up the code nicely.
> > >
> > > Do we have a proper test case for this, so we can verify that it
> > > does indeed also work in practice?
> > >
> > Hi Jens,
> >
> > I can trigger a deadlock in RAID1 with the test below:
> >
> > I create one md with one local loop device and one remote scsi
> > device exported by SRP, run fio with mixed rw on top of the md, then
> > force_close the session on the storage side. mdx_raid1 waits on
> > free_array in D state, and a lot of fio processes are also in D state
> > in wait_barrier.
> >
> > With the patch from Neil above, I can no longer trigger it.
> >
> > The discussion was in the link below:
> > http://www.spinics.net/lists/raid/msg54680.html
>
> In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> albeit unpolished and in need of work to get running, see:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>
> But to actually test block core's ability to handle this, upstream
> commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 ("dm: flush queued bios
> when process blocks to avoid deadlock") would need to be reverted.
>
> Also, I know Lars had a drbd deadlock too.  Not sure if Jack's MD test
> gives sufficient coverage for drbd.  Lars?
>

As this is just a slightly different implementation, trading some bytes
of stack for more local, self-contained, "obvious" code changes (good
job!), but following the same basic idea as my original RFC [*] (see the
"Inspired-by" tag), I have no doubt it fixes the issues we are able to
provoke with DRBD.
[*] https://lkml.org/lkml/2016/7/19/263
(where I also already suggested fixing the device-mapper issues by
losing the in-device-mapper loop and relying on the loop in
generic_make_request())

Cheers,
    Lars


* Re: blk: improve order of bio handling in generic_make_request()
  2017-03-07 20:29               ` NeilBrown
  2017-03-07 23:01                 ` Mike Snitzer
@ 2017-03-08 16:40                 ` Mikulas Patocka
  2017-03-08 17:15                   ` Lars Ellenberg
  2017-03-09  6:08                   ` NeilBrown
  1 sibling, 2 replies; 37+ messages in thread
From: Mikulas Patocka @ 2017-03-08 16:40 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mike Snitzer, Jens Axboe, Jack Wang, LKML, Lars Ellenberg,
	Kent Overstreet, Pavel Machek



On Wed, 8 Mar 2017, NeilBrown wrote:

> On Tue, Mar 07 2017, Mike Snitzer wrote:
> 
> > On Tue, Mar 07 2017 at 12:05pm -0500,
> > Jens Axboe <axboe@kernel.dk> wrote:
> >
> >> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> >> > On Tue, Mar 07 2017 at  3:49am -0500,
> >> > Jack Wang <jinpu.wang@profitbricks.com> wrote:
> >> > 
> >> >>
> >> >>
> >> >> On 06.03.2017 21:18, Jens Axboe wrote:
> >> >>> On 03/05/2017 09:40 PM, NeilBrown wrote:
> >> >>>> On Fri, Mar 03 2017, Jack Wang wrote:
> >> >>>>>
> >> >>>>> Thanks Neil for pushing the fix.
> >> >>>>>
> >> >>>>> We can optimize generic_make_request a little bit:
> >> >>>>> - assign the bio_list struct directly instead of init and merge
> >> >>>>> - remove duplicate code
> >> >>>>>
> >> >>>>> I think better to squash into your fix.
> >> >>>>
> >> >>>> Hi Jack,
> >> >>>>  I don't object to your changes, but I'd like to see a response from
> >> >>>>  Jens first.
> >> >>>>  My preference would be to get the original patch in, then other changes
> >> >>>>  that build on it, such as this one, can be added.  Until the core
> >> >>>>  change lands, any other work is pointless.
> >> >>>>
> >> >>>>  Of course if Jens wants this merged before he'll apply it, I'll
> >> >>>>  happily do that.
> >> >>>
> >> >>> I like the change, and thanks for tackling this. It's been a pending
> >> >>> issue for way too long. I do think we should squash Jack's patch
> >> >>> into the original, as it does clean up the code nicely.
> >> >>>
> >> >>> Do we have a proper test case for this, so we can verify that it
> >> >>> does indeed also work in practice?
> >> >>>
> >> >> Hi Jens,
> >> >>
> >> >> I can trigger a deadlock in RAID1 with the test below:
> >> >>
> >> >> I create one md with one local loop device and one remote scsi device
> >> >> exported by SRP. Running fio with mixed rw on top of the md, I
> >> >> force_close the session on the storage side. mdx_raid1 then waits on
> >> >> free_array in D state, and a lot of fio processes are also in D state
> >> >> in wait_barrier.
> >> >>
> >> >> With the patch from Neil above, I can no longer trigger it.
> >> >>
> >> >> The discussion is in the link below:
> >> >> http://www.spinics.net/lists/raid/msg54680.html
> >> > 
> >> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> >> > albeit unpolished and fiddly to get running, see:
> >> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> >> 
> >> Can you run this patch with that test, reverting your DM workaround?
> >
> > Yeap, will do.  Last time Mikulas tried a similar patch it still
> > deadlocked.  But I'll give it a go (likely tomorrow).
> 
> I don't think this will fix the DM snapshot deadlock by itself.
> Rather, it makes it possible for some internal changes to DM to fix it.
> The DM change might be something vaguely like:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664f3..06ee0960e415 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
> 
>  	len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
> 
> +	if (len < ci->sector_count) {
> +		struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);

fs_bio_set is a shared bio set, so it is prone to deadlocks. For this 
change, we would need two bio sets per dm device, one for the split bio 
and one for the outgoing bio. (this also means having one more kernel 
thread per dm device)
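
Roughly (an untested sketch; split_bs and clone_bs are invented field
names, not existing struct mapped_device members):

	md->split_bs = bioset_create(BIO_POOL_SIZE, 0); /* for bio_split() */
	md->clone_bs = bioset_create(BIO_POOL_SIZE, 0); /* for outgoing clones */
	...
	struct bio *split = bio_split(bio, len, GFP_NOIO, md->split_bs);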

It would be possible to avoid having two bio sets if the incoming bio were 
the same as the outgoing bio (allocate a small structure, move bi_end_io 
and bi_private into it, replace bi_end_io and bi_private with pointers to 
device mapper and send the bio to the target driver), but it would need 
much more changes - basically rewrite the whole bio handling code in dm.c 
and in the targets.
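
The saved state would be something like (again untested; the structure
and dm_hook_end_io are invented names):

	struct dm_bio_hook {
		bio_end_io_t	*saved_end_io;
		void		*saved_private;
	};

	hook->saved_end_io = bio->bi_end_io;
	hook->saved_private = bio->bi_private;
	bio->bi_end_io = dm_hook_end_io;	/* points back into dm core */
	bio->bi_private = hook;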

Mikulas

> +		bio_chain(split, bio);
> +		generic_make_request(bio);
> +		bio = split;
> +		ci->sector_count = len;
> +	}
> +
>  	r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
>  	if (r < 0)
>  		return r;
> 
> Instead of looping inside DM, this change causes the remainder to be
> passed to generic_make_request(), and DM only handles one region at a
> time.  So there is only one loop, in the top generic_make_request().
> That loop will now reliably handle bios in the "right" order.
> 
> Thanks,
> NeilBrown
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: blk: improve order of bio handling in generic_make_request()
  2017-03-08 16:40                 ` Mikulas Patocka
@ 2017-03-08 17:15                   ` Lars Ellenberg
  2017-03-09  6:08                   ` NeilBrown
  1 sibling, 0 replies; 37+ messages in thread
From: Lars Ellenberg @ 2017-03-08 17:15 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: NeilBrown, Mike Snitzer, Jens Axboe, Jack Wang, LKML,
	Kent Overstreet, Pavel Machek

On 8 March 2017 at 17:40, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> On Wed, 8 Mar 2017, NeilBrown wrote:
> > I don't think this will fix the DM snapshot deadlock by itself.
> > Rather, it makes it possible for some internal changes to DM to fix it.
> > The DM change might be something vaguely like:
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 3086da5664f3..06ee0960e415 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
> >
> >       len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
> >
> > +     if (len < ci->sector_count) {
> > +             struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
>
> fs_bio_set is a shared bio set, so it is prone to deadlocks. For this
> change, we would need two bio sets per dm device, one for the split bio
> and one for the outgoing bio. (this also means having one more kernel
> thread per dm device)
>
> It would be possible to avoid having two bio sets if the incoming bio were
> the same as the outgoing bio (allocate a small structure, move bi_end_io
> and bi_private into it, replace bi_end_io and bi_private with pointers to
> device mapper and send the bio to the target driver), but it would need
> much more changes - basically rewrite the whole bio handling code in dm.c
> and in the targets.
>
> Mikulas

"back then" (see previously posted link into ML archive)
I suggested this:

...

A bit of conflict here may be that DM has all its own
split and clone and queue magic, and wants to process
"all of the bio" before returning back to generic_make_request().

To change that, __split_and_process_bio() and all its helpers
would need to learn to "push back" (pieces of) the bio they are
currently working on, and not push back via "DM_ENDIO_REQUEUE",
but by bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).

Then, after they processed each piece,
*return* all the way up to the top-level generic_make_request(),
where the recursion-to-iteration logic would then
make sure that all deeper level bios, submitted via
recursive calls to generic_make_request() will be processed, before the
next, pushed back, piece of the "original incoming" bio.

And *not* do their own iteration over all pieces first.

Probably not as easy as dropping the while loop,
using bio_advance, and pushing that "advanced" bio back to
current->...queue?

static void __split_and_process_bio(struct mapped_device *md,
				    struct dm_table *map, struct bio *bio)
...
	ci.bio = bio;
	ci.sector_count = bio_sectors(bio);
	while (ci.sector_count && !error)
		error = __split_and_process_non_flush(&ci);
...

would become something like:

...
	error = __split_and_process_non_flush(&ci);
	if (ci.sector_count) {
		bio_advance(bio, ...);	/* skip the part just handled */
		bio_list_add_head(&current->bio_lists->queue, bio);
	}
...

Something like that, maybe?


Needs to be adapted to this new and improved recursion-to-iteration
logic, obviously.  Would that be doable, or does device-mapper for some
reason really *need* its own iteration loop (which, because it is called
from generic_make_request(), won't be able to ever submit anything to
any device, ever, so needs all these helper threads just in case).

    Lars

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: blk: improve order of bio handling in generic_make_request()
  2017-03-08 16:40                 ` Mikulas Patocka
  2017-03-08 17:15                   ` Lars Ellenberg
@ 2017-03-09  6:08                   ` NeilBrown
  1 sibling, 0 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-09  6:08 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, Jens Axboe, Jack Wang, LKML, Lars Ellenberg,
	Kent Overstreet, Pavel Machek

[-- Attachment #1: Type: text/plain, Size: 1178 bytes --]

On Wed, Mar 08 2017, Mikulas Patocka wrote:

> On Wed, 8 Mar 2017, NeilBrown wrote:
>> 
>> I don't think this will fix the DM snapshot deadlock by itself.
>> Rather, it makes it possible for some internal changes to DM to fix it.
>> The DM change might be something vaguely like:
>> 
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index 3086da5664f3..06ee0960e415 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
>> 
>>  	len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
>> 
>> +	if (len < ci->sector_count) {
>> +		struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
>
> fs_bio_set is a shared bio set, so it is prone to deadlocks. For this 
> change, we would need two bio sets per dm device, one for the split bio 
> and one for the outgoing bio. (this also means having one more kernel 
> thread per dm device)

Yes, two local bio_sets would be best.
But we don't really need those extra kernel threads.  I'll start working
on patches to make them optional, and then to start removing them.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-07 20:38         ` NeilBrown
  (?)
@ 2017-03-10  4:32         ` NeilBrown
  2017-03-10  4:33           ` [PATCH 1/5 v3] " NeilBrown
                             ` (5 more replies)
  -1 siblings, 6 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-10  4:32 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

[-- Attachment #1: Type: text/plain, Size: 2343 bytes --]


I started looking further at the improvements we can make once
generic_make_request is fixed, and realised that I had missed an
important detail in this patch.
Several places test current->bio_list, and two actually edit the list.
With this change, that code cannot see the whole list, so it could
cause a regression.

So I've revised the patch to make sure that the entire list of queued
bios remains visible, and change the relevant code to look at both
the new list and the old list.

Following that there are some patches which make the rescuer thread
optional, and then starts removing it from some easy-to-fix places.

The series summary is below.

NeilBrown


NeilBrown (5):
      blk: improve order of bio handling in generic_make_request()
      blk: remove bio_set arg from blk_queue_split()
      blk: make the bioset rescue_workqueue optional.
      blk: use non-rescuing bioset for q->bio_split.
      block_dev: make blkdev_dio_pool a non-rescuing bioset


 block/bio.c                         |   39 +++++++++++++++++++++++++++------
 block/blk-core.c                    |   42 ++++++++++++++++++++++++++++-------
 block/blk-merge.c                   |    7 +++---
 block/blk-mq.c                      |    4 ++-
 drivers/block/drbd/drbd_main.c      |    2 +-
 drivers/block/drbd/drbd_req.c       |    2 +-
 drivers/block/pktcdvd.c             |    2 +-
 drivers/block/ps3vram.c             |    2 +-
 drivers/block/rsxx/dev.c            |    2 +-
 drivers/block/umem.c                |    2 +-
 drivers/block/zram/zram_drv.c       |    2 +-
 drivers/lightnvm/rrpc.c             |    2 +-
 drivers/md/bcache/super.c           |    4 ++-
 drivers/md/dm-crypt.c               |    2 +-
 drivers/md/dm-io.c                  |    2 +-
 drivers/md/dm.c                     |   32 +++++++++++++++------------
 drivers/md/md.c                     |    4 ++-
 drivers/md/raid10.c                 |    3 ++-
 drivers/md/raid5-cache.c            |    2 +-
 drivers/s390/block/dcssblk.c        |    2 +-
 drivers/s390/block/xpram.c          |    2 +-
 drivers/target/target_core_iblock.c |    2 +-
 fs/btrfs/extent_io.c                |    4 ++-
 fs/xfs/xfs_super.c                  |    2 +-
 include/linux/bio.h                 |    2 ++
 include/linux/blkdev.h              |    3 +--
 26 files changed, 114 insertions(+), 60 deletions(-)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/5 v3] blk: improve order of bio handling in generic_make_request()
  2017-03-10  4:32         ` NeilBrown
@ 2017-03-10  4:33           ` NeilBrown
  2017-03-10  4:34           ` [PATCH 2/5] blk: remove bio_set arg from blk_queue_split() NeilBrown
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-10  4:33 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

[-- Attachment #1: Type: text/plain, Size: 9493 bytes --]


To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling.  They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially.  If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete.  This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device.  Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order.  That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent.  That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue.  However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn.  After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, is not enough to remove all deadlocks.  It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request.  This includes never
allocating from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return.  The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off.  If it splits again, the same process happens.  In
each case one bio will be completely handled before the next one is attempted.
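
As a sketch of that pattern (handle_one_part() and max_sectors are
stand-ins for driver specifics, for illustration only):

	if (bio_sectors(bio) > max_sectors) {
		struct bio *split = bio_split(bio, max_sectors,
					      GFP_NOIO, q->bio_split);
		bio_chain(split, bio);
		generic_make_request(bio);	/* queue the remainder */
		bio = split;			/* handle only the head */
	}
	handle_one_part(q, bio);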

With this in place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.

Note that as some drivers look inside the bio_list, sometimes to punt
some bios to rescuer threads, we need to make both the pending list and the
new list visible.  So current->bio_list is now an array of 2 lists,
and relevant code examines both of them.

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c         |   11 ++++++++---
 block/blk-core.c    |   40 ++++++++++++++++++++++++++++++++--------
 drivers/md/dm.c     |   29 ++++++++++++++++-------------
 drivers/md/raid10.c |    3 ++-
 4 files changed, 58 insertions(+), 25 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e08417f..84ae39f06f81 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -376,10 +376,13 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	bio_list_init(&punt);
 	bio_list_init(&nopunt);
 
-	while ((bio = bio_list_pop(current->bio_list)))
+	while ((bio = bio_list_pop(&current->bio_list[0])))
 		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_list[0] = nopunt;
 
-	*current->bio_list = nopunt;
+	while ((bio = bio_list_pop(&current->bio_list[1])))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_list[1] = nopunt;
 
 	spin_lock(&bs->rescue_lock);
 	bio_list_merge(&bs->rescue_list, &punt);
@@ -466,7 +469,9 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 * we retry with the original gfp_flags.
 		 */
 
-		if (current->bio_list && !bio_list_empty(current->bio_list))
+		if (current->bio_list &&
+		    (!bio_list_empty(&current->bio_list[0]) ||
+		     !bio_list_empty(&current->bio_list[1])))
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
diff --git a/block/blk-core.c b/block/blk-core.c
index 1086dac8724c..bd2cb4ba674e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	/*
+	 * bio_list_on_stack[0] contains bios submitted by the current
+	 * make_request_fn.
+	 * bio_list_on_stack[1] contains bios that were submitted before
+	 * the current make_request_fn, but that haven't been processed
+	 * yet.
+	 */
+	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -1992,7 +1999,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * should be added at the tail
 	 */
 	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+		bio_list_add(&current->bio_list[0], bio);
 		goto out;
 	}
 
@@ -2011,23 +2018,40 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
+	bio_list_init(&bio_list_on_stack[0]);
+	bio_list_init(&bio_list_on_stack[1]);
+	current->bio_list = bio_list_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list lower, same;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_on_stack[1] = bio_list_on_stack[0];
+			bio_list_init(&bio_list_on_stack[0]);
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
 
-			bio = bio_list_pop(current->bio_list);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack[0], &lower);
+			bio_list_merge(&bio_list_on_stack[0], &same);
+			bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(&bio_list_on_stack[0]);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f4ffd1eb8f44..dfb75979e455 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -989,26 +989,29 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
 	struct dm_offload *o = container_of(cb, struct dm_offload, cb);
 	struct bio_list list;
 	struct bio *bio;
+	int i;
 
 	INIT_LIST_HEAD(&o->cb.list);
 
 	if (unlikely(!current->bio_list))
 		return;
 
-	list = *current->bio_list;
-	bio_list_init(current->bio_list);
-
-	while ((bio = bio_list_pop(&list))) {
-		struct bio_set *bs = bio->bi_pool;
-		if (unlikely(!bs) || bs == fs_bio_set) {
-			bio_list_add(current->bio_list, bio);
-			continue;
+	for (i = 0; i < 2; i++) {
+		list = current->bio_list[i];
+		bio_list_init(&current->bio_list[i]);
+
+		while ((bio = bio_list_pop(&list))) {
+			struct bio_set *bs = bio->bi_pool;
+			if (unlikely(!bs) || bs == fs_bio_set) {
+				bio_list_add(&current->bio_list[i], bio);
+				continue;
+			}
+
+			spin_lock(&bs->rescue_lock);
+			bio_list_add(&bs->rescue_list, bio);
+			queue_work(bs->rescue_workqueue, &bs->rescue_work);
+			spin_unlock(&bs->rescue_lock);
 		}
-
-		spin_lock(&bs->rescue_lock);
-		bio_list_add(&bs->rescue_list, bio);
-		queue_work(bs->rescue_workqueue, &bs->rescue_work);
-		spin_unlock(&bs->rescue_lock);
 	}
 }
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 063c43d83b72..0536658c9d40 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -974,7 +974,8 @@ static void wait_barrier(struct r10conf *conf)
 				    !conf->barrier ||
 				    (atomic_read(&conf->nr_pending) &&
 				     current->bio_list &&
-				     !bio_list_empty(current->bio_list)),
+				     (!bio_list_empty(&current->bio_list[0]) ||
+				      !bio_list_empty(&current->bio_list[1]))),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 		if (!conf->nr_waiting)



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/5] blk: remove bio_set arg from blk_queue_split()
  2017-03-10  4:32         ` NeilBrown
  2017-03-10  4:33           ` [PATCH 1/5 v3] " NeilBrown
@ 2017-03-10  4:34           ` NeilBrown
  2017-03-10  4:35           ` [PATCH 3/5] blk: make the bioset rescue_workqueue optional NeilBrown
                             ` (3 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-10  4:34 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

[-- Attachment #1: Type: text/plain, Size: 8745 bytes --]


blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary.  Remove the last arg and always use
q->bio_split inside blk_queue_split().

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c              |    2 +-
 block/blk-merge.c             |    7 +++----
 block/blk-mq.c                |    4 ++--
 drivers/block/drbd/drbd_req.c |    2 +-
 drivers/block/pktcdvd.c       |    2 +-
 drivers/block/ps3vram.c       |    2 +-
 drivers/block/rsxx/dev.c      |    2 +-
 drivers/block/umem.c          |    2 +-
 drivers/block/zram/zram_drv.c |    2 +-
 drivers/lightnvm/rrpc.c       |    2 +-
 drivers/md/md.c               |    2 +-
 drivers/s390/block/dcssblk.c  |    2 +-
 drivers/s390/block/xpram.c    |    2 +-
 include/linux/blkdev.h        |    3 +--
 14 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bd2cb4ba674e..375006c94c15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1637,7 +1637,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 */
 	blk_queue_bounce(q, &bio);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
 		bio->bi_error = -EIO;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2afa262425d1..ce8838aff7f7 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -188,8 +188,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	return do_split ? new : NULL;
 }
 
-void blk_queue_split(struct request_queue *q, struct bio **bio,
-		     struct bio_set *bs)
+void blk_queue_split(struct request_queue *q, struct bio **bio)
 {
 	struct bio *split, *res;
 	unsigned nsegs;
@@ -197,14 +196,14 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 	switch (bio_op(*bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
-		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
+		split = blk_bio_discard_split(q, *bio, q->bio_split, &nsegs);
 		break;
 	case REQ_OP_WRITE_ZEROES:
 		split = NULL;
 		nsegs = (*bio)->bi_phys_segments;
 		break;
 	case REQ_OP_WRITE_SAME:
-		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
+		split = blk_bio_write_same_split(q, *bio, q->bio_split, &nsegs);
 		break;
 	default:
 		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b2fd175e84d7..ef7b289e873c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1502,7 +1502,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		return BLK_QC_T_NONE;
 	}
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (!is_flush_fua && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
@@ -1624,7 +1624,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return BLK_QC_T_NONE;
 	}
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (!is_flush_fua && !blk_queue_nomerges(q)) {
 		if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 652114ae1a8a..f6ed6f7f5ab2 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1554,7 +1554,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio)
 	struct drbd_device *device = (struct drbd_device *) q->queuedata;
 	unsigned long start_jif;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	start_jif = jiffies;
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 66d846ba85a9..98394d034c29 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2414,7 +2414,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_bounce(q, &bio);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	pd = q->queuedata;
 	if (!pd) {
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 456b4fe21559..48072c0c1010 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -606,7 +606,7 @@ static blk_qc_t ps3vram_make_request(struct request_queue *q, struct bio *bio)
 
 	dev_dbg(&dev->core, "%s\n", __func__);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	spin_lock_irq(&priv->lock);
 	busy = !bio_list_empty(&priv->list);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index f81d70b39d10..8c8024f616ae 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -151,7 +151,7 @@ static blk_qc_t rsxx_make_request(struct request_queue *q, struct bio *bio)
 	struct rsxx_bio_meta *bio_meta;
 	int st = -EINVAL;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	might_sleep();
 
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index c141cc3be22b..c8d8a2f16f8e 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -529,7 +529,7 @@ static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
 		 (unsigned long long)bio->bi_iter.bi_sector,
 		 bio->bi_iter.bi_size);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	spin_lock_irq(&card->lock);
 	*card->biotail = bio;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e27d89a36c34..973dac2a9a99 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -880,7 +880,7 @@ static blk_qc_t zram_make_request(struct request_queue *queue, struct bio *bio)
 {
 	struct zram *zram = queue->queuedata;
 
-	blk_queue_split(queue, &bio, queue->bio_split);
+	blk_queue_split(queue, &bio);
 
 	if (!valid_io_request(zram, bio->bi_iter.bi_sector,
 					bio->bi_iter.bi_size)) {
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index e00b1d7b976f..701fa2fafbb2 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -999,7 +999,7 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, struct bio *bio)
 	struct nvm_rq *rqd;
 	int err;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (bio_op(bio) == REQ_OP_DISCARD) {
 		rrpc_discard(rrpc, bio);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 548d1b8014f8..d7d2bb51a58d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -253,7 +253,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 	unsigned int sectors;
 	int cpu;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (mddev == NULL || mddev->pers == NULL) {
 		bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 415d10a67b7a..10ece6f3c7eb 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -829,7 +829,7 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long source_addr;
 	unsigned long bytes_done;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	bytes_done = 0;
 	dev_info = bio->bi_bdev->bd_disk->private_data;
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index b9d7e755c8a3..a48f0d40c1d2 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,7 +190,7 @@ static blk_qc_t xpram_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long page_addr;
 	unsigned long bytes;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if ((bio->bi_iter.bi_sector & 7) != 0 ||
 	    (bio->bi_iter.bi_size & 4095) != 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 796016e63c1d..63008a246403 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -934,8 +934,7 @@ extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern int blk_rq_append_bio(struct request *rq, struct bio *bio);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
-extern void blk_queue_split(struct request_queue *, struct bio **,
-			    struct bio_set *);
+extern void blk_queue_split(struct request_queue *, struct bio **);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 3/5] blk: make the bioset rescue_workqueue optional.
  2017-03-10  4:32         ` NeilBrown
  2017-03-10  4:33           ` [PATCH 1/5 v3] " NeilBrown
  2017-03-10  4:34           ` [PATCH 2/5] blk: remove bio_set arg from blk_queue_split() NeilBrown
@ 2017-03-10  4:35           ` NeilBrown
  2017-03-10  4:36           ` [PATCH 4/5] blk: use non-rescuing bioset for q->bio_split NeilBrown
                             ` (2 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-10  4:35 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

[-- Attachment #1: Type: text/plain, Size: 11485 bytes --]


This patch converts bioset_create() and bioset_create_nobvec()
to not create a workqueue, so allocations will never trigger
punt_bios_to_rescuer().
It also introduces bioset_create_rescued() and bioset_create_nobvec_rescued()
which preserve the old behaviour.

*All* callers of bioset_create() and bioset_create_nobvec() are
converted to the _rescued() version, so that no change in behaviour
is introduced.

It is hoped that most, if not all, biosets can end up being the
non-rescued version.
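
For a driver the choice then looks like this (illustrative):

	struct bio_set *bs;

	/* conservative: keep the rescuer thread (old behaviour) */
	bs = bioset_create_rescued(BIO_POOL_SIZE, 0);

	/* no rescuer: only safe once it is known that nothing waits on
	 * a second allocation from this bioset inside a make_request_fn */
	bs = bioset_create(BIO_POOL_SIZE, 0);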

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c                         |   30 +++++++++++++++++++++++++-----
 block/blk-core.c                    |    2 +-
 drivers/block/drbd/drbd_main.c      |    2 +-
 drivers/md/bcache/super.c           |    4 ++--
 drivers/md/dm-crypt.c               |    2 +-
 drivers/md/dm-io.c                  |    2 +-
 drivers/md/dm.c                     |    5 +++--
 drivers/md/md.c                     |    2 +-
 drivers/md/raid5-cache.c            |    2 +-
 drivers/target/target_core_iblock.c |    2 +-
 fs/block_dev.c                      |    2 +-
 fs/btrfs/extent_io.c                |    4 ++--
 fs/xfs/xfs_super.c                  |    2 +-
 include/linux/bio.h                 |    2 ++
 14 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 84ae39f06f81..06587f1119f5 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -362,6 +362,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	struct bio_list punt, nopunt;
 	struct bio *bio;
 
+	if (WARN_ON_ONCE(!bs->rescue_workqueue))
+		return;
 	/*
 	 * In order to guarantee forward progress we must punt only bios that
 	 * were allocated from this bio_set; otherwise, if there was a bio on
@@ -471,7 +473,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 
 		if (current->bio_list &&
 		    (!bio_list_empty(&current->bio_list[0]) ||
-		     !bio_list_empty(&current->bio_list[1])))
+		     !bio_list_empty(&current->bio_list[1])) &&
+		    bs->rescue_workqueue)
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
@@ -1940,7 +1943,8 @@ EXPORT_SYMBOL(bioset_free);
 
 static struct bio_set *__bioset_create(unsigned int pool_size,
 				       unsigned int front_pad,
-				       bool create_bvec_pool)
+				       bool create_bvec_pool,
+				       bool create_rescue_workqueue)
 {
 	unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
 	struct bio_set *bs;
@@ -1971,6 +1975,9 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
 			goto bad;
 	}
 
+	if (!create_rescue_workqueue)
+		return bs;
+
 	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
 	if (!bs->rescue_workqueue)
 		goto bad;
@@ -1996,10 +2003,16 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
  */
 struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 {
-	return __bioset_create(pool_size, front_pad, true);
+	return __bioset_create(pool_size, front_pad, true, false);
 }
 EXPORT_SYMBOL(bioset_create);
 
+struct bio_set *bioset_create_rescued(unsigned int pool_size, unsigned int front_pad)
+{
+	return __bioset_create(pool_size, front_pad, true, true);
+}
+EXPORT_SYMBOL(bioset_create_rescued);
+
 /**
  * bioset_create_nobvec  - Create a bio_set without bio_vec mempool
  * @pool_size:	Number of bio to cache in the mempool
@@ -2011,10 +2024,17 @@ EXPORT_SYMBOL(bioset_create);
  */
 struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_pad)
 {
-	return __bioset_create(pool_size, front_pad, false);
+	return __bioset_create(pool_size, front_pad, false, false);
 }
 EXPORT_SYMBOL(bioset_create_nobvec);
 
+struct bio_set *bioset_create_nobvec_rescued(unsigned int pool_size,
+					     unsigned int front_pad)
+{
+	return __bioset_create(pool_size, front_pad, false, true);
+}
+EXPORT_SYMBOL(bioset_create_nobvec_rescued);
+
 #ifdef CONFIG_BLK_CGROUP
 
 /**
@@ -2129,7 +2149,7 @@ static int __init init_bio(void)
 	bio_integrity_init();
 	biovec_init_slabs();
 
-	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
+	fs_bio_set = bioset_create_rescued(BIO_POOL_SIZE, 0);
 	if (!fs_bio_set)
 		panic("bio: can't allocate bios\n");
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 375006c94c15..c3992d17dc2c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -714,7 +714,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
-	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
 	if (!q->bio_split)
 		goto fail_id;
 
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cbd04ee..2c69c2ab0fff 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2166,7 +2166,7 @@ static int drbd_create_mempools(void)
 		goto Enomem;
 
 	/* mempools */
-	drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0);
+	drbd_md_io_bio_set = bioset_create_rescued(DRBD_MIN_POOL_PAGES, 0);
 	if (drbd_md_io_bio_set == NULL)
 		goto Enomem;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 85e3f21c2514..6cb30792f0ed 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -786,7 +786,7 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 
 	minor *= BCACHE_MINORS;
 
-	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+	if (!(d->bio_split = bioset_create_rescued(4, offsetof(struct bbio, bio))) ||
 	    !(d->disk = alloc_disk(BCACHE_MINORS))) {
 		ida_simple_remove(&bcache_minor, minor);
 		return -ENOMEM;
@@ -1520,7 +1520,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 				sizeof(struct bbio) + sizeof(struct bio_vec) *
 				bucket_pages(c))) ||
 	    !(c->fill_iter = mempool_create_kmalloc_pool(1, iter_size)) ||
-	    !(c->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+	    !(c->bio_split = bioset_create_rescued(4, offsetof(struct bbio, bio))) ||
 	    !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) ||
 	    !(c->moving_gc_wq = alloc_workqueue("bcache_gc",
 						WQ_MEM_RECLAIM, 0)) ||
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 389a3637ffcc..91a2d637d44f 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1936,7 +1936,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 	}
 
-	cc->bs = bioset_create(MIN_IOS, 0);
+	cc->bs = bioset_create_rescued(MIN_IOS, 0);
 	if (!cc->bs) {
 		ti->error = "Cannot allocate crypt bioset";
 		goto bad;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 03940bf36f6c..fe1241c196b1 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -58,7 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
 	if (!client->pool)
 		goto bad;
 
-	client->bios = bioset_create(min_ios, 0);
+	client->bios = bioset_create_rescued(min_ios, 0);
 	if (!client->bios)
 		goto bad;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..41b1f033841f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1002,7 +1002,8 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
 
 		while ((bio = bio_list_pop(&list))) {
 			struct bio_set *bs = bio->bi_pool;
-			if (unlikely(!bs) || bs == fs_bio_set) {
+			if (unlikely(!bs) || bs == fs_bio_set ||
+			    !bs->rescue_workqueue) {
 				bio_list_add(&current->bio_list[i], bio);
 				continue;
 			}
@@ -2577,7 +2578,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned t
 		BUG();
 	}
 
-	pools->bs = bioset_create_nobvec(pool_size, front_pad);
+	pools->bs = bioset_create_nobvec_rescued(pool_size, front_pad);
 	if (!pools->bs)
 		goto out;
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index d7d2bb51a58d..e5f08a195837 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5220,7 +5220,7 @@ int md_run(struct mddev *mddev)
 	}
 
 	if (mddev->bio_set == NULL) {
-		mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);
+		mddev->bio_set = bioset_create_rescued(BIO_POOL_SIZE, 0);
 		if (!mddev->bio_set)
 			return -ENOMEM;
 	}
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be01b10..c95c6c046395 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2831,7 +2831,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	if (!log->io_pool)
 		goto io_pool;
 
-	log->bs = bioset_create(R5L_POOL_SIZE, 0);
+	log->bs = bioset_create_rescued(R5L_POOL_SIZE, 0);
 	if (!log->bs)
 		goto io_bs;
 
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d316ed537d59..5bf3392195c6 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -93,7 +93,7 @@ static int iblock_configure_device(struct se_device *dev)
 		return -EINVAL;
 	}
 
-	ib_dev->ibd_bio_set = bioset_create(IBLOCK_BIO_POOL_SIZE, 0);
+	ib_dev->ibd_bio_set = bioset_create_rescued(IBLOCK_BIO_POOL_SIZE, 0);
 	if (!ib_dev->ibd_bio_set) {
 		pr_err("IBLOCK: Unable to create bioset\n");
 		goto out;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..c0ca5f0d0369 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,7 +436,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 static __init int blkdev_init(void)
 {
-	blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
+	blkdev_dio_pool = bioset_create_rescued(4, offsetof(struct blkdev_dio, bio));
 	if (!blkdev_dio_pool)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e81922a21c..34aa8893790a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -173,8 +173,8 @@ int __init extent_io_init(void)
 	if (!extent_buffer_cache)
 		goto free_state_cache;
 
-	btrfs_bioset = bioset_create(BIO_POOL_SIZE,
-				     offsetof(struct btrfs_io_bio, bio));
+	btrfs_bioset = bioset_create_rescued(BIO_POOL_SIZE,
+					     offsetof(struct btrfs_io_bio, bio));
 	if (!btrfs_bioset)
 		goto free_buffer_cache;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 890862f2447c..f4c4d6f41d91 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1756,7 +1756,7 @@ MODULE_ALIAS_FS("xfs");
 STATIC int __init
 xfs_init_zones(void)
 {
-	xfs_ioend_bioset = bioset_create(4 * MAX_BUF_PER_PAGE,
+	xfs_ioend_bioset = bioset_create_rescued(4 * MAX_BUF_PER_PAGE,
 			offsetof(struct xfs_ioend, io_inline_bio));
 	if (!xfs_ioend_bioset)
 		goto out;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..05730603fcf1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -379,7 +379,9 @@ static inline struct bio *bio_next_split(struct bio *bio, int sectors,
 }
 
 extern struct bio_set *bioset_create(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_rescued(unsigned int, unsigned int);
 extern struct bio_set *bioset_create_nobvec(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_nobvec_rescued(unsigned int, unsigned int);
 extern void bioset_free(struct bio_set *);
 extern mempool_t *biovec_create_pool(int pool_entries);
 



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/5] blk: use non-rescuing bioset for q->bio_split.
  2017-03-10  4:32         ` NeilBrown
                             ` (2 preceding siblings ...)
  2017-03-10  4:35           ` [PATCH 3/5] blk: make the bioset rescue_workqueue optional NeilBrown
@ 2017-03-10  4:36           ` NeilBrown
  2017-03-10  4:37           ` [PATCH 5/5] block_dev: make blkdev_dio_pool a non-rescuing bioset NeilBrown
  2017-03-10  4:38           ` [PATCH v2] blk: improve order of bio handling in generic_make_request() Jens Axboe
  5 siblings, 0 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-10  4:36 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

[-- Attachment #1: Type: text/plain, Size: 1967 bytes --]


A rescuing bioset is only useful if there might be bios from
that same bioset on the bio_list_on_stack queue at a time
when bio_alloc_bioset() is called.  This never applies to
q->bio_split.

Allocations from q->bio_split are only ever made from
blk_queue_split(), which is only ever called early in each of the
various make_request_fn()s.  The original bio (call this A)
is then passed to generic_make_request() and is placed on
the bio_list_on_stack queue, and the bio that was allocated
from q->bio_split (B) is processed.

The processing of this may cause other bios to be passed to
generic_make_request() or may even cause the bio B itself to
be passed, possibly after some prefix has been split off
(using some other bioset).

generic_make_request() now guarantees that all of these bios
(B and dependants) will be fully processed before the tail
of the original bio A gets handled.  None of these early bios
can possible trigger an allocation from the original
q->bio_split as they are either too small to require
splitting or (more likely) are destined for a different queue.

The next time that the original q->bio_split might be used
by this thread is when A is processed again, as it might
still be too big to handle directly.  By this time there
cannot be any other bios allocated from q->bio_split in the
generic_make_request() queue.  So no rescuing will ever be
needed.
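
An illustrative timeline (A = the original bio, B = the bio allocated
from q->bio_split):

	A arrives -> blk_queue_split() allocates B (the head of A) from
	q->bio_split; the remainder of A is requeued
	B is handled, and, depth-first, everything B generates
	only then is the remainder of A handled; it may split again,
	but the previous q->bio_split allocation is already finished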

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c3992d17dc2c..375006c94c15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -714,7 +714,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
-	q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
+	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
 	if (!q->bio_split)
 		goto fail_id;
 



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 5/5] block_dev: make blkdev_dio_pool a non-rescuing bioset
  2017-03-10  4:32         ` NeilBrown
                             ` (3 preceding siblings ...)
  2017-03-10  4:36           ` [PATCH 4/5] blk: use non-rescuing bioset for q->bio_split NeilBrown
@ 2017-03-10  4:37           ` NeilBrown
  2017-03-10  4:38           ` [PATCH v2] blk: improve order of bio handling in generic_make_request() Jens Axboe
  5 siblings, 0 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-10  4:37 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

[-- Attachment #1: Type: text/plain, Size: 727 bytes --]


Allocations from blkdev_dio_pool are never made under
generic_make_request, so this bioset does not need a rescuer thread.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/block_dev.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c0ca5f0d0369..2eca00ec4370 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,7 +436,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 static __init int blkdev_init(void)
 {
-	blkdev_dio_pool = bioset_create_rescued(4, offsetof(struct blkdev_dio, bio));
+	blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
 	if (!blkdev_dio_pool)
 		return -ENOMEM;
 	return 0;



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-10  4:32         ` NeilBrown
                             ` (4 preceding siblings ...)
  2017-03-10  4:37           ` [PATCH 5/5] block_dev: make blkdev_dio_pool a non-rescuing bioset NeilBrown
@ 2017-03-10  4:38           ` Jens Axboe
  2017-03-10  4:40             ` Jens Axboe
  2017-03-10  5:19             ` NeilBrown
  5 siblings, 2 replies; 37+ messages in thread
From: Jens Axboe @ 2017-03-10  4:38 UTC (permalink / raw)
  To: NeilBrown, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

On 03/09/2017 09:32 PM, NeilBrown wrote:
> 
> I started looking further at the improvements we can make once
> generic_make_request is fixed, and realised that I had missed an
> important detail in this patch.
> Several places test current->bio_list, and two actually edit the list.
> With this change, that code cannot see the whole list, so it could
> cause a regression.
> 
> So I've revised the patch to make sure that the entire list of queued
> bios remains visible, and change the relevant code to look at both
> the new list and the old list.
> 
> Following that there are some patches which make the rescuer thread
> optional, and then starts removing it from some easy-to-fix places.

Neil, note that the v2 patch is already in Linus' tree as of earlier
today. You need to rebase the series, and if we need fixups on
top of v2, then that should be done separately and with increased
urgency.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-10  4:38           ` [PATCH v2] blk: improve order of bio handling in generic_make_request() Jens Axboe
@ 2017-03-10  4:40             ` Jens Axboe
  2017-03-10  5:19             ` NeilBrown
  1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-03-10  4:40 UTC (permalink / raw)
  To: NeilBrown, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

On 03/09/2017 09:38 PM, Jens Axboe wrote:
> On 03/09/2017 09:32 PM, NeilBrown wrote:
>>
>> I started looking further at the improvements we can make once
>> generic_make_request is fixed, and realised that I had missed an
>> important detail in this patch.
>> Several places test current->bio_list, and two actually edit the list.
>> With this change, that code cannot see the whole list, so it could
>> cause a regression.
>>
>> So I've revised the patch to make sure that the entire list of queued
>> bios remains visible, and change the relevant code to look at both
>> the new list and the old list.
>>
>> Following that there are some patches which make the rescuer thread
>> optional, and then starts removing it from some easy-to-fix places.
> 
> Neil, note that the v2 patch is already in Linus' tree as of earlier
> today. You need to rebase the series, and if we need fixups on
> top of v2, then that should be done separately and with increased
> urgency.

Additionally, at least the first patch appears to be badly mangled.
The formatting is screwed up.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-10  4:38           ` [PATCH v2] blk: improve order of bio handling in generic_make_request() Jens Axboe
  2017-03-10  4:40             ` Jens Axboe
@ 2017-03-10  5:19             ` NeilBrown
  2017-03-10 12:34               ` Lars Ellenberg
  1 sibling, 1 reply; 37+ messages in thread
From: NeilBrown @ 2017-03-10  5:19 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

[-- Attachment #1: Type: text/plain, Size: 1137 bytes --]

On Thu, Mar 09 2017, Jens Axboe wrote:

> On 03/09/2017 09:32 PM, NeilBrown wrote:
>> 
>> I started looking further at the improvements we can make once
>> generic_make_request is fixed, and realised that I had missed an
>> important detail in this patch.
>> Several places test current->bio_list, and two actually edit the list.
>> With this change, that code cannot see the whole list, so it could
>> cause a regression.
>> 
>> So I've revised the patch to make sure that the entire list of queued
>> bios remains visible, and change the relevant code to look at both
>> the new list and the old list.
>> 
>> Following that there are some patches which make the rescuer thread
>> optional, and then starts removing it from some easy-to-fix places.
>
> Neil, note that the v2 patch is already in Linus' tree as of earlier
> today. You need to rebase the series, and if we need fixups on
> top of v2, then that should be done separately and with increased
> urgency.

I had checked linux-next, but not the latest from Linus.
I see it now - thanks!
I'll rebase (and ensure nothing gets mangled).

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-10  5:19             ` NeilBrown
@ 2017-03-10 12:34               ` Lars Ellenberg
  2017-03-10 14:38                 ` Mike Snitzer
  2017-03-11  0:47                   ` NeilBrown
  0 siblings, 2 replies; 37+ messages in thread
From: Lars Ellenberg @ 2017-03-10 12:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, Jack Wang, LKML, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -       struct bio_list bio_list_on_stack;
> +       /*
> +        * bio_list_on_stack[0] contains bios submitted by the current
> +        * make_request_fn.
> +        * bio_list_on_stack[1] contains bios that were submitted before
> +        * the current make_request_fn, but that haven't been processed
> +        * yet.
> +        */
> +       struct bio_list bio_list_on_stack[2];
>         blk_qc_t ret = BLK_QC_T_NONE;

May I suggest that, if you intend to assign something that is no longer
a plain struct bio_list *, but a pointer into a struct bio_list[2],
you rename the task member (current->bio_list vs
current->bio_lists, plural, is what I did last year)?
Otherwise you will break external modules, silently, and horribly (or,
rather, they won't notice, but they will break the kernel).
Examples of such modules would be DRBD, ZFS, quite possibly others.
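
For example (hypothetical out-of-tree code; punt_my_bios() is made up):

	/* still compiles unchanged, because current->bio_list now points
	 * at the first of the two lists -- but it silently ignores
	 * current->bio_list[1] */
	if (current->bio_list && !bio_list_empty(current->bio_list))
		punt_my_bios();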

Thanks,

    Lars

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-10 12:34               ` Lars Ellenberg
@ 2017-03-10 14:38                 ` Mike Snitzer
  2017-03-10 14:55                   ` Mikulas Patocka
  2017-03-11  0:47                   ` NeilBrown
  1 sibling, 1 reply; 37+ messages in thread
From: Mike Snitzer @ 2017-03-10 14:38 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: NeilBrown, Jens Axboe, Jack Wang, LKML, Kent Overstreet,
	Pavel Machek, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block

On Fri, Mar 10 2017 at  7:34am -0500,
Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >   */
> >  blk_qc_t generic_make_request(struct bio *bio)
> >  {
> > -       struct bio_list bio_list_on_stack;
> > +       /*
> > +        * bio_list_on_stack[0] contains bios submitted by the current
> > +        * make_request_fn.
> > +        * bio_list_on_stack[1] contains bios that were submitted before
> > +        * the current make_request_fn, but that haven't been processed
> > +        * yet.
> > +        */
> > +       struct bio_list bio_list_on_stack[2];
> >         blk_qc_t ret = BLK_QC_T_NONE;
> 
> May I suggest that, if you intend to assign something that is no longer
> a plain struct bio_list *, but a pointer into a struct bio_list[2],
> you rename the task member (current->bio_list vs
> current->bio_lists, plural, is what I did last year)?
> Otherwise you will break external modules, silently, and horribly (or,
> rather, they won't notice, but they will break the kernel).
> Examples of such modules would be DRBD, ZFS, quite possibly others.

drbd is upstream -- so what is the problem?  (if you are having to
distribute drbd independent of the upstream drbd then why is drbd
upstream?)

As for ZFS, worrying about ZFS kABI breakage is the last thing we should
be doing.

So Nack from me on this defensive make-work for external modules.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-10 14:38                 ` Mike Snitzer
@ 2017-03-10 14:55                   ` Mikulas Patocka
  2017-03-10 15:07                     ` Jack Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Mikulas Patocka @ 2017-03-10 14:55 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Lars Ellenberg, NeilBrown, Jens Axboe, Jack Wang, LKML,
	Kent Overstreet, Pavel Machek, linux-raid,
	device-mapper development, linux-block



On Fri, 10 Mar 2017, Mike Snitzer wrote:

> On Fri, Mar 10 2017 at  7:34am -0500,
> Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> 
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> > >   */
> > >  blk_qc_t generic_make_request(struct bio *bio)
> > >  {
> > > -       struct bio_list bio_list_on_stack;
> > > +       /*
> > > +        * bio_list_on_stack[0] contains bios submitted by the current
> > > +        * make_request_fn.
> > > +        * bio_list_on_stack[1] contains bios that were submitted before
> > > +        * the current make_request_fn, but that haven't been processed
> > > +        * yet.
> > > +        */
> > > +       struct bio_list bio_list_on_stack[2];
> > >         blk_qc_t ret = BLK_QC_T_NONE;
> > 
> > May I suggest that, if you intend to assign something that is not a
> > plain &(struct bio_list), but a &(struct bio_list[2]),
> > you change the task member so it is renamed (current->bio_list vs
> > current->bio_lists, plural, is what I did last year).
> > Or you will break external modules, silently, and horribly (or,
> > rather, they won't notice, but break the kernel).
> > Examples of such modules would be DRBD, ZFS, quite possibly others.
> 
> drbd is upstream -- so what is the problem?  (if you are having to
> distribute drbd independent of the upstream drbd then why is drbd
> upstream?)
> 
> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
> be doing.

It's better to make external modules not compile than to silently 
introduce bugs in them. So yes, I would rename that.
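
Concretely (a hypothetical fragment, not quoted from any real module):

	/* This still compiles against the patched kernel, even though
	 * ->bio_list now points into a two-element array: */
	struct bio_list *tmp = current->bio_list;

	/* After a rename to ->bio_lists the stale code instead fails
	 * loudly at build time:
	 *     error: 'struct task_struct' has no member named 'bio_list'
	 */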

Mikulas

> So Nack from me on this defensive make-work for external modules.
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-10 14:55                   ` Mikulas Patocka
@ 2017-03-10 15:07                     ` Jack Wang
  2017-03-10 15:35                       ` Mike Snitzer
  2017-03-10 18:51                       ` Lars Ellenberg
  0 siblings, 2 replies; 37+ messages in thread
From: Jack Wang @ 2017-03-10 15:07 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer
  Cc: Lars Ellenberg, NeilBrown, Jens Axboe, LKML, Kent Overstreet,
	Pavel Machek, linux-raid, device-mapper development, linux-block



On 10.03.2017 15:55, Mikulas Patocka wrote:
> 
> 
> On Fri, 10 Mar 2017, Mike Snitzer wrote:
> 
>> On Fri, Mar 10 2017 at  7:34am -0500,
>> Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
>>
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>>>>   */
>>>>  blk_qc_t generic_make_request(struct bio *bio)
>>>>  {
>>>> -       struct bio_list bio_list_on_stack;
>>>> +       /*
>>>> +        * bio_list_on_stack[0] contains bios submitted by the current
>>>> +        * make_request_fn.
>>>> +        * bio_list_on_stack[1] contains bios that were submitted before
>>>> +        * the current make_request_fn, but that haven't been processed
>>>> +        * yet.
>>>> +        */
>>>> +       struct bio_list bio_list_on_stack[2];
>>>>         blk_qc_t ret = BLK_QC_T_NONE;
>>>
>>> May I suggest that, if you intend to assign something that is not a
>>> plain &(struct bio_list), but a &(struct bio_list[2]),
>>> you change the task member so it is renamed (current->bio_list vs
>>> current->bio_lists, plural, is what I did last year).
>>> Or you will break external modules, silently, and horribly (or,
>>> rather, they won't notice, but break the kernel).
>>> Examples of such modules would be DRBD, ZFS, quite possibly others.
>>
>> drbd is upstream -- so what is the problem?  (if you are having to
>> distribute drbd independent of the upstream drbd then why is drbd
>> upstream?)
>>
>> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
>> be doing.
> 
> It's better to make external modules not compile than to silently 
> introduce bugs in them. So yes, I would rename that.
> 
> Mikulas

Agree, better rename current->bio_list to current->bio_lists

Regards,
Jack

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-10 15:07                     ` Jack Wang
@ 2017-03-10 15:35                       ` Mike Snitzer
  2017-03-10 18:51                       ` Lars Ellenberg
  1 sibling, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2017-03-10 15:35 UTC (permalink / raw)
  To: Jack Wang
  Cc: Mikulas Patocka, Lars Ellenberg, NeilBrown, Jens Axboe, LKML,
	Kent Overstreet, Pavel Machek, linux-raid,
	device-mapper development, linux-block

On Fri, Mar 10 2017 at 10:07am -0500,
Jack Wang <jinpu.wang@profitbricks.com> wrote:

> 
> 
> On 10.03.2017 15:55, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 10 Mar 2017, Mike Snitzer wrote:
> > 
> >> On Fri, Mar 10 2017 at  7:34am -0500,
> >> Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> >>
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >>>>   */
> >>>>  blk_qc_t generic_make_request(struct bio *bio)
> >>>>  {
> >>>> -       struct bio_list bio_list_on_stack;
> >>>> +       /*
> >>>> +        * bio_list_on_stack[0] contains bios submitted by the current
> >>>> +        * make_request_fn.
> >>>> +        * bio_list_on_stack[1] contains bios that were submitted before
> >>>> +        * the current make_request_fn, but that haven't been processed
> >>>> +        * yet.
> >>>> +        */
> >>>> +       struct bio_list bio_list_on_stack[2];
> >>>>         blk_qc_t ret = BLK_QC_T_NONE;
> >>>
> >>> May I suggest that, if you intend to assign something that is not a
> >>> plain &(struct bio_list), but a &(struct bio_list[2]),
> >>> you change the task member so it is renamed (current->bio_list vs
> >>> current->bio_lists, plural, is what I did last year).
> >>> Or you will break external modules, silently, and horribly (or,
> >>> rather, they won't notice, but break the kernel).
> >>> Examples of such modules would be DRBD, ZFS, quite possibly others.
> >>
> >> drbd is upstream -- so what is the problem?  (if you are having to
> >> distribute drbd independent of the upstream drbd then why is drbd
> >> upstream?)
> >>
> >> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
> >> be doing.
> > 
> > It's better to make external modules not compile than to silently 
> > introduce bugs in them. So yes, I would rename that.
> > 
> > Mikulas
> 
> Agree, better rename current->bio_list to current->bio_lists

Fine; normally I wouldn't do so, but I'm not so opposed that we need to
get hung up on this detail.  If Neil and Jens agree then so be it.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-10 15:07                     ` Jack Wang
  2017-03-10 15:35                       ` Mike Snitzer
@ 2017-03-10 18:51                       ` Lars Ellenberg
  1 sibling, 0 replies; 37+ messages in thread
From: Lars Ellenberg @ 2017-03-10 18:51 UTC (permalink / raw)
  To: Jack Wang
  Cc: Mikulas Patocka, Mike Snitzer, NeilBrown, Jens Axboe, LKML,
	Kent Overstreet, Pavel Machek, linux-raid,
	device-mapper development, linux-block

On Fri, Mar 10, 2017 at 04:07:58PM +0100, Jack Wang wrote:
> On 10.03.2017 15:55, Mikulas Patocka wrote:
> > On Fri, 10 Mar 2017, Mike Snitzer wrote:
> >> On Fri, Mar 10 2017 at  7:34am -0500,
> >> Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> >>
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >>>>   */
> >>>>  blk_qc_t generic_make_request(struct bio *bio)
> >>>>  {
> >>>> -       struct bio_list bio_list_on_stack;
> >>>> +       /*
> >>>> +        * bio_list_on_stack[0] contains bios submitted by the current
> >>>> +        * make_request_fn.
> >>>> +        * bio_list_on_stack[1] contains bios that were submitted before
> >>>> +        * the current make_request_fn, but that haven't been processed
> >>>> +        * yet.
> >>>> +        */
> >>>> +       struct bio_list bio_list_on_stack[2];
> >>>>         blk_qc_t ret = BLK_QC_T_NONE;
> >>>
> >>> May I suggest that, if you intend to assign something that is not a
> >>> plain &(struct bio_list), but a &(struct bio_list[2]),
> >>> you change the task member so it is renamed (current->bio_list vs
> >>> current->bio_lists, plural, is what I did last year).
> >>> Or you will break external modules, silently, and horribly (or,
> >>> rather, they won't notice, but break the kernel).
> >>> Examples of such modules would be DRBD, ZFS, quite possibly others.

> > It's better to make external modules not compile than to silently 
> > introduce bugs in them. So yes, I would rename that.
> > 
> > Mikulas
> 
> Agree, better rename current->bio_list to current->bio_lists
>
> Regards,
> Jack

Thank you.

(I don't know if someone does, but...)
Thing is: *IF* some external thing messes with
current->bio_list in "interesting" ways, and not just the
"I don't care, one level of real recursion fixes this for me"
pattern of
	struct bio_list *tmp = current->bio_list;
	current->bio_list = NULL;	/* one level of real recursion */
	submit_bio(bio);
	current->bio_list = tmp;
you get a chance of stack corruption,
without so much as a compiler warning.
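
For instance, a hypothetical helper like this (illustrative only, not
quoted from any real module) still compiles and appears to work, but it
only ever sees element [0] of the new array; whatever sits in element
[1] is invisible to it, so its "everything pending is now flushed"
assumption becomes silently false:

	static void flush_pending(void)
	{
		struct bio_list *stashed = current->bio_list;
		struct bio *bio;

		current->bio_list = NULL;
		while ((bio = bio_list_pop(stashed)) != NULL)
			generic_make_request(bio);	/* real recursion now */
		current->bio_list = stashed;
	}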

Which is why I wrote https://lkml.org/lkml/2016/7/8/469
...

Instead, I suggest to distinguish between recursive calls to
generic_make_request(), and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
	struct recursion_to_iteration_bio_lists {
		struct bio_list recursion;
		struct bio_list queue;
	}

By providing each q->make_request_fn() with an empty "recursion"
bio_list, then merging any recursively submitted bios to the
head of the "queue" list, we can make the recursion-to-iteration
logic in generic_make_request() process deepest level bios first,
and "sibling" bios of the same level in "natural" order.

...
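
In loop form, that description amounts to roughly this (my sketch, not
the actual patch behind the link; checks, error handling and the
blk_queue_split() push-back are all elided, and "bio_lists" is the
renamed task member proposed above):

	blk_qc_t generic_make_request(struct bio *bio)
	{
		struct recursion_to_iteration_bio_lists r2i = {
			.recursion = BIO_EMPTY_LIST,
			.queue     = BIO_EMPTY_LIST,
		};

		if (current->bio_lists) {
			/* recursive call: park the child for the iteration */
			bio_list_add(&current->bio_lists->recursion, bio);
			return BLK_QC_T_NONE;
		}

		current->bio_lists = &r2i;
		do {
			struct request_queue *q = bdev_get_queue(bio->bi_bdev);

			bio_list_init(&r2i.recursion);	/* fresh and empty */
			q->make_request_fn(q, bio);	/* children collect here */
			/* children overtake their parent's siblings */
			bio_list_merge_head(&r2i.queue, &r2i.recursion);
		} while ((bio = bio_list_pop(&r2i.queue)) != NULL);
		current->bio_lists = NULL;

		return BLK_QC_T_NONE;
	}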

Cheers,

	Lars

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
  2017-03-10 12:34               ` Lars Ellenberg
@ 2017-03-11  0:47                   ` NeilBrown
  2017-03-11  0:47                   ` NeilBrown
  1 sibling, 0 replies; 37+ messages in thread
From: NeilBrown @ 2017-03-11  0:47 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Jens Axboe, linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Kent Overstreet


[-- Attachment #1.1: Type: text/plain, Size: 3001 bytes --]

On Fri, Mar 10 2017, Lars Ellenberg wrote:

>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>>   */
>>  blk_qc_t generic_make_request(struct bio *bio)
>>  {
>> -       struct bio_list bio_list_on_stack;
>> +       /*
>> +        * bio_list_on_stack[0] contains bios submitted by the current
>> +        * make_request_fn.
>> +        * bio_list_on_stack[1] contains bios that were submitted before
>> +        * the current make_request_fn, but that haven't been processed
>> +        * yet.
>> +        */
>> +       struct bio_list bio_list_on_stack[2];
>>         blk_qc_t ret = BLK_QC_T_NONE;
>
> May I suggest that, if you intend to assign something that is not a
> plain &(struct bio_list), but a &(struct bio_list[2]),
> you change the task member so it is renamed (current->bio_list vs
> current->bio_lists, plural, is what I did last year).
> Or you will break external modules, silently, and horribly (or,
> rather, they won't notice, but break the kernel).
> Examples of such modules would be DRBD, ZFS, quite possibly others.
>

This is exactly what I did in my first draft (bio_list -> bio_lists),
but then I reverted that change because it didn't seem to be worth the
noise.
It isn't much noise: sched.h, bcache/btree.c, md/dm-bufio.c, and
md/raid1.c get minor changes.
But as I'm hoping to get rid of all of those uses, renaming before
removing seemed pointless ... though admittedly that is what I did for
bioset_create().  I wondered about that too.

The example you give later:
	struct bio_list *tmp = current->bio_list;
	current->bio_list = NULL;	/* one level of real recursion */
	submit_bio(bio);
	current->bio_list = tmp;

won't cause any problem.  Whatever lists the parent generic_make_request
is holding onto will be untouched during the submit_bio() call, and will
be exactly as it expects them when this caller returns.
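
For reference, the reordering step in v2 is roughly this (condensed;
generic_make_request_checks() and the blk_queue_enter()/exit() handling
are elided):

	blk_qc_t generic_make_request(struct bio *bio)
	{
		struct bio_list bio_list_on_stack[2];
		blk_qc_t ret = BLK_QC_T_NONE;

		if (current->bio_list) {
			/* recursive call: queue on the active iteration */
			bio_list_add(&current->bio_list[0], bio);
			return BLK_QC_T_NONE;
		}

		bio_list_init(&bio_list_on_stack[0]);
		current->bio_list = bio_list_on_stack;
		do {
			struct request_queue *q = bdev_get_queue(bio->bi_bdev);
			struct bio_list lower, same;

			/* stash what was already queued; hand the
			 * make_request_fn a fresh list for its children */
			bio_list_on_stack[1] = bio_list_on_stack[0];
			bio_list_init(&bio_list_on_stack[0]);
			ret = q->make_request_fn(q, bio);

			/* sort the children: bios for lower devices first,
			 * then bios for this device, then the stash */
			bio_list_init(&lower);
			bio_list_init(&same);
			while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
				if (q == bdev_get_queue(bio->bi_bdev))
					bio_list_add(&same, bio);
				else
					bio_list_add(&lower, bio);
			bio_list_merge(&bio_list_on_stack[0], &lower);
			bio_list_merge(&bio_list_on_stack[0], &same);
			bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);

			bio = bio_list_pop(&bio_list_on_stack[0]);
		} while (bio);
		current->bio_list = NULL;

		return ret;
	}

Only the task pointer moves; the lists themselves live on this stack
frame, which is why the stash-and-restore pattern above stays safe.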

If some out-of-tree code does anything with ->bio_list that makes sense
with the previous code, then it will still make sense with the new
code.  However, there will be a few bios that it doesn't get to look at.
These will all be bios that were submitted by a device further up the
stack (closer to the filesystem), so they *should* be irrelevant.
I could probably come up with some weird behaviour that might have
worked before but now wouldn't quite work the same way.  But just fixing
bugs can sometimes affect an out-of-tree driver in a strange way because
it was relying on those bugs.

I hope that I'll soon be able to remove punt_bios_to_rescuer and
flush_current_bio_list, after which current->bio_list can really be
just a list again.  I don't think it is worth changing the name for a
transient situation.

But thanks for the review - it encouraged me to think through the
consequences again and I'm now more confident.
I actually now think that change probably wasn't necessary.  It is
safer though.  It ensures that current functionality isn't removed
without a clear justification.

Thanks,
NeilBrown


[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2017-03-11  0:47 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-03  5:14 [PATCH] blk: improve order of bio handling in generic_make_request() NeilBrown
2017-03-03  9:28 ` Jack Wang
2017-03-06  4:40   ` NeilBrown
2017-03-06  9:43     ` Jack Wang
2017-03-07 15:46       ` Pavel Machek
2017-03-07 15:53         ` Jack Wang
2017-03-07 16:21         ` Jens Axboe
2017-03-06 20:18     ` Jens Axboe
2017-03-07  8:49       ` Jack Wang
2017-03-07 16:52         ` Mike Snitzer
2017-03-07 17:05           ` Jens Axboe
2017-03-07 17:14             ` Mike Snitzer
2017-03-07 20:29               ` NeilBrown
2017-03-07 23:01                 ` Mike Snitzer
2017-03-08 16:40                 ` Mikulas Patocka
2017-03-08 17:15                   ` Lars Ellenberg
2017-03-09  6:08                   ` NeilBrown
2017-03-08 11:46           ` Lars Ellenberg
2017-03-07 20:38       ` [PATCH v2] " NeilBrown
2017-03-07 20:38         ` NeilBrown
2017-03-10  4:32         ` NeilBrown
2017-03-10  4:33           ` [PATCH 1/5 v3] " NeilBrown
2017-03-10  4:34           ` [PATCH 2/5] blk: remove bio_set arg from blk_queue_split() NeilBrown
2017-03-10  4:35           ` [PATCH 3/5] blk: make the bioset rescue_workqueue optional NeilBrown
2017-03-10  4:36           ` [PATCH 4/5] blk: use non-rescuing bioset for q->bio_split NeilBrown
2017-03-10  4:37           ` [PATCH 5/5] block_dev: make blkdev_dio_pool a non-rescuing bioset NeilBrown
2017-03-10  4:38           ` [PATCH v2] blk: improve order of bio handling in generic_make_request() Jens Axboe
2017-03-10  4:40             ` Jens Axboe
2017-03-10  5:19             ` NeilBrown
2017-03-10 12:34               ` Lars Ellenberg
2017-03-10 14:38                 ` Mike Snitzer
2017-03-10 14:55                   ` Mikulas Patocka
2017-03-10 15:07                     ` Jack Wang
2017-03-10 15:35                       ` Mike Snitzer
2017-03-10 18:51                       ` Lars Ellenberg
2017-03-11  0:47                 ` NeilBrown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.