* [PATCH 0/2] bio order fix backport for 4.4
@ 2017-04-03  9:41 Jack Wang
  2017-04-03  9:41 ` [PATCH 1/2] blk: improve order of bio handling in generic_make_request() Jack Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Jack Wang @ 2017-04-03  9:41 UTC (permalink / raw)
  To: stable, gregkh; +Cc: yun.wang, Jack Wang

From: Jack Wang <jinpu.wang@profitbricks.com>

Hi,

Please consider applying both fixes for bio ordering; they fix a deadlock in
MD/DRBD.

I backported to 4.4, and also tested on 4.4.50.

Both patches apply cleanly on 4.4.59.

NeilBrown (2):
  blk: improve order of bio handling in generic_make_request()
  blk: Ensure users for current->bio_list can see the full list.

 block/bio.c         | 12 +++++++++---
 block/blk-core.c    | 40 +++++++++++++++++++++++++++++++---------
 drivers/md/raid1.c  |  3 ++-
 drivers/md/raid10.c |  3 ++-
 4 files changed, 44 insertions(+), 14 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] blk: improve order of bio handling in generic_make_request()
  2017-04-03  9:41 [PATCH 0/2] bio order fix backport for 4.4 Jack Wang
@ 2017-04-03  9:41 ` Jack Wang
  2017-04-03  9:41 ` [PATCH 2/2] blk: Ensure users for current->bio_list can see the full list Jack Wang
  2017-04-06  7:43 ` [PATCH 0/2] bio order fix backport for 4.4 Greg KH
  2 siblings, 0 replies; 5+ messages in thread
From: Jack Wang @ 2017-04-03  9:41 UTC (permalink / raw)
  To: stable, gregkh; +Cc: yun.wang, NeilBrown, Jens Axboe, Jack Wang

From: NeilBrown <neilb@suse.com>

commit 79bd99596b7305ab08109a8bf44a6a4511dbf1cd upstream.

To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling.  They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially.  If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete.  This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device.  Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order.  That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent.  That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue.  However, this would change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn.  After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.
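
The lower/same/hold merge described above can be sketched as a toy model in
plain C.  This is not kernel code: the list type and helper names below are
illustrative stand-ins for the kernel's bio_list API, and plain integers stand
in for bios (tens digit = target device, ones digit = submission order).

```c
#include <assert.h>
#include <string.h>

#define MAXB 16

/* A minimal FIFO "bio_list" of small integers. */
struct list { int v[MAXB]; int n; };

static void push(struct list *l, int x) { l->v[l->n++] = x; }

static int pop(struct list *l)              /* pop from the front, -1 if empty */
{
	if (l->n == 0)
		return -1;
	int x = l->v[0];
	memmove(l->v, l->v + 1, (--l->n) * sizeof(int));
	return x;
}

static void merge(struct list *dst, struct list *src)  /* append src to dst */
{
	for (int i = 0; i < src->n; i++)
		push(dst, src->v[i]);
	src->n = 0;
}

/* Reorder: given the bios a make_request_fn just queued (fresh) and those
 * already pending (hold), assemble lower-device bios first, then same-device
 * bios, then the held ones -- mirroring the lower/same/hold merge. */
void assemble(struct list *out, struct list *fresh, struct list *hold, int cur_dev)
{
	struct list lower = {{0}, 0}, same = {{0}, 0};
	int x;

	while ((x = pop(fresh)) != -1) {
		if (x / 10 == cur_dev)
			push(&same, x);
		else
			push(&lower, x);
	}
	out->n = 0;
	merge(out, &lower);
	merge(out, &same);
	merge(out, hold);
}
```

With fresh bios {11, 21, 12, 22} queued while handling device 1 and {13} held,
the assembled order is 21, 22, 11, 12, 13: both device-2 bios run before the
remaining device-1 bios, and relative order within each device is preserved.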

This, by itself, is not enough to remove all deadlocks.  It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request.  This includes never
allocating from a mempool twice in one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return.  The new code in generic_make_request will ensure the
requests to underlying devices are processed first, then the second bio
that was split off.  If it splits again, the same process happens.  In
each case one bio will be completely handled before the next one is attempted.
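
The recommended split-once-and-requeue pattern can be sketched with a
hypothetical model (not driver code: `submit` stands in for
generic_make_request, `make_request` for a driver's make_request_fn, and an
integer sector count for a bio):

```c
#include <assert.h>

#define CHUNK 4   /* largest piece this toy "driver" handles at once */
#define QMAX  32

static int queue[QMAX];
static int qhead, qtail;
static int handled[QMAX], nhandled;

/* Stand-in for generic_make_request(): just queue the work. */
static void submit(int sectors)
{
	queue[qtail++] = sectors;
}

/* A make_request_fn that never loops over splits: split once,
 * requeue the second part, handle the first part, and return. */
static void make_request(int sectors)
{
	if (sectors > CHUNK) {
		submit(sectors - CHUNK);  /* queue the second part ... */
		sectors = CHUNK;          /* ... handle only the first */
	}
	handled[nhandled++] = sectors;    /* this piece is completely handled */
}

/* Dispatcher, like the do/while loop in generic_make_request(). */
void run(int sectors)
{
	qhead = qtail = nhandled = 0;
	submit(sectors);
	while (qhead < qtail)
		make_request(queue[qhead++]);
}
```

Running `run(10)` handles pieces of 4, 4, and 2 sectors in turn: each split-off
remainder is fully dispatched before the next, without the driver ever looping.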

With this in place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
[jwang: backport to 4.4]
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/blk-core.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4fab5d6..7a58a22 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2063,18 +2063,33 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			hold = bio_list_on_stack;
+			bio_list_init(&bio_list_on_stack);
 
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
-
-			bio = bio_list_pop(current->bio_list);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(current->bio_list);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 
-- 
2.7.4


* [PATCH 2/2] blk: Ensure users for current->bio_list can see the full list.
  2017-04-03  9:41 [PATCH 0/2] bio order fix backport for 4.4 Jack Wang
  2017-04-03  9:41 ` [PATCH 1/2] blk: improve order of bio handling in generic_make_request() Jack Wang
@ 2017-04-03  9:41 ` Jack Wang
  2017-04-06  7:43 ` [PATCH 0/2] bio order fix backport for 4.4 Greg KH
  2 siblings, 0 replies; 5+ messages in thread
From: Jack Wang @ 2017-04-03  9:41 UTC (permalink / raw)
  To: stable, gregkh; +Cc: yun.wang, NeilBrown, Jens Axboe, Jack Wang

From: NeilBrown <neilb@suse.com>

commit f5fe1b51905df7cfe4fdfd85c5fb7bc5b71a094f upstream.

Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
changed current->bio_list so that it did not contain *all* of the
queued bios, but only those submitted by the currently running
make_request_fn.

There are two places which walk the list and requeue selected bios,
and others that check if the list is empty.  These are no longer
correct.

So redefine current->bio_list to point to an array of two lists, which
contain all queued bios, and adjust various code to test or walk both
lists.
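
Under the new definition, any "are bios pending?" test must look at both
halves of the array, as the raid1/raid10 hunks below do.  A minimal sketch of
that check (illustrative types, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

struct bio { struct bio *next; };
struct bio_list { struct bio *head, *tail; };

static int bio_list_empty(const struct bio_list *bl)
{
	return bl->head == NULL;
}

/* A bio is pending only if either list in the two-element
 * current->bio_list array is non-empty. */
int bios_pending(const struct bio_list *bl /* array of 2, may be NULL */)
{
	return bl &&
	       (!bio_list_empty(&bl[0]) || !bio_list_empty(&bl[1]));
}
```

Checking only `bl[0]` would miss bios queued before the current
make_request_fn started, which is exactly the bug this patch fixes.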

Signed-off-by: NeilBrown <neilb@suse.com>
Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
Signed-off-by: Jens Axboe <axboe@fb.com>
[jwang: backport to 4.4]
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/bio.c         | 12 +++++++++---
 block/blk-core.c    | 31 +++++++++++++++++++------------
 drivers/md/raid1.c  |  3 ++-
 drivers/md/raid10.c |  3 ++-
 4 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 46e2cc1..14263fa 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -373,10 +373,14 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	bio_list_init(&punt);
 	bio_list_init(&nopunt);
 
-	while ((bio = bio_list_pop(current->bio_list)))
+	while ((bio = bio_list_pop(&current->bio_list[0])))
 		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_list[0] = nopunt;
 
-	*current->bio_list = nopunt;
+	bio_list_init(&nopunt);
+	while ((bio = bio_list_pop(&current->bio_list[1])))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_list[1] = nopunt;
 
 	spin_lock(&bs->rescue_lock);
 	bio_list_merge(&bs->rescue_list, &punt);
@@ -464,7 +468,9 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 * we retry with the original gfp_flags.
 		 */
 
-		if (current->bio_list && !bio_list_empty(current->bio_list))
+		if (current->bio_list &&
+		    (!bio_list_empty(&current->bio_list[0]) ||
+		     !bio_list_empty(&current->bio_list[1])))
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
diff --git a/block/blk-core.c b/block/blk-core.c
index 7a58a22..ef083e7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2021,7 +2021,14 @@ end_io:
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	/*
+	 * bio_list_on_stack[0] contains bios submitted by the current
+	 * make_request_fn.
+	 * bio_list_on_stack[1] contains bios that were submitted before
+	 * the current make_request_fn, but that haven't been processed
+	 * yet.
+	 */
+	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -2038,7 +2045,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * should be added at the tail
 	 */
 	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+		bio_list_add(&current->bio_list[0], bio);
 		goto out;
 	}
 
@@ -2057,17 +2064,17 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
+	bio_list_init(&bio_list_on_stack[0]);
+	current->bio_list = bio_list_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
-			struct bio_list lower, same, hold;
+			struct bio_list lower, same;
 
 			/* Create a fresh bio_list for all subordinate requests */
-			hold = bio_list_on_stack;
-			bio_list_init(&bio_list_on_stack);
+			bio_list_on_stack[1] = bio_list_on_stack[0];
+			bio_list_init(&bio_list_on_stack[0]);
 
 			ret = q->make_request_fn(q, bio);
 
@@ -2077,19 +2084,19 @@ blk_qc_t generic_make_request(struct bio *bio)
 			 */
 			bio_list_init(&lower);
 			bio_list_init(&same);
-			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+			while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
 				if (q == bdev_get_queue(bio->bi_bdev))
 					bio_list_add(&same, bio);
 				else
 					bio_list_add(&lower, bio);
 			/* now assemble so we handle the lowest level first */
-			bio_list_merge(&bio_list_on_stack, &lower);
-			bio_list_merge(&bio_list_on_stack, &same);
-			bio_list_merge(&bio_list_on_stack, &hold);
+			bio_list_merge(&bio_list_on_stack[0], &lower);
+			bio_list_merge(&bio_list_on_stack[0], &same);
+			bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
 		} else {
 			bio_io_error(bio);
 		}
-		bio = bio_list_pop(current->bio_list);
+		bio = bio_list_pop(&bio_list_on_stack[0]);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 515554c..9be3998 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -877,7 +877,8 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 				     ((conf->start_next_window <
 				       conf->next_resync + RESYNC_SECTORS) &&
 				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
+				     (!bio_list_empty(&current->bio_list[0]) ||
+				      !bio_list_empty(&current->bio_list[1])))),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a92979e..e5ee4e9 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -946,7 +946,8 @@ static void wait_barrier(struct r10conf *conf)
 				    !conf->barrier ||
 				    (conf->nr_pending &&
 				     current->bio_list &&
-				     !bio_list_empty(current->bio_list)),
+				     (!bio_list_empty(&current->bio_list[0]) ||
+				      !bio_list_empty(&current->bio_list[1]))),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
-- 
2.7.4


* Re: [PATCH 0/2] bio order fix backport for 4.4
  2017-04-03  9:41 [PATCH 0/2] bio order fix backport for 4.4 Jack Wang
  2017-04-03  9:41 ` [PATCH 1/2] blk: improve order of bio handling in generic_make_request() Jack Wang
  2017-04-03  9:41 ` [PATCH 2/2] blk: Ensure users for current->bio_list can see the full list Jack Wang
@ 2017-04-06  7:43 ` Greg KH
  2017-04-06  7:50   ` Jinpu Wang
  2 siblings, 1 reply; 5+ messages in thread
From: Greg KH @ 2017-04-06  7:43 UTC (permalink / raw)
  To: Jack Wang; +Cc: stable, yun.wang

On Mon, Apr 03, 2017 at 11:41:17AM +0200, Jack Wang wrote:
> From: Jack Wang <jinpu.wang@profitbricks.com>
> 
> Hi,
> 
> Please consider applying both fixes for bio ordering; they fix a deadlock in
> MD/DRBD.
> 
> I backported to 4.4, and also tested on 4.4.50.
> 
> Both patches apply cleanly on 4.4.59.

Any reason these shouldn't also go into 4.10 and 4.9-stable?  They look
like they belong there, and I don't want patches in 4.4 that are not in
other stable trees, that wouldn't make any sense and would cause
regressions when people upgrade.

thanks,

greg k-h


* Re: [PATCH 0/2] bio order fix backport for 4.4
  2017-04-06  7:43 ` [PATCH 0/2] bio order fix backport for 4.4 Greg KH
@ 2017-04-06  7:50   ` Jinpu Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Jinpu Wang @ 2017-04-06  7:50 UTC (permalink / raw)
  To: Greg KH; +Cc: stable, Michael Wang

On Thu, Apr 6, 2017 at 9:43 AM, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Mon, Apr 03, 2017 at 11:41:17AM +0200, Jack Wang wrote:
>> From: Jack Wang <jinpu.wang@profitbricks.com>
>>
>> Hi,
>>
>> Please consider applying both fixes for bio ordering; they fix a deadlock in
>> MD/DRBD.
>>
>> I backported to 4.4, and also tested on 4.4.50.
>>
>> Both patches apply cleanly on 4.4.59.
>
> Any reason these shouldn't also go into 4.10 and 4.9-stable?  They look
> like they belong there, and I don't want patches in 4.4 that are not in
> other stable trees, that wouldn't make any sense and would cause
> regressions when people upgrade.
>
> thanks,
>
> greg k-h

Thanks, Greg,

It makes sense to also apply them to 4.9 and 4.10.

-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

