All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jack Wang <jinpu.wang@profitbricks.com>
To: NeilBrown <neilb@suse.com>, Jens Axboe <axboe@kernel.dk>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Lars Ellenberg <lars.ellenberg@linbit.com>,
	Kent Overstreet <kent.overstreet@gmail.com>,
	Pavel Machek <pavel@ucw.cz>, Mike Snitzer <snitzer@redhat.com>,
	Mikulas Patocka <mpatocka@redhat.com>
Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()
Date: Fri, 3 Mar 2017 10:28:44 +0100	[thread overview]
Message-ID: <71562c2c-97f4-9a0a-32ec-30e0702ca575@profitbricks.com> (raw)
In-Reply-To: <87h93blz6g.fsf@notabene.neil.brown.name>



On 03.03.2017 06:14, NeilBrown wrote:
> 
> [ Hi Jens,
>   you might have seen assorted email threads recently about
>   deadlocks, particular in dm-snap or md/raid1/10.  Also about
>   the excess of rescuer threads.
>   I think a big part of the problem is my ancient improvement
>   to generic_make_request to queue bios and handle them in
>   a strict FIFO order.  As described below, that can cause
>   problems which individual block devices cannot fix themselves
>   without punting to various threads.
>   This patch does not fix everything, but provides a basis that
>   drives can build on to create dead-lock free solutions without
>   excess threads.
>   If you accept this, I will look into improving at least md
>   and bio_alloc_set() to be less dependant on rescuer threads.
>   Thanks,
>   NeilBrown
>  ]
> 
> 
> To avoid recursion on the kernel stack when stacked block devices
> are in use, generic_make_request() will, when called recursively,
> queue new requests for later handling.  They will be handled when the
> make_request_fn for the current bio completes.
> 
> If any bios are submitted by a make_request_fn, these will ultimately
> handled seqeuntially.  If the handling of one of those generates
> further requests, they will be added to the end of the queue.
> 
> This strict first-in-first-out behaviour can lead to deadlocks in
> various ways, normally because a request might need to wait for a
> previous request to the same device to complete.  This can happen when
> they share a mempool, and can happen due to interdependencies
> particular to the device.  Both md and dm have examples where this happens.
> 
> These deadlocks can be erradicated by more selective ordering of bios.
> Specifically by handling them in depth-first order.  That is: when the
> handling of one bio generates one or more further bios, they are
> handled immediately after the parent, before any siblings of the
> parent.  That way, when generic_make_request() calls make_request_fn
> for some particular device, it we can be certain that all previously
> submited request for that device have been completely handled and are
> not waiting for anything in the queue of requests maintained in
> generic_make_request().
> 
> An easy way to achieve this would be to use a last-in-first-out stack
> instead of a queue.  However this will change the order of consecutive
> bios submitted by a make_request_fn, which could have unexpected consequences.
> Instead we take a slightly more complex approach.
> A fresh queue is created for each call to a make_request_fn.  After it completes,
> any bios for a different device are placed on the front of the main queue, followed
> by any bios for the same device, followed by all bios that were already on
> the queue before the make_request_fn was called.
> This provides the depth-first approach without reordering bios on the same level.
> 
> This, by itself, it not enough to remove the deadlocks.  It just makes
> it possible for drivers to take the extra step required themselves.
> 
> To avoid deadlocks, drivers must never risk waiting for a request
> after submitting one to generic_make_request.  This includes never
> allocing from a mempool twice in the one call to a make_request_fn.
> 
> A common pattern in drivers is to call bio_split() in a loop, handling
> the first part and then looping around to possibly split the next part.
> Instead, a driver that finds it needs to split a bio should queue
> (with generic_make_request) the second part, handle the first part,
> and then return.  The new code in generic_make_request will ensure the
> requests to underlying bios are processed first, then the second bio
> that was split off.  If it splits again, the same process happens.  In
> each case one bio will be completely handled before the next one is attempted.
> 
> With this is place, it should be possible to disable the
> punt_bios_to_recover() recovery thread for many block devices, and
> eventually it may be possible to remove it completely.
> 
> Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
> Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  block/blk-core.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b9e857f4afe8..ef55f210dd7c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>  		if (likely(blk_queue_enter(q, false) == 0)) {
> +			struct bio_list hold;
> +			struct bio_list lower, same;
> +
> +			/* Create a fresh bio_list for all subordinate requests */
> +			bio_list_init(&hold);
> +			bio_list_merge(&hold, &bio_list_on_stack);
> +			bio_list_init(&bio_list_on_stack);
>  			ret = q->make_request_fn(q, bio);
>  
>  			blk_queue_exit(q);
>  
> +			/* sort new bios into those for a lower level
> +			 * and those for the same level
> +			 */
> +			bio_list_init(&lower);
> +			bio_list_init(&same);
> +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> +				if (q == bdev_get_queue(bio->bi_bdev))
> +					bio_list_add(&same, bio);
> +				else
> +					bio_list_add(&lower, bio);
> +			/* now assemble so we handle the lowest level first */
> +			bio_list_merge(&bio_list_on_stack, &lower);
> +			bio_list_merge(&bio_list_on_stack, &same);
> +			bio_list_merge(&bio_list_on_stack, &hold);
> +
>  			bio = bio_list_pop(current->bio_list);
>  		} else {
>  			struct bio *bio_next = bio_list_pop(current->bio_list);
> 

Thanks Neil for pushing the fix.

We can optimize generic_make_request a little bit:
- assign bio_list struct hold directly instead init and merge
- remove duplicate code

I think better to squash into your fix.
---
 block/blk-core.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3bc7202..b29b7e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 			struct bio_list lower, same, hold;
 
 			/* Create a fresh bio_list for all subordinate requests */
-			bio_list_init(&hold);
-			bio_list_merge(&hold, &bio_list_on_stack);
+			hold = bio_list_on_stack;
 			bio_list_init(&bio_list_on_stack);
 
 			ret = q->make_request_fn(q, bio);
@@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
 			bio_list_merge(&bio_list_on_stack, &lower);
 			bio_list_merge(&bio_list_on_stack, &same);
 			bio_list_merge(&bio_list_on_stack, &hold);
-
-			bio = bio_list_pop(current->bio_list);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(current->bio_list);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 
-- 

Regards,
Jack 

  reply	other threads:[~2017-03-03  9:29 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-03  5:14 [PATCH] blk: improve order of bio handling in generic_make_request() NeilBrown
2017-03-03  9:28 ` Jack Wang [this message]
2017-03-06  4:40   ` NeilBrown
2017-03-06  9:43     ` Jack Wang
2017-03-07 15:46       ` Pavel Machek
2017-03-07 15:53         ` Jack Wang
2017-03-07 16:21         ` Jens Axboe
2017-03-06 20:18     ` Jens Axboe
2017-03-07  8:49       ` Jack Wang
2017-03-07 16:52         ` Mike Snitzer
2017-03-07 17:05           ` Jens Axboe
2017-03-07 17:14             ` Mike Snitzer
2017-03-07 20:29               ` NeilBrown
2017-03-07 23:01                 ` Mike Snitzer
2017-03-08 16:40                 ` Mikulas Patocka
2017-03-08 17:15                   ` Lars Ellenberg
2017-03-09  6:08                   ` NeilBrown
2017-03-08 11:46           ` Lars Ellenberg
2017-03-07 20:38       ` [PATCH v2] " NeilBrown
2017-03-07 20:38         ` NeilBrown
2017-03-10  4:32         ` NeilBrown
2017-03-10  4:33           ` [PATCH 1/5 v3] " NeilBrown
2017-03-10  4:34           ` [PATCH 2/5] blk: remove bio_set arg from blk_queue_split() NeilBrown
2017-03-10  4:35           ` [PATCH 3/5] blk: make the bioset rescue_workqueue optional NeilBrown
2017-03-10  4:36           ` [PATCH 4/5] blk: use non-rescuing bioset for q->bio_split NeilBrown
2017-03-10  4:37           ` [PATCH 5/5] block_dev: make blkdev_dio_pool a non-rescuing bioset NeilBrown
2017-03-10  4:38           ` [PATCH v2] blk: improve order of bio handling in generic_make_request() Jens Axboe
2017-03-10  4:40             ` Jens Axboe
2017-03-10  5:19             ` NeilBrown
2017-03-10 12:34               ` Lars Ellenberg
2017-03-10 14:38                 ` Mike Snitzer
2017-03-10 14:55                   ` Mikulas Patocka
2017-03-10 15:07                     ` Jack Wang
2017-03-10 15:35                       ` Mike Snitzer
2017-03-10 18:51                       ` Lars Ellenberg
2017-03-11  0:47                 ` NeilBrown
2017-03-11  0:47                   ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=71562c2c-97f4-9a0a-32ec-30e0702ca575@profitbricks.com \
    --to=jinpu.wang@profitbricks.com \
    --cc=axboe@kernel.dk \
    --cc=kent.overstreet@gmail.com \
    --cc=lars.ellenberg@linbit.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=neilb@suse.com \
    --cc=pavel@ucw.cz \
    --cc=snitzer@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.