From: NeilBrown <neilb@suse.com> To: Jens Axboe <axboe@kernel.dk>, Jack Wang <jinpu.wang@profitbricks.com> Cc: linux-block@vger.kernel.org, Mike Snitzer <snitzer@redhat.com>, LKML <linux-kernel@vger.kernel.org>, linux-raid@vger.kernel.org, device-mapper development <dm-devel@redhat.com>, Mikulas Patocka <mpatocka@redhat.com>, Pavel Machek <pavel@ucw.cz>, Lars Ellenberg <lars.ellenberg@linbit.com>, Kent Overstreet <kent.overstreet@gmail.com> Subject: [PATCH v2] blk: improve order of bio handling in generic_make_request() Date: Wed, 08 Mar 2017 07:38:05 +1100 [thread overview] Message-ID: <87r328j00i.fsf@notabene.neil.brown.name> (raw) In-Reply-To: <a674456d-fb93-437e-c50e-195e7a035ba4@kernel.dk> [-- Attachment #1.1: Type: text/plain, Size: 5090 bytes --] To avoid recursion on the kernel stack when stacked block devices are in use, generic_make_request() will, when called recursively, queue new requests for later handling. They will be handled when the make_request_fn for the current bio completes. If any bios are submitted by a make_request_fn, these will ultimately be handled seqeuntially. If the handling of one of those generates further requests, they will be added to the end of the queue. This strict first-in-first-out behaviour can lead to deadlocks in various ways, normally because a request might need to wait for a previous request to the same device to complete. This can happen when they share a mempool, and can happen due to interdependencies particular to the device. Both md and dm have examples where this happens. These deadlocks can be erradicated by more selective ordering of bios. Specifically by handling them in depth-first order. That is: when the handling of one bio generates one or more further bios, they are handled immediately after the parent, before any siblings of the parent. That way, when generic_make_request() calls make_request_fn for some particular device, we can be certain that all previously submited requests for that device have been completely handled and are not waiting for anything in the queue of requests maintained in generic_make_request(). An easy way to achieve this would be to use a last-in-first-out stack instead of a queue. However this will change the order of consecutive bios submitted by a make_request_fn, which could have unexpected consequences. Instead we take a slightly more complex approach. A fresh queue is created for each call to a make_request_fn. After it completes, any bios for a different device are placed on the front of the main queue, followed by any bios for the same device, followed by all bios that were already on the queue before the make_request_fn was called. This provides the depth-first approach without reordering bios on the same level. This, by itself, it not enough to remove all deadlocks. It just makes it possible for drivers to take the extra step required themselves. To avoid deadlocks, drivers must never risk waiting for a request after submitting one to generic_make_request. This includes never allocing from a mempool twice in the one call to a make_request_fn. A common pattern in drivers is to call bio_split() in a loop, handling the first part and then looping around to possibly split the next part. Instead, a driver that finds it needs to split a bio should queue (with generic_make_request) the second part, handle the first part, and then return. The new code in generic_make_request will ensure the requests to underlying bios are processed first, then the second bio that was split off. If it splits again, the same process happens. In each case one bio will be completely handled before the next one is attempted. With this is place, it should be possible to disable the punt_bios_to_recover() recovery thread for many block devices, and eventually it may be possible to remove it completely. Ref: http://www.spinics.net/lists/raid/msg54680.html Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com> Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NeilBrown <neilb@suse.com> --- block/blk-core.c | 25 +++++++++++++++++++++---- 1 file changed, 21 insertions(+), 4 deletions(-) Changes since v1: - merge code improvements from Jack Wang - more edits to changelog comment - add Ref: link. - Add some lists to Cc, that should have been there the first time. diff --git a/block/blk-core.c b/block/blk-core.c index b9e857f4afe8..9520e82aa78c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2018,17 +2018,34 @@ blk_qc_t generic_make_request(struct bio *bio) struct request_queue *q = bdev_get_queue(bio->bi_bdev); if (likely(blk_queue_enter(q, false) == 0)) { + struct bio_list hold; + struct bio_list lower, same; + + /* Create a fresh bio_list for all subordinate requests */ + hold = bio_list_on_stack; + bio_list_init(&bio_list_on_stack); ret = q->make_request_fn(q, bio); blk_queue_exit(q); - bio = bio_list_pop(current->bio_list); + /* sort new bios into those for a lower level + * and those for the same level + */ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) + if (q == bdev_get_queue(bio->bi_bdev)) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack, &lower); + bio_list_merge(&bio_list_on_stack, &same); + bio_list_merge(&bio_list_on_stack, &hold); } else { - struct bio *bio_next = bio_list_pop(current->bio_list); - bio_io_error(bio); - bio = bio_next; } + bio = bio_list_pop(current->bio_list); } while (bio); current->bio_list = NULL; /* deactivate */ -- 2.12.0 [-- Attachment #1.2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --]
WARNING: multiple messages have this Message-ID (diff)
From: NeilBrown <neilb@suse.com> To: Jens Axboe <axboe@kernel.dk>, Jack Wang <jinpu.wang@profitbricks.com> Cc: LKML <linux-kernel@vger.kernel.org>, Lars Ellenberg <lars.ellenberg@linbit.com>, Kent Overstreet <kent.overstreet@gmail.com>, Pavel Machek <pavel@ucw.cz>, Mike Snitzer <snitzer@redhat.com>, Mikulas Patocka <mpatocka@redhat.com> Cc: linux-raid@vger.kernel.org, device-mapper development <dm-devel@redhat.com>, linux-block@vger.kernel.org Subject: [PATCH v2] blk: improve order of bio handling in generic_make_request() Date: Wed, 08 Mar 2017 07:38:05 +1100 [thread overview] Message-ID: <87r328j00i.fsf@notabene.neil.brown.name> (raw) In-Reply-To: <a674456d-fb93-437e-c50e-195e7a035ba4@kernel.dk> [-- Attachment #1: Type: text/plain, Size: 5090 bytes --] To avoid recursion on the kernel stack when stacked block devices are in use, generic_make_request() will, when called recursively, queue new requests for later handling. They will be handled when the make_request_fn for the current bio completes. If any bios are submitted by a make_request_fn, these will ultimately be handled seqeuntially. If the handling of one of those generates further requests, they will be added to the end of the queue. This strict first-in-first-out behaviour can lead to deadlocks in various ways, normally because a request might need to wait for a previous request to the same device to complete. This can happen when they share a mempool, and can happen due to interdependencies particular to the device. Both md and dm have examples where this happens. These deadlocks can be erradicated by more selective ordering of bios. Specifically by handling them in depth-first order. That is: when the handling of one bio generates one or more further bios, they are handled immediately after the parent, before any siblings of the parent. That way, when generic_make_request() calls make_request_fn for some particular device, we can be certain that all previously submited requests for that device have been completely handled and are not waiting for anything in the queue of requests maintained in generic_make_request(). An easy way to achieve this would be to use a last-in-first-out stack instead of a queue. However this will change the order of consecutive bios submitted by a make_request_fn, which could have unexpected consequences. Instead we take a slightly more complex approach. A fresh queue is created for each call to a make_request_fn. After it completes, any bios for a different device are placed on the front of the main queue, followed by any bios for the same device, followed by all bios that were already on the queue before the make_request_fn was called. This provides the depth-first approach without reordering bios on the same level. This, by itself, it not enough to remove all deadlocks. It just makes it possible for drivers to take the extra step required themselves. To avoid deadlocks, drivers must never risk waiting for a request after submitting one to generic_make_request. This includes never allocing from a mempool twice in the one call to a make_request_fn. A common pattern in drivers is to call bio_split() in a loop, handling the first part and then looping around to possibly split the next part. Instead, a driver that finds it needs to split a bio should queue (with generic_make_request) the second part, handle the first part, and then return. The new code in generic_make_request will ensure the requests to underlying bios are processed first, then the second bio that was split off. If it splits again, the same process happens. In each case one bio will be completely handled before the next one is attempted. With this is place, it should be possible to disable the punt_bios_to_recover() recovery thread for many block devices, and eventually it may be possible to remove it completely. Ref: http://www.spinics.net/lists/raid/msg54680.html Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com> Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NeilBrown <neilb@suse.com> --- block/blk-core.c | 25 +++++++++++++++++++++---- 1 file changed, 21 insertions(+), 4 deletions(-) Changes since v1: - merge code improvements from Jack Wang - more edits to changelog comment - add Ref: link. - Add some lists to Cc, that should have been there the first time. diff --git a/block/blk-core.c b/block/blk-core.c index b9e857f4afe8..9520e82aa78c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2018,17 +2018,34 @@ blk_qc_t generic_make_request(struct bio *bio) struct request_queue *q = bdev_get_queue(bio->bi_bdev); if (likely(blk_queue_enter(q, false) == 0)) { + struct bio_list hold; + struct bio_list lower, same; + + /* Create a fresh bio_list for all subordinate requests */ + hold = bio_list_on_stack; + bio_list_init(&bio_list_on_stack); ret = q->make_request_fn(q, bio); blk_queue_exit(q); - bio = bio_list_pop(current->bio_list); + /* sort new bios into those for a lower level + * and those for the same level + */ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) + if (q == bdev_get_queue(bio->bi_bdev)) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack, &lower); + bio_list_merge(&bio_list_on_stack, &same); + bio_list_merge(&bio_list_on_stack, &hold); } else { - struct bio *bio_next = bio_list_pop(current->bio_list); - bio_io_error(bio); - bio = bio_next; } + bio = bio_list_pop(current->bio_list); } while (bio); current->bio_list = NULL; /* deactivate */ -- 2.12.0 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --]
next prev parent reply other threads:[~2017-03-07 20:38 UTC|newest] Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top 2017-03-03 5:14 [PATCH] blk: improve order of bio handling in generic_make_request() NeilBrown 2017-03-03 9:28 ` Jack Wang 2017-03-06 4:40 ` NeilBrown 2017-03-06 9:43 ` Jack Wang 2017-03-07 15:46 ` Pavel Machek 2017-03-07 15:53 ` Jack Wang 2017-03-07 16:21 ` Jens Axboe 2017-03-06 20:18 ` Jens Axboe 2017-03-07 8:49 ` Jack Wang 2017-03-07 16:52 ` Mike Snitzer 2017-03-07 17:05 ` Jens Axboe 2017-03-07 17:14 ` Mike Snitzer 2017-03-07 20:29 ` NeilBrown 2017-03-07 23:01 ` Mike Snitzer 2017-03-08 16:40 ` Mikulas Patocka 2017-03-08 17:15 ` Lars Ellenberg 2017-03-09 6:08 ` NeilBrown 2017-03-08 11:46 ` Lars Ellenberg 2017-03-07 20:38 ` NeilBrown [this message] 2017-03-07 20:38 ` [PATCH v2] " NeilBrown 2017-03-10 4:32 ` NeilBrown 2017-03-10 4:33 ` [PATCH 1/5 v3] " NeilBrown 2017-03-10 4:34 ` [PATCH 2/5] blk: remove bio_set arg from blk_queue_split() NeilBrown 2017-03-10 4:35 ` [PATCH 3/5] blk: make the bioset rescue_workqueue optional NeilBrown 2017-03-10 4:36 ` [PATCH 4/5] blk: use non-rescuing bioset for q->bio_split NeilBrown 2017-03-10 4:37 ` [PATCH 5/5] block_dev: make blkdev_dio_pool a non-rescuing bioset NeilBrown 2017-03-10 4:38 ` [PATCH v2] blk: improve order of bio handling in generic_make_request() Jens Axboe 2017-03-10 4:40 ` Jens Axboe 2017-03-10 5:19 ` NeilBrown 2017-03-10 12:34 ` Lars Ellenberg 2017-03-10 14:38 ` Mike Snitzer 2017-03-10 14:55 ` Mikulas Patocka 2017-03-10 15:07 ` Jack Wang 2017-03-10 15:35 ` Mike Snitzer 2017-03-10 18:51 ` Lars Ellenberg 2017-03-11 0:47 ` NeilBrown 2017-03-11 0:47 ` NeilBrown
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=87r328j00i.fsf@notabene.neil.brown.name \ --to=neilb@suse.com \ --cc=axboe@kernel.dk \ --cc=dm-devel@redhat.com \ --cc=jinpu.wang@profitbricks.com \ --cc=kent.overstreet@gmail.com \ --cc=lars.ellenberg@linbit.com \ --cc=linux-block@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-raid@vger.kernel.org \ --cc=mpatocka@redhat.com \ --cc=pavel@ucw.cz \ --cc=snitzer@redhat.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.