All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Jinpu Wang <jinpu.wang@profitbricks.com>,
	NeilBrown <neilb@suse.com>, Jens Axboe <axboe@fb.com>
Subject: [PATCH 4.4 24/26] blk: improve order of bio handling in generic_make_request()
Date: Thu,  6 Apr 2017 10:38:46 +0200	[thread overview]
Message-ID: <20170406083605.895052061@linuxfoundation.org> (raw)
In-Reply-To: <20170406083604.362636628@linuxfoundation.org>

4.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: NeilBrown <neilb@suse.com>

commit 79bd99596b7305ab08109a8bf44a6a4511dbf1cd upstream.

To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling.  They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled seqeuntially.  If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete.  This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device.  Both md and dm have examples where this happens.

These deadlocks can be erradicated by more selective ordering of bios.
Specifically by handling them in depth-first order.  That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent.  That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submited requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue.  However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn.  After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, it not enough to remove all deadlocks.  It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request.  This includes never
allocing from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return.  The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off.  If it splits again, the same process happens.  In
each case one bio will be completely handled before the next one is attempted.

With this is place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
[jwang: backport to 4.4]
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 block/blk-core.c |   25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2063,18 +2063,33 @@ blk_qc_t generic_make_request(struct bio
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			hold = bio_list_on_stack;
+			bio_list_init(&bio_list_on_stack);
 
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
-
-			bio = bio_list_pop(current->bio_list);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(current->bio_list);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 

  parent reply	other threads:[~2017-04-06  9:35 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-06  8:38 [PATCH 4.4 00/26] 4.4.60-stable review Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 01/26] libceph: force GFP_NOIO for socket allocations Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 02/26] xen/setup: Dont relocate p2m over existing one Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 03/26] scsi: mpt3sas: fix hang on ata passthrough commands Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 04/26] scsi: sg: check length passed to SG_NEXT_CMD_LEN Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 05/26] scsi: libsas: fix ata xfer length Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 06/26] ALSA: seq: Fix race during FIFO resize Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 07/26] ALSA: hda - fix a problem for lineout on a Dell AIO machine Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 08/26] ASoC: atmel-classd: fix audio clock rate Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 09/26] ACPI: Fix incompatibility with mcount-based function graph tracing Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 10/26] ACPI: Do not create a platform_device for IOAPIC/IOxAPIC Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 11/26] tty/serial: atmel: fix race condition (TX+DMA) Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 12/26] tty/serial: atmel: fix TX path in atmel_console_write() Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 13/26] USB: fix linked-list corruption in rh_call_control() Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 16/26] mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd() Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 17/26] MIPS: Lantiq: Fix cascaded IRQ setup Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 22/26] KVM: kvm_io_bus_unregister_dev() should never fail Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 23/26] power: reset: at91-poweroff: timely shutdown LPDDR memories Greg Kroah-Hartman
2017-04-06 17:45   ` Ben Hutchings
2017-04-06 18:13     ` Alexandre Belloni
2017-04-06 18:22       ` Ben Hutchings
2017-04-06  8:38 ` Greg Kroah-Hartman [this message]
2017-04-06  8:38 ` [PATCH 4.4 25/26] blk: Ensure users for current->bio_list can see the full list Greg Kroah-Hartman
2017-04-06 18:17   ` Ben Hutchings
2017-04-07  8:33     ` Jinpu Wang
2017-04-07 12:45       ` Ben Hutchings
2017-04-07 13:03         ` Jinpu Wang
2017-04-07 13:37           ` Ben Hutchings
2017-04-08  7:52             ` Greg Kroah-Hartman
2017-04-06  8:38 ` [PATCH 4.4 26/26] padata: avoid race in reordering Greg Kroah-Hartman
2017-04-06 17:47 ` [PATCH 4.4 00/26] 4.4.60-stable review Shuah Khan
2017-04-06 21:53 ` Guenter Roeck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170406083605.895052061@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=axboe@fb.com \
    --cc=jinpu.wang@profitbricks.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=neilb@suse.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.