From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751907AbdCCJ36 (ORCPT <rfc822;w@1wt.eu>);
        Fri, 3 Mar 2017 04:29:58 -0500
Received: from mail-wm0-f44.google.com ([74.125.82.44]:36147 "EHLO
        mail-wm0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751340AbdCCJ32 (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Fri, 3 Mar 2017 04:29:28 -0500
Subject: Re: [PATCH] blk: improve order of bio handling in
 generic_make_request()
To: NeilBrown <neilb@suse.com>, Jens Axboe <axboe@kernel.dk>
References: <87h93blz6g.fsf@notabene.neil.brown.name>
Cc: LKML <linux-kernel@vger.kernel.org>,
        Lars Ellenberg <lars.ellenberg@linbit.com>,
        Kent Overstreet <kent.overstreet@gmail.com>,
        Pavel Machek <pavel@ucw.cz>, Mike Snitzer <snitzer@redhat.com>,
        Mikulas Patocka <mpatocka@redhat.com>
From: Jack Wang <jinpu.wang@profitbricks.com>
Message-ID: <71562c2c-97f4-9a0a-32ec-30e0702ca575@profitbricks.com>
Date: Fri, 3 Mar 2017 10:28:44 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.7.0
MIME-Version: 1.0
In-Reply-To: <87h93blz6g.fsf@notabene.neil.brown.name>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


On 03.03.2017 06:14, NeilBrown wrote:
> 
> [ Hi Jens,
>   you might have seen assorted email threads recently about
>   deadlocks, particular in dm-snap or md/raid1/10.  Also about
>   the excess of rescuer threads.
>   I think a big part of the problem is my ancient improvement
>   to generic_make_request to queue bios and handle them in
>   a strict FIFO order.  As described below, that can cause
>   problems which individual block devices cannot fix themselves
>   without punting to various threads.
>   This patch does not fix everything, but provides a basis that
>   drives can build on to create dead-lock free solutions without
>   excess threads.
>   If you accept this, I will look into improving at least md
>   and bio_alloc_set() to be less dependant on rescuer threads.
>   Thanks,
>   NeilBrown
>  ]
> 
> 
> To avoid recursion on the kernel stack when stacked block devices
> are in use, generic_make_request() will, when called recursively,
> queue new requests for later handling.  They will be handled when the
> make_request_fn for the current bio completes.
> 
> If any bios are submitted by a make_request_fn, these will ultimately
> handled seqeuntially.  If the handling of one of those generates
> further requests, they will be added to the end of the queue.
> 
> This strict first-in-first-out behaviour can lead to deadlocks in
> various ways, normally because a request might need to wait for a
> previous request to the same device to complete.  This can happen when
> they share a mempool, and can happen due to interdependencies
> particular to the device.  Both md and dm have examples where this happens.
> 
> These deadlocks can be erradicated by more selective ordering of bios.
> Specifically by handling them in depth-first order.  That is: when the
> handling of one bio generates one or more further bios, they are
> handled immediately after the parent, before any siblings of the
> parent.  That way, when generic_make_request() calls make_request_fn
> for some particular device, it we can be certain that all previously
> submited request for that device have been completely handled and are
> not waiting for anything in the queue of requests maintained in
> generic_make_request().
> 
> An easy way to achieve this would be to use a last-in-first-out stack
> instead of a queue.  However this will change the order of consecutive
> bios submitted by a make_request_fn, which could have unexpected consequences.
> Instead we take a slightly more complex approach.
> A fresh queue is created for each call to a make_request_fn.  After it completes,
> any bios for a different device are placed on the front of the main queue, followed
> by any bios for the same device, followed by all bios that were already on
> the queue before the make_request_fn was called.
> This provides the depth-first approach without reordering bios on the same level.
> 
> This, by itself, it not enough to remove the deadlocks.  It just makes
> it possible for drivers to take the extra step required themselves.
> 
> To avoid deadlocks, drivers must never risk waiting for a request
> after submitting one to generic_make_request.  This includes never
> allocing from a mempool twice in the one call to a make_request_fn.
> 
> A common pattern in drivers is to call bio_split() in a loop, handling
> the first part and then looping around to possibly split the next part.
> Instead, a driver that finds it needs to split a bio should queue
> (with generic_make_request) the second part, handle the first part,
> and then return.  The new code in generic_make_request will ensure the
> requests to underlying bios are processed first, then the second bio
> that was split off.  If it splits again, the same process happens.  In
> each case one bio will be completely handled before the next one is attempted.
> 
> With this is place, it should be possible to disable the
> punt_bios_to_recover() recovery thread for many block devices, and
> eventually it may be possible to remove it completely.
> 
> Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
> Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  block/blk-core.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b9e857f4afe8..ef55f210dd7c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>  		if (likely(blk_queue_enter(q, false) == 0)) {
> +			struct bio_list hold;
> +			struct bio_list lower, same;
> +
> +			/* Create a fresh bio_list for all subordinate requests */
> +			bio_list_init(&hold);
> +			bio_list_merge(&hold, &bio_list_on_stack);
> +			bio_list_init(&bio_list_on_stack);
>  			ret = q->make_request_fn(q, bio);
>  
>  			blk_queue_exit(q);
>  
> +			/* sort new bios into those for a lower level
> +			 * and those for the same level
> +			 */
> +			bio_list_init(&lower);
> +			bio_list_init(&same);
> +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> +				if (q == bdev_get_queue(bio->bi_bdev))
> +					bio_list_add(&same, bio);
> +				else
> +					bio_list_add(&lower, bio);
> +			/* now assemble so we handle the lowest level first */
> +			bio_list_merge(&bio_list_on_stack, &lower);
> +			bio_list_merge(&bio_list_on_stack, &same);
> +			bio_list_merge(&bio_list_on_stack, &hold);
> +
>  			bio = bio_list_pop(current->bio_list);
>  		} else {
>  			struct bio *bio_next = bio_list_pop(current->bio_list);
> 

Thanks Neil for pushing the fix.

We can optimize generic_make_request a little bit:
- assign bio_list struct hold directly instead init and merge
- remove duplicate code

I think better to squash into your fix.
---
 block/blk-core.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3bc7202..b29b7e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 			struct bio_list lower, same, hold;
 
 			/* Create a fresh bio_list for all subordinate requests */
-			bio_list_init(&hold);
-			bio_list_merge(&hold, &bio_list_on_stack);
+			hold = bio_list_on_stack;
 			bio_list_init(&bio_list_on_stack);
 
 			ret = q->make_request_fn(q, bio);
@@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
 			bio_list_merge(&bio_list_on_stack, &lower);
 			bio_list_merge(&bio_list_on_stack, &same);
 			bio_list_merge(&bio_list_on_stack, &hold);
-
-			bio = bio_list_pop(current->bio_list);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(current->bio_list);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 
-- 

Regards,
Jack