From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751209AbcFXO1T (ORCPT ); Fri, 24 Jun 2016 10:27:19 -0400 Received: from zimbra13.linbit.com ([212.69.166.240]:45153 "EHLO zimbra13.linbit.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750972AbcFXO1R (ORCPT ); Fri, 24 Jun 2016 10:27:17 -0400 Date: Fri, 24 Jun 2016 16:27:11 +0200 From: Lars Ellenberg To: Ming Lei Cc: linux-block@vger.kernel.org, Roland Kammerer , Jens Axboe , NeilBrown , Kent Overstreet , Shaohua Li , Alasdair Kergon , Mike Snitzer , "open list:DEVICE-MAPPER (LVM)" , Ingo Molnar , Peter Zijlstra , Takashi Iwai , Jiri Kosina , Zheng Liu , Keith Busch , "Martin K. Petersen" , "Kirill A. Shutemov" , Linux Kernel Mailing List , "open list:BCACHE (BLOCK LAYER CACHE)" , "open list:SOFTWARE RAID (Multiple Disks) SUPPORT" Subject: Re: [RFC] block: fix blk_queue_split() resource exhaustion Message-ID: <20160624142711.GF3239@soda.linbit> References: <1466583730-28595-1-git-send-email-lars.ellenberg@linbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote: > > > > This is not a theoretical problem. > > At least int DRBD, and an unfortunately high IO concurrency wrt. the > > "max-buffers" setting, without this patch we have a reproducible deadlock. > > Is there any log about the deadlock? And is there any lockdep warning > if it is enabled? In DRBD, to avoid potentially very long internal queues as we wait for our replication peer device and local backend, we limit the number of in-flight bios we accept, and block in our ->make_request_fn() if that number exceeds a configured watermark ("max-buffers"). Works fine, as long as we could assume that once our make_request_fn() returns, any bios we "recursively" submitted against the local backend would be dispatched. Which used to be the case. commit 54efd50 block: make generic_make_request handle arbitrarily sized bios introduced blk_queue_split(), which will split any bio that is violating the queue limits into a smaller piece to be processed right away, and queue the "rest" on current->bio_list to be processed by the iteration in generic_make_request() once the current ->make_request_fn() returns. Before that, any user was supposed to go through bio_add_page(), which would call our merge bvec function, and that should already be sufficient to not violate queue limits. Previously, typically the next in line bio to be processed would be the cloned one we passed down to our backend device, which in case of further stacking devices (e.g. logical volume, raid1 or similar) may again push further bios on that list. All of which would simply be processed in turn. Now, with blk_queue_split(), the next-in-line bio would be the next piece of the "too big" original bio, so we end up calling several times into our ->make_request_fn() without even dispatching one bio to the backend. With highly concurrent IO and/or very small "max-buffers" setting, this can deadlock, where the current submit path would wait for the completion of the bios supposedly already passed to the backend, but which actually are not even dispatched yet, rotting away on current->bio_list. I could code around that in various ways inside the DRBD make_request_fn, or always dispatch the backend bios to a different (thread or work_queue) context, but I'd rather have the distinction between "recursively submitted" bios and "pushed back part of originally passed in bio" as implemented in this patch. Thanks, Lars > > I'm unsure if it could cause problems in md raid in some corner cases > > or when a rebuild/scrub/... starts in a "bad" moment. > > > > It also may increase "stress" on any involved bio_set, > > because it may increase the number of bios that are allocated > > before the first of them is actually processed, causing more > > frequent calls of punt_bios_to_rescuer(). > > > > Alternatively, > > a simple one-line change to generic_make_request() would also "fix" it, > > by processing all recursive bios in LIFO order. > > But that may change the order in which bios reach the "physical" layer, > > in case they are in fact split into many smaller pieces, for example in > > a striping setup, which may have other performance implications. > > > > | diff --git a/block/blk-core.c b/block/blk-core.c > > | index 2475b1c7..a5623f6 100644 > > | --- a/block/blk-core.c > > | +++ b/block/blk-core.c > > | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio) > > | * should be added at the tail > > | */ > > | if (current->bio_list) { > > | - bio_list_add(current->bio_list, bio); > > | + bio_list_add_head(current->bio_list, bio); > > | goto out; > > | } > > diff --git a/block/blk-core.c b/block/blk-core.c > > index 2475b1c7..f03ff4c 100644 > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -2031,7 +2031,7 @@ end_io: > > */ > > blk_qc_t generic_make_request(struct bio *bio) > > { > > - struct bio_list bio_list_on_stack; > > + struct recursion_to_iteration_bio_lists bio_lists_on_stack; > > blk_qc_t ret = BLK_QC_T_NONE; > > > > if (!generic_make_request_checks(bio)) > > @@ -2040,15 +2040,18 @@ blk_qc_t generic_make_request(struct bio *bio) > > - if (current->bio_list) { > > - bio_list_add(current->bio_list, bio); > > + if (current->bio_lists) { > > + bio_list_add(¤t->bio_lists->recursion, bio); > > goto out; > > } > > > > @@ -2067,8 +2070,9 @@ blk_qc_t generic_make_request(struct bio *bio) > > * bio_list, and call into ->make_request() again. > > */ > > BUG_ON(bio->bi_next); > > - bio_list_init(&bio_list_on_stack); > > - current->bio_list = &bio_list_on_stack; > > + bio_list_init(&bio_lists_on_stack.recursion); > > + bio_list_init(&bio_lists_on_stack.remainder); > > + current->bio_lists = &bio_lists_on_stack; > > do { > > struct request_queue *q = bdev_get_queue(bio->bi_bdev); > > > > @@ -2076,16 +2080,14 @@ blk_qc_t generic_make_request(struct bio *bio) > > ret = q->make_request_fn(q, bio); > > > > blk_queue_exit(q); > > - > > - bio = bio_list_pop(current->bio_list); > > } else { > > - struct bio *bio_next = bio_list_pop(current->bio_list); > > - > > bio_io_error(bio); > > - bio = bio_next; > > } > > + bio = bio_list_pop(¤t->bio_lists->recursion); > > + if (!bio) > > + bio = bio_list_pop(¤t->bio_lists->remainder); > > } while (bio); > > - current->bio_list = NULL; /* deactivate */ > > + current->bio_lists = NULL; /* deactivate */ > > > > out: > > return ret; > > diff --git a/block/blk-merge.c b/block/blk-merge.c > > index 2613531..8da0c22 100644 > > --- a/block/blk-merge.c > > +++ b/block/blk-merge.c > > @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio, > > struct bio *split, *res; > > unsigned nsegs; > > > > + BUG_ON(!current->bio_lists); > > if ((*bio)->bi_rw & REQ_DISCARD) > > split = blk_bio_discard_split(q, *bio, bs, &nsegs); > > else if ((*bio)->bi_rw & REQ_WRITE_SAME) > > @@ -190,7 +191,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio, > > > > bio_chain(split, *bio); > > trace_block_split(q, split, (*bio)->bi_iter.bi_sector); > > - generic_make_request(*bio); > > + bio_list_add_head(¤t->bio_lists->remainder, *bio); > > *bio = split;