linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@redhat.com>
To: NeilBrown <neilb@suse.com>, Mikulas Patocka <mpatocka@redhat.com>
Cc: Jack Wang <jack.wang.usish@gmail.com>,
	Lars Ellenberg <lars.ellenberg@linbit.com>,
	Jens Axboe <axboe@kernel.dk>,
	linux-raid <linux-raid@vger.kernel.org>,
	Michael Wang <yun.wang@profitbricks.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Jiri Kosina <jkosina@suse.cz>, Ming Lei <ming.lei@canonical.com>,
	linux-kernel@vger.kernel.org, Zheng Liu <gnehzuil.liu@gmail.com>,
	linux-block@vger.kernel.org, Takashi Iwai <tiwai@suse.de>,
	"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Alasdair Kergon <agk@redhat.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Keith Busch <keith.busch@intel.com>,
	device-mapper development <dm-devel@redhat.com>,
	Shaohua Li <shli@kernel.org>,
	Kent Overstreet <kent.overstreet@gmail.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Roland Kammerer <roland.kammerer@linbit.com>
Subject: Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
Date: Wed, 4 Jan 2017 13:50:46 -0500	[thread overview]
Message-ID: <20170104185046.GA982@redhat.com> (raw)
In-Reply-To: <878tqrmmqx.fsf@notabene.neil.brown.name>

On Wed, Jan 04 2017 at 12:12am -0500,
NeilBrown <neilb@suse.com> wrote:

> On Tue, Jan 03 2017, Jack Wang wrote:
> 
> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
> >>> Dear Maintainers
> >>>
> >>> I'd like to ask for the status of this patch since we hit the
> >>> issue too during our testing on md raid1.
> >>>
> >>> Split remainder bio_A was queued ahead, following by bio_B for
> >>> lower device, at this moment raid start freezing, the loop take
> >>> out bio_A firstly and deliver it, which will hung since raid is
> >>> freezing, while the freezing never end since it waiting for
> >>> bio_B to finish, and bio_B is still on the queue, waiting for
> >>> bio_A to finish...
> >>>
> >>> We're looking for a good solution and we found this patch
> >>> already progressed a lot, but we can't find it on linux-next,
> >>> so we'd like to ask are we still planning to have this fix
> >>> in upstream?
> >>
> >> I don't see why not, I'd even like to have it in older kernels,
> >> but did not have the time and energy to push it.
> >>
> >> Thanks for the bump.
> >>
> >>         Lars
> >>
> > Hi folks,
> >
> > As Michael mentioned, we hit a bug this patch is trying to fix.
> > Neil suggested another way to fix it.  I attached below.
> > I personal prefer Neil's version as it's less code change, and straight forward.
> >
> > Could you share your comments, we can get one fix into mainline.
> >
> > Thanks,
> > Jinpu
> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
> > From: NeilBrown <neilb@suse.com>
> > Date: Wed, 14 Dec 2016 16:55:52 +0100
> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
> >
> > When we call wait_barrier, we might have some bios waiting
> > in current->bio_list, which prevents the array_freeze call to
> > complete. Those can only be internal READs, which have already
> > passed the wait_barrier call (thus incrementing nr_pending), but
> > still were not submitted to the lower level, due to generic_make_request
> > logic to avoid recursive calls. In such case, we have a deadlock:
> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
> > - internal READ bios will not be submitted, thus freeze_array will
> > never completes.
> >
> > To fix this, modify generic_make_request to always sort bio_list_on_stack
> > first with lowest level, then higher, until same level.
> >
> > Sent to linux-raid mail list:
> > https://marc.info/?l=linux-raid&m=148232453107685&w=2
> >
> 
> This should probably also have
> 
>   Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> 
> or something that, as I was building on Lars' ideas when I wrote this.
> 
> It would also be worth noting in the description that this addresses
> issues with dm and drbd as well as md.

I never saw this patch but certainly like the relative simplicity of the
solution when compared with other approaches taken, e.g. (5 topmost
commits on this branch):
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

> In fact, I think that with this patch in place, much of the need for the
> rescue_workqueue won't exist any more.  I cannot promise it can be
> removed completely, but it should be to hard to make it optional and
> only enabled for those few block devices that will still need it.
> The rescuer should only be needed for a bioset which can be allocated
> From twice in the one call the ->make_request_fn.  This would include
> raid0 for example, though raid0_make_reqest could be re-written to not
> use a loop and to just call generic_make_request(bio) if bio != split.

Mikulas, would you be willing to try the below patch with the
dm-snapshot deadlock scenario and report back on whether it fixes that?

Patch below looks to be the same as here:
https://marc.info/?l=linux-raid&m=148232453107685&q=p3

Neil and/or others if that isn't the patch that should be tested please
provide a pointer to the latest.

Thanks,
Mike

> > Suggested-by: NeilBrown <neilb@suse.com>
> > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
> > ---
> >  block/blk-core.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 9e3ac56..47ef373 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
> >  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> >  
> >  		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
> > +			struct bio_list lower, same, hold;
> > +
> > +			/* Create a fresh bio_list for all subordinate requests */
> > +			bio_list_init(&hold);
> > +			bio_list_merge(&hold, &bio_list_on_stack);
> > +			bio_list_init(&bio_list_on_stack);
> >  
> >  			ret = q->make_request_fn(q, bio);
> >  
> >  			blk_queue_exit(q);
> > +			/* sort new bios into those for a lower level
> > +			 * and those for the same level
> > +			 */
> > +			bio_list_init(&lower);
> > +			bio_list_init(&same);
> > +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> > +				if (q == bdev_get_queue(bio->bi_bdev))
> > +					bio_list_add(&same, bio);
> > +				else
> > +					bio_list_add(&lower, bio);
> > +			/* now assemble so we handle the lowest level first */
> > +			bio_list_merge(&bio_list_on_stack, &lower);
> > +			bio_list_merge(&bio_list_on_stack, &same);
> > +			bio_list_merge(&bio_list_on_stack, &hold);
> >  
> >  			bio = bio_list_pop(current->bio_list);
> >  		} else {
> > -- 
> > 2.7.4

  reply	other threads:[~2017-01-04 18:53 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-08 15:04 [PATCH 0/1] block: fix blk_queue_split() resource exhaustion Lars Ellenberg
2016-07-08 15:04 ` [PATCH 1/1] " Lars Ellenberg
2016-07-08 18:49   ` Mike Snitzer
2016-07-11 14:13     ` Lars Ellenberg
2016-07-11 14:10   ` [PATCH v2 " Lars Ellenberg
2016-07-12  2:55     ` [dm-devel] " NeilBrown
2016-07-13  2:18       ` Eric Wheeler
2016-07-13  2:32         ` Mike Snitzer
2016-07-19  9:00           ` Lars Ellenberg
2016-07-21 22:53             ` Eric Wheeler
2016-07-25 20:39               ` Jeff Moyer
2016-08-11  4:16             ` Eric Wheeler
2017-01-07 19:56             ` Lars Ellenberg
2016-12-23  8:49     ` Michael Wang
2016-12-23 11:45       ` Lars Ellenberg
2017-01-02 14:33         ` [dm-devel] " Jack Wang
2017-01-04  5:12           ` NeilBrown
2017-01-04 18:50             ` Mike Snitzer [this message]
2017-01-05 10:54               ` 王金浦
2017-01-06 16:50               ` Mikulas Patocka
2017-01-06 17:34                 ` Mikulas Patocka
2017-01-06 19:52                   ` Mike Snitzer
2017-01-06 23:01                     ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170104185046.GA982@redhat.com \
    --to=snitzer@redhat.com \
    --cc=agk@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=dm-devel@redhat.com \
    --cc=gnehzuil.liu@gmail.com \
    --cc=jack.wang.usish@gmail.com \
    --cc=jkosina@suse.cz \
    --cc=keith.busch@intel.com \
    --cc=kent.overstreet@gmail.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=lars.ellenberg@linbit.com \
    --cc=linux-bcache@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=ming.lei@canonical.com \
    --cc=mingo@redhat.com \
    --cc=mpatocka@redhat.com \
    --cc=neilb@suse.com \
    --cc=peterz@infradead.org \
    --cc=roland.kammerer@linbit.com \
    --cc=shli@kernel.org \
    --cc=tiwai@suse.de \
    --cc=yun.wang@profitbricks.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).