All of lore.kernel.org
 help / color / mirror / Atom feed
From: 王金浦 <jinpuwang@gmail.com>
To: Mike Snitzer <snitzer@redhat.com>
Cc: NeilBrown <neilb@suse.com>, Mikulas Patocka <mpatocka@redhat.com>,
	Jack Wang <jack.wang.usish@gmail.com>,
	Lars Ellenberg <lars.ellenberg@linbit.com>,
	Jens Axboe <axboe@kernel.dk>,
	linux-raid <linux-raid@vger.kernel.org>,
	Michael Wang <yun.wang@profitbricks.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Jiri Kosina <jkosina@suse.cz>, Ming Lei <ming.lei@canonical.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Zheng Liu <gnehzuil.liu@gmail.com>,
	linux-block@vger.kernel.org, Takashi Iwai <tiwai@suse.de>,
	"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Alasdair Kergon <agk@redhat.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Keith Busch <keith.busch@intel.com>,
	device-mapper development <dm-devel@redhat.>
Subject: Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
Date: Thu, 5 Jan 2017 11:54:00 +0100	[thread overview]
Message-ID: <CAD9gYJ+U7drRjswfOf310v2aSgUKAnYT4STJcfxwqPhBzXEccg@mail.gmail.com> (raw)
In-Reply-To: <20170104185046.GA982@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4392 bytes --]

2017-01-04 19:50 GMT+01:00 Mike Snitzer <snitzer@redhat.com>:
> On Wed, Jan 04 2017 at 12:12am -0500,
> NeilBrown <neilb@suse.com> wrote:
>
>> On Tue, Jan 03 2017, Jack Wang wrote:
>>
>> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
>> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
>> >>> Dear Maintainers
>> >>>
>> >>> I'd like to ask for the status of this patch since we hit the
>> >>> issue too during our testing on md raid1.
>> >>>
>> >>> Split remainder bio_A was queued ahead, following by bio_B for
>> >>> lower device, at this moment raid start freezing, the loop take
>> >>> out bio_A firstly and deliver it, which will hung since raid is
>> >>> freezing, while the freezing never end since it waiting for
>> >>> bio_B to finish, and bio_B is still on the queue, waiting for
>> >>> bio_A to finish...
>> >>>
>> >>> We're looking for a good solution and we found this patch
>> >>> already progressed a lot, but we can't find it on linux-next,
>> >>> so we'd like to ask are we still planning to have this fix
>> >>> in upstream?
>> >>
>> >> I don't see why not, I'd even like to have it in older kernels,
>> >> but did not have the time and energy to push it.
>> >>
>> >> Thanks for the bump.
>> >>
>> >>         Lars
>> >>
>> > Hi folks,
>> >
>> > As Michael mentioned, we hit a bug this patch is trying to fix.
>> > Neil suggested another way to fix it.  I attached below.
>> > I personal prefer Neil's version as it's less code change, and straight forward.
>> >
>> > Could you share your comments, we can get one fix into mainline.
>> >
>> > Thanks,
>> > Jinpu
>> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
>> > From: NeilBrown <neilb@suse.com>
>> > Date: Wed, 14 Dec 2016 16:55:52 +0100
>> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
>> >
>> > When we call wait_barrier, we might have some bios waiting
>> > in current->bio_list, which prevents the array_freeze call to
>> > complete. Those can only be internal READs, which have already
>> > passed the wait_barrier call (thus incrementing nr_pending), but
>> > still were not submitted to the lower level, due to generic_make_request
>> > logic to avoid recursive calls. In such case, we have a deadlock:
>> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
>> > - internal READ bios will not be submitted, thus freeze_array will
>> > never completes.
>> >
>> > To fix this, modify generic_make_request to always sort bio_list_on_stack
>> > first with lowest level, then higher, until same level.
>> >
>> > Sent to linux-raid mail list:
>> > https://marc.info/?l=linux-raid&m=148232453107685&w=2
>> >
>>
>> This should probably also have
>>
>>   Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
>>
>> or something that, as I was building on Lars' ideas when I wrote this.
>>
>> It would also be worth noting in the description that this addresses
>> issues with dm and drbd as well as md.
>
> I never saw this patch but certainly like the relative simplicity of the
> solution when compared with other approaches taken, e.g. (5 topmost
> commits on this branch):
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
>
>> In fact, I think that with this patch in place, much of the need for the
>> rescue_workqueue won't exist any more.  I cannot promise it can be
>> removed completely, but it should be to hard to make it optional and
>> only enabled for those few block devices that will still need it.
>> The rescuer should only be needed for a bioset which can be allocated
>> From twice in the one call the ->make_request_fn.  This would include
>> raid0 for example, though raid0_make_reqest could be re-written to not
>> use a loop and to just call generic_make_request(bio) if bio != split.
>
> Mikulas, would you be willing to try the below patch with the
> dm-snapshot deadlock scenario and report back on whether it fixes that?
>
> Patch below looks to be the same as here:
> https://marc.info/?l=linux-raid&m=148232453107685&q=p3
>
> Neil and/or others if that isn't the patch that should be tested please
> provide a pointer to the latest.
>
> Thanks,
> Mike

Thanks Mike,

I've rebased the patch on to Linux-4.10-rc2, and updated the
description as Neil suggested.
If Mikulas get possitive feedback, then we can go with it.

Cheers,
Jinpu

[-- Attachment #2: 0001-block-fix-deadlock-between-freeze_array-and-wait_bar.patch --]
[-- Type: text/x-patch, Size: 2467 bytes --]

From 4ffaefb719c129ed51f9fcb235b945caf56de8d1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting
in current->bio_list, which prevents the array_freeze call to
complete. Those can only be internal READs, which have already
passed the wait_barrier call (thus incrementing nr_pending), but
still were not submitted to the lower level, due to generic_make_request
logic to avoid recursive calls. In such case, we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, thus freeze_array will
never completes.

To fix this, modify generic_make_request to always sort bio_list_on_stack
first with lowest level, then higher, until same level.

This would address issuses with dm and drbd as well as md.

Sent to linux-raid mail list:
https://marc.info/?l=linux-raid&m=148232453107685&w=2

Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/blk-core.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..2f74129 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2019,9 +2019,30 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
+
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 
 			bio = bio_list_pop(current->bio_list);
 		} else {
-- 
2.7.4


WARNING: multiple messages have this Message-ID (diff)
From: 王金浦 <jinpuwang@gmail.com>
To: Mike Snitzer <snitzer@redhat.com>
Cc: NeilBrown <neilb@suse.com>, Mikulas Patocka <mpatocka@redhat.com>,
	Jack Wang <jack.wang.usish@gmail.com>,
	Lars Ellenberg <lars.ellenberg@linbit.com>,
	Jens Axboe <axboe@kernel.dk>,
	linux-raid <linux-raid@vger.kernel.org>,
	Michael Wang <yun.wang@profitbricks.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Jiri Kosina <jkosina@suse.cz>, Ming Lei <ming.lei@canonical.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Zheng Liu <gnehzuil.liu@gmail.com>,
	linux-block@vger.kernel.org, Takashi Iwai <tiwai@suse.de>,
	"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Alasdair Kergon <agk@redhat.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Keith Busch <keith.busch@intel.com>,
	device-mapper development <dm-devel@redhat.com>,
	Shaohua Li <shli@kernel.org>,
	Kent Overstreet <kent.overstreet@gmail.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Roland Kammerer <roland.kammerer@linbit.com>,
	Jinpu Wang <jinpu.wang@profitbricks.com>
Subject: Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
Date: Thu, 5 Jan 2017 11:54:00 +0100	[thread overview]
Message-ID: <CAD9gYJ+U7drRjswfOf310v2aSgUKAnYT4STJcfxwqPhBzXEccg@mail.gmail.com> (raw)
In-Reply-To: <20170104185046.GA982@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4392 bytes --]

2017-01-04 19:50 GMT+01:00 Mike Snitzer <snitzer@redhat.com>:
> On Wed, Jan 04 2017 at 12:12am -0500,
> NeilBrown <neilb@suse.com> wrote:
>
>> On Tue, Jan 03 2017, Jack Wang wrote:
>>
>> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
>> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
>> >>> Dear Maintainers
>> >>>
>> >>> I'd like to ask for the status of this patch since we hit the
>> >>> issue too during our testing on md raid1.
>> >>>
>> >>> Split remainder bio_A was queued ahead, following by bio_B for
>> >>> lower device, at this moment raid start freezing, the loop take
>> >>> out bio_A firstly and deliver it, which will hung since raid is
>> >>> freezing, while the freezing never end since it waiting for
>> >>> bio_B to finish, and bio_B is still on the queue, waiting for
>> >>> bio_A to finish...
>> >>>
>> >>> We're looking for a good solution and we found this patch
>> >>> already progressed a lot, but we can't find it on linux-next,
>> >>> so we'd like to ask are we still planning to have this fix
>> >>> in upstream?
>> >>
>> >> I don't see why not, I'd even like to have it in older kernels,
>> >> but did not have the time and energy to push it.
>> >>
>> >> Thanks for the bump.
>> >>
>> >>         Lars
>> >>
>> > Hi folks,
>> >
>> > As Michael mentioned, we hit a bug this patch is trying to fix.
>> > Neil suggested another way to fix it.  I attached below.
>> > I personal prefer Neil's version as it's less code change, and straight forward.
>> >
>> > Could you share your comments, we can get one fix into mainline.
>> >
>> > Thanks,
>> > Jinpu
>> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
>> > From: NeilBrown <neilb@suse.com>
>> > Date: Wed, 14 Dec 2016 16:55:52 +0100
>> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
>> >
>> > When we call wait_barrier, we might have some bios waiting
>> > in current->bio_list, which prevents the array_freeze call to
>> > complete. Those can only be internal READs, which have already
>> > passed the wait_barrier call (thus incrementing nr_pending), but
>> > still were not submitted to the lower level, due to generic_make_request
>> > logic to avoid recursive calls. In such case, we have a deadlock:
>> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
>> > - internal READ bios will not be submitted, thus freeze_array will
>> > never completes.
>> >
>> > To fix this, modify generic_make_request to always sort bio_list_on_stack
>> > first with lowest level, then higher, until same level.
>> >
>> > Sent to linux-raid mail list:
>> > https://marc.info/?l=linux-raid&m=148232453107685&w=2
>> >
>>
>> This should probably also have
>>
>>   Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
>>
>> or something that, as I was building on Lars' ideas when I wrote this.
>>
>> It would also be worth noting in the description that this addresses
>> issues with dm and drbd as well as md.
>
> I never saw this patch but certainly like the relative simplicity of the
> solution when compared with other approaches taken, e.g. (5 topmost
> commits on this branch):
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
>
>> In fact, I think that with this patch in place, much of the need for the
>> rescue_workqueue won't exist any more.  I cannot promise it can be
>> removed completely, but it should be to hard to make it optional and
>> only enabled for those few block devices that will still need it.
>> The rescuer should only be needed for a bioset which can be allocated
>> From twice in the one call the ->make_request_fn.  This would include
>> raid0 for example, though raid0_make_reqest could be re-written to not
>> use a loop and to just call generic_make_request(bio) if bio != split.
>
> Mikulas, would you be willing to try the below patch with the
> dm-snapshot deadlock scenario and report back on whether it fixes that?
>
> Patch below looks to be the same as here:
> https://marc.info/?l=linux-raid&m=148232453107685&q=p3
>
> Neil and/or others if that isn't the patch that should be tested please
> provide a pointer to the latest.
>
> Thanks,
> Mike

Thanks Mike,

I've rebased the patch on to Linux-4.10-rc2, and updated the
description as Neil suggested.
If Mikulas get possitive feedback, then we can go with it.

Cheers,
Jinpu

[-- Attachment #2: 0001-block-fix-deadlock-between-freeze_array-and-wait_bar.patch --]
[-- Type: text/x-patch, Size: 2467 bytes --]

From 4ffaefb719c129ed51f9fcb235b945caf56de8d1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting
in current->bio_list, which prevents the array_freeze call to
complete. Those can only be internal READs, which have already
passed the wait_barrier call (thus incrementing nr_pending), but
still were not submitted to the lower level, due to generic_make_request
logic to avoid recursive calls. In such case, we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, thus freeze_array will
never completes.

To fix this, modify generic_make_request to always sort bio_list_on_stack
first with lowest level, then higher, until same level.

This would address issuses with dm and drbd as well as md.

Sent to linux-raid mail list:
https://marc.info/?l=linux-raid&m=148232453107685&w=2

Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/blk-core.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..2f74129 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2019,9 +2019,30 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
+
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 
 			bio = bio_list_pop(current->bio_list);
 		} else {
-- 
2.7.4


  reply	other threads:[~2017-01-05 10:54 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-08 15:04 [PATCH 0/1] block: fix blk_queue_split() resource exhaustion Lars Ellenberg
2016-07-08 15:04 ` [PATCH 1/1] " Lars Ellenberg
2016-07-08 18:49   ` Mike Snitzer
2016-07-11 14:13     ` Lars Ellenberg
2016-07-11 14:10   ` [PATCH v2 " Lars Ellenberg
2016-07-12  2:55     ` [dm-devel] " NeilBrown
2016-07-13  2:18       ` Eric Wheeler
2016-07-13  2:32         ` Mike Snitzer
2016-07-19  9:00           ` Lars Ellenberg
2016-07-19  9:00             ` Lars Ellenberg
2016-07-21 22:53             ` Eric Wheeler
2016-07-25 20:39               ` Jeff Moyer
2016-08-11  4:16             ` Eric Wheeler
2017-01-07 19:56             ` Lars Ellenberg
2017-01-07 19:56               ` Lars Ellenberg
2016-09-16 20:14         ` [dm-devel] " Matthias Ferdinand
2016-09-18 23:10           ` Eric Wheeler
2016-09-19 20:43             ` Matthias Ferdinand
2016-09-21 21:08               ` bcache: discard BUG (was: [dm-devel] [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion) Eric Wheeler
2016-12-23  8:49     ` [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion Michael Wang
2016-12-23 11:45       ` Lars Ellenberg
2016-12-23 11:45         ` Lars Ellenberg
2017-01-02 14:33         ` [dm-devel] " Jack Wang
2017-01-02 14:33           ` Jack Wang
2017-01-04  5:12           ` NeilBrown
2017-01-04  5:12             ` [dm-devel] " NeilBrown
2017-01-04 18:50             ` Mike Snitzer
2017-01-04 18:50               ` Mike Snitzer
2017-01-05 10:54               ` 王金浦 [this message]
2017-01-05 10:54                 ` 王金浦
2017-01-06 16:50               ` Mikulas Patocka
2017-01-06 16:50                 ` Mikulas Patocka
2017-01-06 17:34                 ` Mikulas Patocka
2017-01-06 17:34                   ` Mikulas Patocka
2017-01-06 17:34                   ` Mikulas Patocka
2017-01-06 19:52                   ` Mike Snitzer
2017-01-06 19:52                     ` Mike Snitzer
2017-01-06 23:01                     ` NeilBrown
2017-01-06 23:01                       ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAD9gYJ+U7drRjswfOf310v2aSgUKAnYT4STJcfxwqPhBzXEccg@mail.gmail.com \
    --to=jinpuwang@gmail.com \
    --cc=agk@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=dm-devel@redhat. \
    --cc=gnehzuil.liu@gmail.com \
    --cc=jack.wang.usish@gmail.com \
    --cc=jkosina@suse.cz \
    --cc=keith.busch@intel.com \
    --cc=lars.ellenberg@linbit.com \
    --cc=linux-bcache@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=ming.lei@canonical.com \
    --cc=mingo@redhat.com \
    --cc=mpatocka@redhat.com \
    --cc=neilb@suse.com \
    --cc=peterz@infradead.org \
    --cc=snitzer@redhat.com \
    --cc=tiwai@suse.de \
    --cc=yun.wang@profitbricks.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.