* [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
From: Michael Wang @ 2017-04-04 13:50 UTC
To: linux-raid, linux-kernel; +Cc: Shaohua Li, NeilBrown, Jinpu Wang

During testing we found that the sync read bio can go through
this path:

  md_do_sync()
    sync_request()
      generic_make_request()
        blk_queue_bio()
          blk_attempt_plug_merge()
            bio->bi_next CHAINED HERE

  ...

  raid1d()
    sync_request_write()
      fix_sync_read_error()
        if FailFast && Faulty
          bio->bi_end_io = end_sync_write
      generic_make_request()
        BUG_ON(bio->bi_next)

This requires all of the following conditions:
  * the bio was merged at some point
  * the read disk has FailFast enabled
  * the read disk is Faulty

Since the block layer does not reset 'bi_next' after the bio has
completed inside a request, we hit the BUG as shown above.

This patch simply resets bi_next before we reuse the bio.

Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
---
 drivers/md/raid1.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 		/* Don't try recovering from here - just fail it
 		 * ... unless it is the last working device of course */
 		md_error(mddev, rdev);
-		if (test_bit(Faulty, &rdev->flags))
+		if (test_bit(Faulty, &rdev->flags)) {
 			/* Don't try to read from here, but make sure
 			 * put_buf does it's thing
 			 */
 			bio->bi_end_io = end_sync_write;
+			bio->bi_next = NULL;
+		}
 	}
 
 	while(sectors) {
-- 
2.5.0
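The failure sequence described in this message can be modelled in userspace. The following is only a rough sketch: `struct mock_bio`, `mock_chain()`, `mock_complete()` and `mock_submit()` are hypothetical stand-ins, not kernel APIs, and `mock_submit()` returns -1 at the point where the real generic_make_request() would hit BUG_ON(bio->bi_next).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct bio. */
struct mock_bio {
	struct mock_bio *bi_next;	/* link used while the bio sits in a request */
	int submitted;			/* number of successful submissions */
};

/* Models a merge that leaves 'bio' chained to a neighbouring bio. */
static void mock_chain(struct mock_bio *bio, struct mock_bio *next)
{
	assert(bio != NULL);
	bio->bi_next = next;
}

/* Models completion: as described above, bi_next is NOT cleared here. */
static void mock_complete(struct mock_bio *bio)
{
	assert(bio != NULL);
	(void)bio;
}

/*
 * Models submission. Returns -1 where the real code would BUG on a
 * non-NULL bi_next, and 0 on success.
 */
static int mock_submit(struct mock_bio *bio)
{
	assert(bio != NULL);
	if (bio->bi_next != NULL)
		return -1;
	bio->submitted++;
	return 0;
}
```

In this model, a bio that was chained and then completed fails on resubmission until its bi_next is cleared, which is the effect the patch above relies on.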
* Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
From: NeilBrown @ 2017-04-04 22:17 UTC
To: Michael Wang, linux-raid, linux-kernel; +Cc: Shaohua Li, Jinpu Wang

On Tue, Apr 04 2017, Michael Wang wrote:

> During the testing we found the sync read bio can go through
> path:
>
>   md_do_sync()
>     sync_request()
>       generic_make_request()
>         blk_queue_bio()
>           blk_attempt_plug_merge()
>             bio->bi_next CHAINED HERE
>
>   ...
>
>   raid1d()
>     sync_request_write()
>       fix_sync_read_error()
>         if FailFast && Faulty
>           bio->bi_end_io = end_sync_write
>       generic_make_request()
>         BUG_ON(bio->bi_next)
>
> This need to meet the conditions:
>   * bio once merged
>   * read disk have FailFast enabled
>   * read disk is Faulty
>
> And since the block layer won't reset the 'bi_next' after bio
> is done inside request, we hit the BUG like that.
>
> This patch simply reset the bi_next before we reuse it.
>
> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
> ---
>  drivers/md/raid1.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7d67235..0554110 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>  		/* Don't try recovering from here - just fail it
>  		 * ... unless it is the last working device of course */
>  		md_error(mddev, rdev);
> -		if (test_bit(Faulty, &rdev->flags))
> +		if (test_bit(Faulty, &rdev->flags)) {
>  			/* Don't try to read from here, but make sure
>  			 * put_buf does it's thing
>  			 */
>  			bio->bi_end_io = end_sync_write;
> +			bio->bi_next = NULL;
> +		}
>  	}
>
>  	while(sectors) {

Ah - I see what is happening now. I was looking at the vanilla 4.4
code, which doesn't have the failfast changes.

I don't think your patch is correct though. We really shouldn't be
re-using that bio, and setting bi_next to NULL just hides the bug. It
doesn't fix it.
As the rdev is now Faulty, it doesn't make sense for
sync_request_write() to submit a write request to it.

Can you confirm that this works please.

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d2d8b8a5bd56..219f1e1f1d1d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
 		     (i == r1_bio->read_disk ||
 		      !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
 			continue;
+		if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
+			continue;
 
 		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
 		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))

Thanks,
NeilBrown
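The idea behind this alternative (never submit a write to a device already marked Faulty) can be illustrated with a small userspace model. This is a hedged sketch only: `struct mock_mirror` and `mock_count_sync_writes()` are hypothetical names, and the `has_write` flag is an assumed stand-in for the existing bi_end_io checks in sync_request_write().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one mirror leg and its prepared write bio. */
struct mock_mirror {
	bool faulty;	/* models test_bit(Faulty, &rdev->flags) */
	bool has_write;	/* models "this leg has a write bio to submit" */
};

/*
 * Models the write loop: returns how many writes would be submitted.
 * Legs without a prepared write are skipped, and with the extra check
 * Faulty legs are skipped as well, so their bios are never reused.
 */
static size_t mock_count_sync_writes(const struct mock_mirror *mirrors,
				     size_t n)
{
	size_t submitted = 0;

	assert(mirrors != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!mirrors[i].has_write)
			continue;
		if (mirrors[i].faulty)	/* the added check */
			continue;
		submitted++;
	}
	return submitted;
}
```

With a Faulty read disk whose bio was re-tagged for writing, one healthy leg with a write, and one leg with nothing to write, only the healthy leg is counted.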
* Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
From: Michael Wang @ 2017-04-05 7:40 UTC
To: NeilBrown, linux-raid, linux-kernel; +Cc: Shaohua Li, Jinpu Wang

On 04/05/2017 12:17 AM, NeilBrown wrote:
[snip]
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7d67235..0554110 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>  		/* Don't try recovering from here - just fail it
>>  		 * ... unless it is the last working device of course */
>>  		md_error(mddev, rdev);
>> -		if (test_bit(Faulty, &rdev->flags))
>> +		if (test_bit(Faulty, &rdev->flags)) {
>>  			/* Don't try to read from here, but make sure
>>  			 * put_buf does it's thing
>>  			 */
>>  			bio->bi_end_io = end_sync_write;
>> +			bio->bi_next = NULL;
>> +		}
>>  	}
>>
>>  	while(sectors) {
>
>
> Ah - I see what is happening now. I was looking at the vanilla 4.4
> code, which doesn't have the failfast changes.

My bad, I forgot to mention it... yes, our md code is very close to
upstream.

>
> I don't think your patch is correct though. We really shouldn't be
> re-using that bio, and setting bi_next to NULL just hides the bug. It
> doesn't fix it.
> As the rdev is now Faulty, it doesn't make sense for
> sync_request_write() to submit a write request to it.

Makes sense, though I still have concerns regarding the design:
  * in this case, since the read_disk has already been abandoned, is
    it fine to keep r1_bio->read_disk recording the faulty device index?
  * we assign 'end_sync_write' to the original read bio in this
    case, but when is it supposed to be called?

>
> Can you confirm that this works please.

Yes, it works.

Tested-by: Michael Wang <yun.wang@profitbricks.com>

Regards,
Michael Wang

>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d2d8b8a5bd56..219f1e1f1d1d 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
>  		     (i == r1_bio->read_disk ||
>  		      !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
>  			continue;
> +		if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
> +			continue;
>
>  		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
>  		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
>
>
> Thanks,
> NeilBrown
>
* Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
From: NeilBrown @ 2017-04-06 2:03 UTC
To: Michael Wang, linux-raid, linux-kernel; +Cc: Shaohua Li, Jinpu Wang

On Wed, Apr 05 2017, Michael Wang wrote:

> On 04/05/2017 12:17 AM, NeilBrown wrote:
> [snip]
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 7d67235..0554110 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>  		/* Don't try recovering from here - just fail it
>>>  		 * ... unless it is the last working device of course */
>>>  		md_error(mddev, rdev);
>>> -		if (test_bit(Faulty, &rdev->flags))
>>> +		if (test_bit(Faulty, &rdev->flags)) {
>>>  			/* Don't try to read from here, but make sure
>>>  			 * put_buf does it's thing
>>>  			 */
>>>  			bio->bi_end_io = end_sync_write;
>>> +			bio->bi_next = NULL;
>>> +		}
>>>  	}
>>>
>>>  	while(sectors) {
>>
>>
>> Ah - I see what is happening now. I was looking at the vanilla 4.4
>> code, which doesn't have the failfast changes.
>
> My bad to forgot mention... yes our md stuff is very much close to the
> upstream.
>
>>
>> I don't think your patch is correct though. We really shouldn't be
>> re-using that bio, and setting bi_next to NULL just hides the bug. It
>> doesn't fix it.
>> As the rdev is now Faulty, it doesn't make sense for
>> sync_request_write() to submit a write request to it.
>
> Make sense, while still have concerns regarding the design:
>   * in this case since the read_disk already abandoned, is it fine to
>     keep r1_bio->read_disk recording the faulty device index?

I guess we could set it to -1. I'm not sure that would help at all.

>   * we assign the 'end_sync_write' to the original read bio in this
>     case, but when is this supposed to be called?

It isn't called. But the value of ->bi_end_io is tested a couple of
times. Particularly in put_buf(), but also a little further down in
fix_sync_read_error().

>
>>
>> Can you confirm that this works please.
>
> Yes, it works.
>
> Tested-by: Michael Wang <yun.wang@profitbricks.com>

Thanks. I'll add that and submit the patch.

Thanks,
NeilBrown

>
> Regards,
> Michael Wang
>
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index d2d8b8a5bd56..219f1e1f1d1d 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
>>  		     (i == r1_bio->read_disk ||
>>  		      !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
>>  			continue;
>> +		if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
>> +			continue;
>>
>>  		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
>>  		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
>>
>>
>> Thanks,
>> NeilBrown
>>
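The point that ->bi_end_io is tested rather than called can be shown with a small userspace model. This is a loose sketch under a stated assumption: it assumes the cleanup path only checks whether the pointer is set before dropping a per-device pending reference, and `struct mock_sync_bio`, `mock_end_sync_write()` and `mock_put_buf()` are hypothetical names rather than the kernel's own code.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*mock_end_io_t)(void);

/* Hypothetical stand-in for a per-device sync bio. */
struct mock_sync_bio {
	mock_end_io_t bi_end_io;	/* non-NULL marks "a reference is held" */
};

/* Counts invocations so it is visible that the handler never runs. */
static int mock_end_io_calls;

static void mock_end_sync_write(void)
{
	mock_end_io_calls++;
}

/*
 * Models the cleanup path: for every bio whose bi_end_io is set, drop
 * one pending reference on the matching device. The pointer is only
 * compared against NULL and is never invoked. Returns the number of
 * references dropped.
 */
static size_t mock_put_buf(const struct mock_sync_bio *bios,
			   int *nr_pending, size_t n)
{
	size_t dropped = 0;

	assert(bios != NULL && nr_pending != NULL);
	for (size_t i = 0; i < n; i++) {
		if (bios[i].bi_end_io) {
			nr_pending[i]--;
			dropped++;
		}
	}
	return dropped;
}
```

In this model, leaving bi_end_io set on the abandoned read bio is what keeps the reference accounting balanced, even though the handler itself is never executed.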