* [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
@ 2017-04-04 13:50 Michael Wang
2017-04-04 22:17 ` NeilBrown
0 siblings, 1 reply; 6+ messages in thread
From: Michael Wang @ 2017-04-04 13:50 UTC (permalink / raw)
To: linux-raid, linux-kernel; +Cc: Shaohua Li, NeilBrown, Jinpu Wang
During testing we found that the sync read bio can go through the
following path:

  md_do_sync()
    sync_request()
      generic_make_request()
        blk_queue_bio()
          blk_attempt_plug_merge()
            bio->bi_next CHAINED HERE

  ...

  raid1d()
    sync_request_write()
      fix_sync_read_error()
        if FailFast && Faulty
          bio->bi_end_io = end_sync_write
      generic_make_request()
        BUG_ON(bio->bi_next)

This requires the following conditions:
  * the bio was merged at some point
  * the read disk has FailFast enabled
  * the read disk is Faulty

Since the block layer doesn't reset 'bi_next' after the bio
completes inside a request, we hit the BUG_ON() above.

This patch simply resets bi_next before we reuse the bio.

Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
---
drivers/md/raid1.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 		/* Don't try recovering from here - just fail it
 		 * ... unless it is the last working device of course */
 		md_error(mddev, rdev);
-		if (test_bit(Faulty, &rdev->flags))
+		if (test_bit(Faulty, &rdev->flags)) {
 			/* Don't try to read from here, but make sure
 			 * put_buf does it's thing
 			 */
 			bio->bi_end_io = end_sync_write;
+			bio->bi_next = NULL;
+		}
 	}
 
 	while(sectors) {
--
2.5.0
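
The failure mode described in this report can be sketched in plain C. This is a toy user-space model, not kernel code: `struct toy_bio` and every `toy_*` function are hypothetical names, and a front merge is assumed as one way the link ends up set on the bio. It only shows why a link left over from an earlier merge trips a "must be unlinked" check on resubmission, and why clearing the link makes the check pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct bio; only the link field is modeled. */
struct toy_bio {
	struct toy_bio *bi_next;	/* chain link used while the bio sits in a request */
};

/* Models a front merge: the new bio is linked ahead of the request's first bio. */
static void toy_front_merge(struct toy_bio *bio, struct toy_bio *head)
{
	assert(bio != NULL && head != NULL);
	bio->bi_next = head;
}

/* Models completion as described in the report: the link is left untouched. */
static void toy_complete(struct toy_bio *bio)
{
	(void)bio;
}

/* Models the BUG_ON(bio->bi_next) check; false means the kernel would BUG. */
static bool toy_submit_ok(const struct toy_bio *bio)
{
	assert(bio != NULL);
	return bio->bi_next == NULL;
}

/* Runs merge, completion, and resubmission, optionally clearing the link first. */
static bool toy_reuse_after_merge(bool reset_link)
{
	struct toy_bio other = { NULL };
	struct toy_bio sync_read = { NULL };

	toy_front_merge(&sync_read, &other);
	toy_complete(&sync_read);
	if (reset_link)
		sync_read.bi_next = NULL;	/* what the RFC patch does */
	return toy_submit_ok(&sync_read);
}
```

Calling `toy_reuse_after_merge(false)` models the reported path, and `toy_reuse_after_merge(true)` models the path with the RFC patch applied.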
* Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
2017-04-04 13:50 [RFC PATCH] raid1: reset 'bi_next' before reuse the bio Michael Wang
@ 2017-04-04 22:17 ` NeilBrown
0 siblings, 0 replies; 6+ messages in thread
From: NeilBrown @ 2017-04-04 22:17 UTC (permalink / raw)
To: Michael Wang, linux-raid, linux-kernel; +Cc: Shaohua Li, Jinpu Wang
On Tue, Apr 04 2017, Michael Wang wrote:
> During testing we found that the sync read bio can go through the
> following path:
>
>   md_do_sync()
>     sync_request()
>       generic_make_request()
>         blk_queue_bio()
>           blk_attempt_plug_merge()
>             bio->bi_next CHAINED HERE
>
>   ...
>
>   raid1d()
>     sync_request_write()
>       fix_sync_read_error()
>         if FailFast && Faulty
>           bio->bi_end_io = end_sync_write
>       generic_make_request()
>         BUG_ON(bio->bi_next)
>
> This requires the following conditions:
>   * the bio was merged at some point
>   * the read disk has FailFast enabled
>   * the read disk is Faulty
>
> Since the block layer doesn't reset 'bi_next' after the bio
> completes inside a request, we hit the BUG_ON() above.
>
> This patch simply resets bi_next before we reuse the bio.
>
> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
> ---
> drivers/md/raid1.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7d67235..0554110 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
> /* Don't try recovering from here - just fail it
> * ... unless it is the last working device of course */
> md_error(mddev, rdev);
> - if (test_bit(Faulty, &rdev->flags))
> + if (test_bit(Faulty, &rdev->flags)) {
> /* Don't try to read from here, but make sure
> * put_buf does it's thing
> */
> bio->bi_end_io = end_sync_write;
> + bio->bi_next = NULL;
> + }
> }
>
> while(sectors) {
Ah - I see what is happening now. I was looking at the vanilla 4.4
code, which doesn't have the failfast changes.
I don't think your patch is correct though. We really shouldn't be
re-using that bio, and setting bi_next to NULL just hides the bug. It
doesn't fix it.
As the rdev is now Faulty, it doesn't make sense for
sync_request_write() to submit a write request to it.
Can you confirm that this works please.
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d2d8b8a5bd56..219f1e1f1d1d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
 		     (i == r1_bio->read_disk ||
 		      !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
 			continue;
+		if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
+			continue;
 
 		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
 		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
Thanks,
NeilBrown
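
The idea behind the proposed fix can be sketched as a small user-space model in C. It is only an illustration under stated assumptions: `struct toy_mirror`, its fields, and `toy_count_sync_writes()` are hypothetical names, and the single `has_write` flag stands in for the more involved `bi_end_io` test that the real loop in sync_request_write() performs. The point is that a device marked Faulty is skipped before any write is prepared for it, so the stale bio is never resubmitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one mirror slot; both fields are simplifying assumptions. */
struct toy_mirror {
	bool faulty;	/* models test_bit(Faulty, &rdev->flags) */
	bool has_write;	/* models "this slot has a bio that should be written" */
};

/* Returns how many writes the loop would submit for the given slots. */
static size_t toy_count_sync_writes(const struct toy_mirror *m, size_t n)
{
	size_t submitted = 0;

	assert(m != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!m[i].has_write)
			continue;	/* nothing to write for this slot */
		if (m[i].faulty)
			continue;	/* the added check: never write to a Faulty device */
		submitted++;
	}
	return submitted;
}
```

With three slots where one is healthy with a pending write, one is Faulty with a pending write, and one has nothing to write, only the first slot is counted.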
* Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
2017-04-04 22:17 ` NeilBrown
(?)
@ 2017-04-05 7:40 ` Michael Wang
2017-04-06 2:03 ` NeilBrown
-1 siblings, 1 reply; 6+ messages in thread
From: Michael Wang @ 2017-04-05 7:40 UTC (permalink / raw)
To: NeilBrown, linux-raid, linux-kernel; +Cc: Shaohua Li, Jinpu Wang
On 04/05/2017 12:17 AM, NeilBrown wrote:
[snip]
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7d67235..0554110 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>> /* Don't try recovering from here - just fail it
>> * ... unless it is the last working device of course */
>> md_error(mddev, rdev);
>> - if (test_bit(Faulty, &rdev->flags))
>> + if (test_bit(Faulty, &rdev->flags)) {
>> /* Don't try to read from here, but make sure
>> * put_buf does it's thing
>> */
>> bio->bi_end_io = end_sync_write;
>> + bio->bi_next = NULL;
>> + }
>> }
>>
>> while(sectors) {
>
>
> Ah - I see what is happening now. I was looking at the vanilla 4.4
> code, which doesn't have the failfast changes.
My bad, I forgot to mention it... yes, our md code is very close to
upstream.
>
> I don't think your patch is correct though. We really shouldn't be
> re-using that bio, and setting bi_next to NULL just hides the bug. It
> doesn't fix it.
> As the rdev is now Faulty, it doesn't make sense for
> sync_request_write() to submit a write request to it.
Makes sense, though I still have concerns regarding the design:
* in this case, since the read_disk has already been abandoned, is it
fine to keep r1_bio->read_disk recording the faulty device index?
* we assign 'end_sync_write' to the original read bio in this
case, but when is it supposed to be called?
>
> Can you confirm that this works please.
Yes, it works.
Tested-by: Michael Wang <yun.wang@profitbricks.com>
Regards,
Michael Wang
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d2d8b8a5bd56..219f1e1f1d1d 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
> (i == r1_bio->read_disk ||
> !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
> continue;
> + if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
> + continue;
>
> bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
> if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
>
>
> Thanks,
> NeilBrown
>
* Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
2017-04-05 7:40 ` Michael Wang
@ 2017-04-06 2:03 ` NeilBrown
0 siblings, 0 replies; 6+ messages in thread
From: NeilBrown @ 2017-04-06 2:03 UTC (permalink / raw)
To: Michael Wang, linux-raid, linux-kernel; +Cc: Shaohua Li, Jinpu Wang
On Wed, Apr 05 2017, Michael Wang wrote:
> On 04/05/2017 12:17 AM, NeilBrown wrote:
> [snip]
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 7d67235..0554110 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>> /* Don't try recovering from here - just fail it
>>> * ... unless it is the last working device of course */
>>> md_error(mddev, rdev);
>>> - if (test_bit(Faulty, &rdev->flags))
>>> + if (test_bit(Faulty, &rdev->flags)) {
>>> /* Don't try to read from here, but make sure
>>> * put_buf does it's thing
>>> */
>>> bio->bi_end_io = end_sync_write;
>>> + bio->bi_next = NULL;
>>> + }
>>> }
>>>
>>> while(sectors) {
>>
>>
>> Ah - I see what is happening now. I was looking at the vanilla 4.4
>> code, which doesn't have the failfast changes.
>
> My bad, I forgot to mention it... yes, our md code is very close to
> upstream.
>
>>
>> I don't think your patch is correct though. We really shouldn't be
>> re-using that bio, and setting bi_next to NULL just hides the bug. It
>> doesn't fix it.
>> As the rdev is now Faulty, it doesn't make sense for
>> sync_request_write() to submit a write request to it.
>
> Makes sense, though I still have concerns regarding the design:
> * in this case, since the read_disk has already been abandoned, is it
> fine to keep r1_bio->read_disk recording the faulty device index?
I guess we could set it to -1. I'm not sure that would help at all.
> * we assign 'end_sync_write' to the original read bio in this
> case, but when is it supposed to be called?
It isn't called. But the value of ->bi_end_io is tested a couple of
times, particularly in put_buf(), but also a little further down in
fix_sync_read_error().
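
The "tested, not called" use of ->bi_end_io can be sketched in a toy C model. Everything here is a hypothetical simplification: `struct toy_bio`, the `toy_*` handlers, and the two counting helpers are made-up names, and the exact checks in put_buf() and fix_sync_read_error() are assumed rather than copied. The sketch only shows a function pointer serving as a state tag that is compared for identity and never invoked.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bio;
typedef void (*toy_end_io_t)(struct toy_bio *);

/* Toy stand-in for struct bio; only the completion handler is modeled. */
struct toy_bio {
	toy_end_io_t bi_end_io;
};

static int toy_read_calls;	/* counts real invocations of the read handler */
static int toy_write_calls;	/* counts real invocations of the write handler */

static void toy_end_sync_read(struct toy_bio *bio)
{
	(void)bio;
	toy_read_calls++;
}

static void toy_end_sync_write(struct toy_bio *bio)
{
	(void)bio;
	toy_write_calls++;
}

/* put_buf()-like check: a slot with any handler set still holds a reference. */
static size_t toy_count_refs_to_drop(const struct toy_bio *bios, size_t n)
{
	size_t refs = 0;

	assert(bios != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (bios[i].bi_end_io != NULL)
			refs++;
	return refs;
}

/* Read-retry-like check: only slots still tagged as reads are candidates. */
static size_t toy_count_read_candidates(const struct toy_bio *bios, size_t n)
{
	size_t candidates = 0;

	assert(bios != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (bios[i].bi_end_io == toy_end_sync_read)
			candidates++;
	return candidates;
}
```

Retagging a slot from the read handler to the write handler keeps it counted for reference dropping while removing it from the read candidates, and neither handler ever runs.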
>
>>
>> Can you confirm that this works please.
>
> Yes, it works.
>
> Tested-by: Michael Wang <yun.wang@profitbricks.com>
Thanks. I'll add that and submit the patch.
Thanks,
NeilBrown
>
> Regards,
> Michael Wang
>
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index d2d8b8a5bd56..219f1e1f1d1d 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
>> (i == r1_bio->read_disk ||
>> !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
>> continue;
>> + if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
>> + continue;
>>
>> bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
>> if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
>>
>>
>> Thanks,
>> NeilBrown
>>