* [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
@ 2017-04-04 13:50 Michael Wang
  2017-04-04 22:17   ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Wang @ 2017-04-04 13:50 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: Shaohua Li, NeilBrown, Jinpu Wang


During testing we found that the sync read bio can go through
the following path:

  md_do_sync()
    sync_request()
      generic_make_request()
        blk_queue_bio()
          blk_attempt_plug_merge()
            bio->bi_next CHAINED HERE

  ...

  raid1d()
    sync_request_write()
      fix_sync_read_error()
        if FailFast && Faulty
          bio->bi_end_io = end_sync_write
      generic_make_request()
        BUG_ON(bio->bi_next)

This requires the following conditions to be met:
  * the bio was merged at least once
  * the read disk has FailFast enabled
  * the read disk is Faulty

Since the block layer does not reset 'bi_next' after the bio
completes inside the request, we hit the BUG.
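
To illustrate, a rough sketch of the reuse that trips the check
(simplified pseudo-code, not the actual block layer source):

  /* resync read: plug merging can leave the bio chained into a
   * request, i.e. bio->bi_next still points at the next bio */
  generic_make_request(bio);

  /* ... later the FailFast read fails, the rdev turns Faulty, and
   * the very same bio is resubmitted without being reinitialised */
  generic_make_request(bio);      /* BUG_ON(bio->bi_next) fires */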

This patch simply resets bi_next before the bio is reused.

Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
---
 drivers/md/raid1.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 		/* Don't try recovering from here - just fail it
 		 * ... unless it is the last working device of course */
 		md_error(mddev, rdev);
-		if (test_bit(Faulty, &rdev->flags))
+		if (test_bit(Faulty, &rdev->flags)) {
 			/* Don't try to read from here, but make sure
 			 * put_buf does it's thing
 			 */
 			bio->bi_end_io = end_sync_write;
+			bio->bi_next = NULL;
+		}
 	}
 
 	while(sectors) {
-- 
2.5.0

* Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
  2017-04-04 13:50 [RFC PATCH] raid1: reset 'bi_next' before reuse the bio Michael Wang
@ 2017-04-04 22:17   ` NeilBrown
  0 siblings, 0 replies; 6+ messages in thread
From: NeilBrown @ 2017-04-04 22:17 UTC (permalink / raw)
  To: Michael Wang, linux-raid, linux-kernel; +Cc: Shaohua Li, Jinpu Wang

On Tue, Apr 04 2017, Michael Wang wrote:

> During testing we found that the sync read bio can go through
> the following path:
>
>   md_do_sync()
>     sync_request()
>       generic_make_request()
>         blk_queue_bio()
>           blk_attempt_plug_merge()
>             bio->bi_next CHAINED HERE
>
>   ...
>
>   raid1d()
>     sync_request_write()
>       fix_sync_read_error()
>         if FailFast && Faulty
>           bio->bi_end_io = end_sync_write
>       generic_make_request()
>         BUG_ON(bio->bi_next)
>
> This requires the following conditions to be met:
>   * the bio was merged at least once
>   * the read disk has FailFast enabled
>   * the read disk is Faulty
>
> Since the block layer does not reset 'bi_next' after the bio
> completes inside the request, we hit the BUG.
>
> This patch simply resets bi_next before the bio is reused.
>
> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
> ---
>  drivers/md/raid1.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7d67235..0554110 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>  		/* Don't try recovering from here - just fail it
>  		 * ... unless it is the last working device of course */
>  		md_error(mddev, rdev);
> -		if (test_bit(Faulty, &rdev->flags))
> +		if (test_bit(Faulty, &rdev->flags)) {
>  			/* Don't try to read from here, but make sure
>  			 * put_buf does it's thing
>  			 */
>  			bio->bi_end_io = end_sync_write;
> +			bio->bi_next = NULL;
> +		}
>  	}
>  
>  	while(sectors) {


Ah - I see what is happening now.  I was looking at the vanilla 4.4
code, which doesn't have the failfast changes.

I don't think your patch is correct though.  We really shouldn't be
re-using that bio, and setting bi_next to NULL just hides the bug.  It
doesn't fix it.
As the rdev is now Faulty, it doesn't make sense for
sync_request_write() to submit a write request to it.

Can you confirm that this works please.

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d2d8b8a5bd56..219f1e1f1d1d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
 		     (i == r1_bio->read_disk ||
 		      !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
 			continue;
+		if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
+			continue;
 
 		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
 		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))


Thanks,
NeilBrown

* Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
  2017-04-04 22:17   ` NeilBrown
@ 2017-04-05  7:40   ` Michael Wang
  2017-04-06  2:03       ` NeilBrown
  -1 siblings, 1 reply; 6+ messages in thread
From: Michael Wang @ 2017-04-05  7:40 UTC (permalink / raw)
  To: NeilBrown, linux-raid, linux-kernel; +Cc: Shaohua Li, Jinpu Wang



On 04/05/2017 12:17 AM, NeilBrown wrote:
[snip]
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7d67235..0554110 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>  		/* Don't try recovering from here - just fail it
>>  		 * ... unless it is the last working device of course */
>>  		md_error(mddev, rdev);
>> -		if (test_bit(Faulty, &rdev->flags))
>> +		if (test_bit(Faulty, &rdev->flags)) {
>>  			/* Don't try to read from here, but make sure
>>  			 * put_buf does it's thing
>>  			 */
>>  			bio->bi_end_io = end_sync_write;
>> +			bio->bi_next = NULL;
>> +		}
>>  	}
>>  
>>  	while(sectors) {
> 
> 
> Ah - I see what is happening now.  I was looking at the vanilla 4.4
> code, which doesn't have the failfast changes.

My bad, I forgot to mention it... yes, our md code is very close to
upstream.

> 
> I don't think your patch is correct though.  We really shouldn't be
> re-using that bio, and setting bi_next to NULL just hides the bug.  It
> doesn't fix it.
> As the rdev is now Faulty, it doesn't make sense for
> sync_request_write() to submit a write request to it.

Makes sense, though I still have concerns regarding the design:
  * in this case, since the read disk has already been abandoned, is
    it fine for r1_bio->read_disk to keep recording the faulty device
    index?
  * we assign 'end_sync_write' to the original read bio in this case,
    but when is it supposed to be called?

> 
> Can you confirm that this works please.

Yes, it works.

Tested-by: Michael Wang <yun.wang@profitbricks.com>

Regards,
Michael Wang

> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d2d8b8a5bd56..219f1e1f1d1d 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
>  		     (i == r1_bio->read_disk ||
>  		      !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
>  			continue;
> +		if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
> +			continue;
>  
>  		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
>  		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
> 
> 
> Thanks,
> NeilBrown
> 

* Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
  2017-04-05  7:40   ` Michael Wang
@ 2017-04-06  2:03       ` NeilBrown
  0 siblings, 0 replies; 6+ messages in thread
From: NeilBrown @ 2017-04-06  2:03 UTC (permalink / raw)
  To: Michael Wang, linux-raid, linux-kernel; +Cc: Shaohua Li, Jinpu Wang

On Wed, Apr 05 2017, Michael Wang wrote:

> On 04/05/2017 12:17 AM, NeilBrown wrote:
> [snip]
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 7d67235..0554110 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>  		/* Don't try recovering from here - just fail it
>>>  		 * ... unless it is the last working device of course */
>>>  		md_error(mddev, rdev);
>>> -		if (test_bit(Faulty, &rdev->flags))
>>> +		if (test_bit(Faulty, &rdev->flags)) {
>>>  			/* Don't try to read from here, but make sure
>>>  			 * put_buf does it's thing
>>>  			 */
>>>  			bio->bi_end_io = end_sync_write;
>>> +			bio->bi_next = NULL;
>>> +		}
>>>  	}
>>>  
>>>  	while(sectors) {
>> 
>> 
>> Ah - I see what is happening now.  I was looking at the vanilla 4.4
>> code, which doesn't have the failfast changes.
>
> My bad, I forgot to mention it... yes, our md code is very close to
> upstream.
>
>> 
>> I don't think your patch is correct though.  We really shouldn't be
>> re-using that bio, and setting bi_next to NULL just hides the bug.  It
>> doesn't fix it.
>> As the rdev is now Faulty, it doesn't make sense for
>> sync_request_write() to submit a write request to it.
>
> Makes sense, though I still have concerns regarding the design:
>   * in this case, since the read disk has already been abandoned, is
>     it fine for r1_bio->read_disk to keep recording the faulty device
>     index?

I guess we could set it to -1.  I'm not sure that would help at all.


>   * we assign 'end_sync_write' to the original read bio in this case,
>     but when is it supposed to be called?

It isn't called.  But the value of ->bi_end_io is tested a couple of
times, particularly in put_buf(), and also a little further down in
fix_sync_read_error().
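
(For reference, roughly the check meant here; a from-memory sketch,
not the verbatim raid1.c code:)

	/* put_buf(): only devices whose sync bio has an end_io handler
	 * installed are treated as having an outstanding request */
	for (i = 0; i < conf->raid_disks * 2; i++) {
		struct bio *bio = r1_bio->bios[i];
		if (bio->bi_end_io)
			rdev_dec_pending(conf->mirrors[i].rdev, mddev);
	}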

>
>> 
>> Can you confirm that this works please.
>
> Yes, it works.
>
> Tested-by: Michael Wang <yun.wang@profitbricks.com>

Thanks.  I'll add that and submit the patch.

Thanks,
NeilBrown

>
> Regards,
> Michael Wang
>
>> 
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index d2d8b8a5bd56..219f1e1f1d1d 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
>>  		     (i == r1_bio->read_disk ||
>>  		      !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
>>  			continue;
>> +		if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
>> +			continue;
>>  
>>  		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
>>  		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
>> 
>> 
>> Thanks,
>> NeilBrown
>> 
