[PATCH v5 0/2] raid10 bugfix

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v5 0/2] raid10 bugfix
@ 2023-06-01  6:24 linan666
  2023-06-01  6:24 ` [PATCH v5 1/2] md/raid10: fix incorrect done of recovery linan666
  2023-06-01  6:24 ` [PATCH v5 2/2] md/raid10: fix io loss while replacement replace rdev linan666
  0 siblings, 2 replies; 7+ messages in thread
From: linan666 @ 2023-06-01  6:24 UTC (permalink / raw)
  To: song, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

Changes in v5:
 - v4 send wrong patch, correct and resend.

Changes in v4:
 - improve commit log
 - removed applied patches

Li Nan (2):
  md/raid10: fix incorrect done of recovery
  md/raid10: fix io loss while replacement replace rdev

 drivers/md/raid10.c | 40 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 35 insertions(+), 5 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v5 1/2] md/raid10: fix incorrect done of recovery
  2023-06-01  6:24 [PATCH v5 0/2] raid10 bugfix linan666
@ 2023-06-01  6:24 ` linan666
  2023-06-01  7:06   ` Paul Menzel
  2023-06-01  6:24 ` [PATCH v5 2/2] md/raid10: fix io loss while replacement replace rdev linan666
  1 sibling, 1 reply; 7+ messages in thread
From: linan666 @ 2023-06-01  6:24 UTC (permalink / raw)
  To: song, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

In raid10_sync_request(), if data cannot be read from any disk for
recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After
multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will
return 'max_sector', indicating that the recovery has been completed.
However, the recovery is just aborted and the data remains inconsistent.

Fix it by setting mirror->recovery_disabled, which will prevent the spare
disk from being added to this mirror. The same issue also exists during
resync, it will be fixed afterwards.

Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/raid10.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d93d8cb2b620..3ba1516ea160 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 	int chunks_skipped = 0;
 	sector_t chunk_mask = conf->geo.chunk_mask;
 	int page_idx = 0;
+	int error_disk = -1;
 
 	/*
 	 * Allow skipping a full rebuild for incremental assembly
@@ -3386,7 +3387,20 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 		return reshape_request(mddev, sector_nr, skipped);
 
 	if (chunks_skipped >= conf->geo.raid_disks) {
-		/* if there has been nothing to do on any drive,
+		pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
+			test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?  "resync" : "recovery");
+		if (error_disk >= 0 &&
+		    !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+			/*
+			 * recovery fail, set mirrors.recovory_disabled,
+			 * device shouldn't be added to there.
+			 */
+			conf->mirrors[error_disk].recovery_disabled =
+						mddev->recovery_disabled;
+			return 0;
+		}
+		/*
+		 * if there has been nothing to do on any drive,
 		 * then there is nothing to do at all..
 		 */
 		*skipped = 1;
@@ -3638,6 +3652,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 						       mdname(mddev));
 					mirror->recovery_disabled
 						= mddev->recovery_disabled;
+				} else {
+					error_disk = i;
 				}
 				put_buf(r10_bio);
 				if (rb2)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v5 2/2] md/raid10: fix io loss while replacement replace rdev
  2023-06-01  6:24 [PATCH v5 0/2] raid10 bugfix linan666
  2023-06-01  6:24 ` [PATCH v5 1/2] md/raid10: fix incorrect done of recovery linan666
@ 2023-06-01  6:24 ` linan666
  1 sibling, 0 replies; 7+ messages in thread
From: linan666 @ 2023-06-01  6:24 UTC (permalink / raw)
  To: song, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

When removing a disk with replacement, the replacement will be used to
replace rdev. During this process, there is a brief window in which both
rdev and replacement are read as NULL in raid10_write_request(). This
will result in io not being submitted but it should be.

  //remove				//write
  raid10_remove_disk			raid10_write_request
   mirror->rdev = NULL
					 read rdev -> NULL
   mirror->rdev = mirror->replacement
   mirror->replacement = NULL
					 read replacement -> NULL

Fix it by reading replacement first and rdev later, meanwhile, use smp_mb()
to prevent memory reordering.

Fixes: 475b0321a4df ("md/raid10: writes should get directed to replacement as well as original.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid10.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3ba1516ea160..a7830d2c7477 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -779,8 +779,16 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		disk = r10_bio->devs[slot].devnum;
 		rdev = rcu_dereference(conf->mirrors[disk].replacement);
 		if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
-		    r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
+		    r10_bio->devs[slot].addr + sectors >
+		    rdev->recovery_offset) {
+			/*
+			 * Read replacement first to prevent reading both rdev
+			 * and replacement as NULL during replacement replace
+			 * rdev.
+			 */
+			smp_mb();
 			rdev = rcu_dereference(conf->mirrors[disk].rdev);
+		}
 		if (rdev == NULL ||
 		    test_bit(Faulty, &rdev->flags))
 			continue;
@@ -1479,9 +1487,15 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 
 	for (i = 0;  i < conf->copies; i++) {
 		int d = r10_bio->devs[i].devnum;
-		struct md_rdev *rdev = rcu_dereference(conf->mirrors[d].rdev);
-		struct md_rdev *rrdev = rcu_dereference(
-			conf->mirrors[d].replacement);
+		struct md_rdev *rdev, *rrdev;
+
+		rrdev = rcu_dereference(conf->mirrors[d].replacement);
+		/*
+		 * Read replacement first to Prevent reading both rdev and
+		 * replacement as NULL during replacement replace rdev.
+		 */
+		smp_mb();
+		rdev = rcu_dereference(conf->mirrors[d].rdev);
 		if (rdev == rrdev)
 			rrdev = NULL;
 		if (rdev && (test_bit(Faulty, &rdev->flags)))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 1/2] md/raid10: fix incorrect done of recovery
  2023-06-01  6:24 ` [PATCH v5 1/2] md/raid10: fix incorrect done of recovery linan666
@ 2023-06-01  7:06   ` Paul Menzel
  2023-06-02  6:57     ` Li Nan
  0 siblings, 1 reply; 7+ messages in thread
From: Paul Menzel @ 2023-06-01  7:06 UTC (permalink / raw)
  To: linan666
  Cc: song, neilb, linux-raid, linux-kernel, linan122, yukuai3,
	yi.zhang, houtao1, yangerkun

Dear Li,


Thank you for your patch.

Am 01.06.23 um 08:24 schrieb linan666@huaweicloud.com:
> From: Li Nan <linan122@huawei.com>

Unfortunately, I do not understand your commit message summary “fix 
incorrect done of recovery”. Maybe:

Do not add sparse disk when recovery aborts

> In raid10_sync_request(), if data cannot be read from any disk for
> recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After
> multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will
> return 'max_sector', indicating that the recovery has been completed.
> However, the recovery is just aborted and the data remains inconsistent.
> 
> Fix it by setting mirror->recovery_disabled, which will prevent the spare
> disk from being added to this mirror. The same issue also exists during
> resync, it will be fixed afterwards.
> 
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
>   drivers/md/raid10.c | 18 +++++++++++++++++-
>   1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index d93d8cb2b620..3ba1516ea160 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>   	int chunks_skipped = 0;
>   	sector_t chunk_mask = conf->geo.chunk_mask;
>   	int page_idx = 0;
> +	int error_disk = -1;
>   
>   	/*
>   	 * Allow skipping a full rebuild for incremental assembly
> @@ -3386,7 +3387,20 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>   		return reshape_request(mddev, sector_nr, skipped);
>   
>   	if (chunks_skipped >= conf->geo.raid_disks) {
> -		/* if there has been nothing to do on any drive,
> +		pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
> +			test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?  "resync" : "recovery");
> +		if (error_disk >= 0 &&
> +		    !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
> +			/*
> +			 * recovery fail, set mirrors.recovory_disabled,

recov*e*ry

> +			 * device shouldn't be added to there.
> +			 */
> +			conf->mirrors[error_disk].recovery_disabled =
> +						mddev->recovery_disabled;
> +			return 0;
> +		}
> +		/*
> +		 * if there has been nothing to do on any drive,
>   		 * then there is nothing to do at all..

Just one dot/period at the end?

>   		 */
>   		*skipped = 1;
> @@ -3638,6 +3652,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>   						       mdname(mddev));
>   					mirror->recovery_disabled
>   						= mddev->recovery_disabled;
> +				} else {
> +					error_disk = i;
>   				}
>   				put_buf(r10_bio);
>   				if (rb2)


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 1/2] md/raid10: fix incorrect done of recovery
  2023-06-01  7:06   ` Paul Menzel
@ 2023-06-02  6:57     ` Li Nan
  2023-06-02  8:33       ` Paul Menzel
  0 siblings, 1 reply; 7+ messages in thread
From: Li Nan @ 2023-06-02  6:57 UTC (permalink / raw)
  To: Paul Menzel, linan666
  Cc: song, neilb, linux-raid, linux-kernel, yukuai3, yi.zhang,
	houtao1, yangerkun



在 2023/6/1 15:06, Paul Menzel 写道:
> Dear Li,
> 
> 
> Thank you for your patch.
> 
> Am 01.06.23 um 08:24 schrieb linan666@huaweicloud.com:
>> From: Li Nan <linan122@huawei.com>
> 
> Unfortunately, I do not understand your commit message summary “fix 
> incorrect done of recovery”. Maybe:
> 
> Do not add sparse disk when recovery aborts
> 

"recovery fail" is better?

>> In raid10_sync_request(), if data cannot be read from any disk for
>> recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After
>> multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will
>> return 'max_sector', indicating that the recovery has been completed.
>> However, the recovery is just aborted and the data remains inconsistent.
>>
>> Fix it by setting mirror->recovery_disabled, which will prevent the spare
>> disk from being added to this mirror. The same issue also exists during
>> resync, it will be fixed afterwards.
>>
>> Signed-off-by: Li Nan <linan122@huawei.com>
>> ---
>>   drivers/md/raid10.c | 18 +++++++++++++++++-
>>   1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index d93d8cb2b620..3ba1516ea160 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev 
>> *mddev, sector_t sector_nr,
>>       int chunks_skipped = 0;
>>       sector_t chunk_mask = conf->geo.chunk_mask;
>>       int page_idx = 0;
>> +    int error_disk = -1;
>>       /*
>>        * Allow skipping a full rebuild for incremental assembly
>> @@ -3386,7 +3387,20 @@ static sector_t raid10_sync_request(struct 
>> mddev *mddev, sector_t sector_nr,
>>           return reshape_request(mddev, sector_nr, skipped);
>>       if (chunks_skipped >= conf->geo.raid_disks) {
>> -        /* if there has been nothing to do on any drive,
>> +        pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
>> +            test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?  "resync" 
>> : "recovery");
>> +        if (error_disk >= 0 &&
>> +            !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
>> +            /*
>> +             * recovery fail, set mirrors.recovory_disabled,
> 
> recov*e*ry
> 
>> +             * device shouldn't be added to there.
>> +             */
>> +            conf->mirrors[error_disk].recovery_disabled =
>> +                        mddev->recovery_disabled;
>> +            return 0;
>> +        }
>> +        /*
>> +         * if there has been nothing to do on any drive,
>>            * then there is nothing to do at all..
> 
> Just one dot/period at the end?
>
Thanks for your suggestion. I will change it in next version.

-- 
Thanks,
Nan


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 1/2] md/raid10: fix incorrect done of recovery
  2023-06-02  6:57     ` Li Nan
@ 2023-06-02  8:33       ` Paul Menzel
  2023-06-02  9:23         ` Li Nan
  0 siblings, 1 reply; 7+ messages in thread
From: Paul Menzel @ 2023-06-02  8:33 UTC (permalink / raw)
  To: Li Nan
  Cc: song, neilb, linux-raid, linux-kernel, yukuai3, yi.zhang,
	houtao1, yangerkun

Dear Li Nan,


Am 02.06.23 um 08:57 schrieb Li Nan:

> 在 2023/6/1 15:06, Paul Menzel 写道:

>> Am 01.06.23 um 08:24 schrieb linan666@huaweicloud.com:
>>> From: Li Nan <linan122@huawei.com>
>>
>> Unfortunately, I do not understand your commit message summary “fix 
>> incorrect done of recovery”. Maybe:
>>
>> Do not add sparse disk when recovery aborts
> 
> "recovery fail" is better?

I think the grammar is incorrect, and it should be fail*s*.

[…]


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 1/2] md/raid10: fix incorrect done of recovery
  2023-06-02  8:33       ` Paul Menzel
@ 2023-06-02  9:23         ` Li Nan
  0 siblings, 0 replies; 7+ messages in thread
From: Li Nan @ 2023-06-02  9:23 UTC (permalink / raw)
  To: Paul Menzel, Li Nan
  Cc: song, neilb, linux-raid, linux-kernel, yukuai3, yi.zhang,
	houtao1, yangerkun



在 2023/6/2 16:33, Paul Menzel 写道:
> Dear Li Nan,
> 
> 
> Am 02.06.23 um 08:57 schrieb Li Nan:
> 
>> 在 2023/6/1 15:06, Paul Menzel 写道:
> 
>>> Am 01.06.23 um 08:24 schrieb linan666@huaweicloud.com:
>>>> From: Li Nan <linan122@huawei.com>
>>>
>>> Unfortunately, I do not understand your commit message summary “fix 
>>> incorrect done of recovery”. Maybe:
>>>
>>> Do not add sparse disk when recovery aborts
>>
>> "recovery fail" is better?
> 
> I think the grammar is incorrect, and it should be fail*s*.
> 

fix in v7. :)

-- 
Thanks,
Nan


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-06-02  9:27 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-01  6:24 [PATCH v5 0/2] raid10 bugfix linan666
2023-06-01  6:24 ` [PATCH v5 1/2] md/raid10: fix incorrect done of recovery linan666
2023-06-01  7:06   ` Paul Menzel
2023-06-02  6:57     ` Li Nan
2023-06-02  8:33       ` Paul Menzel
2023-06-02  9:23         ` Li Nan
2023-06-01  6:24 ` [PATCH v5 2/2] md/raid10: fix io loss while replacement replace rdev linan666

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.