[PATCH v3 0/4] raid10 bugfix

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3 0/4] raid10 bugfix
@ 2023-05-27  7:22 linan666
  2023-05-27  7:22 ` [PATCH v3 1/4] md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request linan666
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: linan666 @ 2023-05-27  7:22 UTC (permalink / raw)
  To: song, bingjingc, allenpeng, shli, alexwu, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

Changes in v3:
 - in patch 1, change check 'mreplace != NULL' to 'mreplace', improve log.
 - in patch 2, change check 'mrdev != NULL' to 'mrdev'.

Changes in v2:
 - add patch "md/raid10: improve code of mrdev in raid10_sync_request".
 - in patch 3, break lines longer than 80 columns and improve log.

Li Nan (4):
  md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request
  md/raid10: improve code of mrdev in raid10_sync_request
  md/raid10: fix incorrect done of recovery
  md/raid10: fix io loss while replacement replace rdev

 drivers/md/raid10.c | 75 ++++++++++++++++++++++++++++++---------------
 1 file changed, 51 insertions(+), 24 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v3 1/4] md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request
  2023-05-27  7:22 [PATCH v3 0/4] raid10 bugfix linan666
@ 2023-05-27  7:22 ` linan666
  2023-05-27  7:22 ` [PATCH v3 2/4] md/raid10: improve code of mrdev " linan666
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: linan666 @ 2023-05-27  7:22 UTC (permalink / raw)
  To: song, bingjingc, allenpeng, shli, alexwu, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

There are two check of 'mreplace' in raid10_sync_request(). In the first
check, 'need_replace' will be set and 'mreplace' will be used later if
no-Faulty 'mreplace' exists, In the second check, 'mreplace' will be
set to NULL if it is Faulty, but 'need_replace' will not be changed
accordingly. null-ptr-deref occurs if Faulty is set between two check.

Fix it by merging two checks into one. And replace 'need_replace' with
'mreplace' because their values are always the same.

Fixes: ee37d7314a32 ("md/raid10: Fix raid10 replace hang when new added disk faulty")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid10.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 4fcfcb350d2b..0acb4c103c10 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3438,7 +3438,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			int must_sync;
 			int any_working;
 			int need_recover = 0;
-			int need_replace = 0;
 			struct raid10_info *mirror = &conf->mirrors[i];
 			struct md_rdev *mrdev, *mreplace;
 
@@ -3450,11 +3449,10 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			    !test_bit(Faulty, &mrdev->flags) &&
 			    !test_bit(In_sync, &mrdev->flags))
 				need_recover = 1;
-			if (mreplace != NULL &&
-			    !test_bit(Faulty, &mreplace->flags))
-				need_replace = 1;
+			if (mreplace && test_bit(Faulty, &mreplace->flags))
+				mreplace = NULL;
 
-			if (!need_recover && !need_replace) {
+			if (!need_recover && !mreplace) {
 				rcu_read_unlock();
 				continue;
 			}
@@ -3470,8 +3468,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				rcu_read_unlock();
 				continue;
 			}
-			if (mreplace && test_bit(Faulty, &mreplace->flags))
-				mreplace = NULL;
 			/* Unless we are doing a full sync, or a replacement
 			 * we only need to recover the block if it is set in
 			 * the bitmap
@@ -3594,11 +3590,11 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				bio = r10_bio->devs[1].repl_bio;
 				if (bio)
 					bio->bi_end_io = NULL;
-				/* Note: if need_replace, then bio
+				/* Note: if replace is not NULL, then bio
 				 * cannot be NULL as r10buf_pool_alloc will
 				 * have allocated it.
 				 */
-				if (!need_replace)
+				if (!mreplace)
 					break;
 				bio->bi_next = biolist;
 				biolist = bio;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3 2/4] md/raid10: improve code of mrdev in raid10_sync_request
  2023-05-27  7:22 [PATCH v3 0/4] raid10 bugfix linan666
  2023-05-27  7:22 ` [PATCH v3 1/4] md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request linan666
@ 2023-05-27  7:22 ` linan666
  2023-05-27  7:22 ` [PATCH v3 3/4] md/raid10: fix incorrect done of recovery linan666
  2023-05-27  7:22 ` [PATCH v3 4/4] md/raid10: fix io loss while replacement replace rdev linan666
  3 siblings, 0 replies; 7+ messages in thread
From: linan666 @ 2023-05-27  7:22 UTC (permalink / raw)
  To: song, bingjingc, allenpeng, shli, alexwu, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

'need_recover' and 'mrdev' are equivalent in raid10_sync_request(), and
inc mrdev->nr_pending is unreasonable if don't need recovery. Replace
'need_recover' with 'mrdev', and only inc nr_pending when needed.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid10.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0acb4c103c10..d93d8cb2b620 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3437,7 +3437,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			sector_t sect;
 			int must_sync;
 			int any_working;
-			int need_recover = 0;
 			struct raid10_info *mirror = &conf->mirrors[i];
 			struct md_rdev *mrdev, *mreplace;
 
@@ -3445,14 +3444,13 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			mrdev = rcu_dereference(mirror->rdev);
 			mreplace = rcu_dereference(mirror->replacement);
 
-			if (mrdev != NULL &&
-			    !test_bit(Faulty, &mrdev->flags) &&
-			    !test_bit(In_sync, &mrdev->flags))
-				need_recover = 1;
+			if (mrdev && (test_bit(Faulty, &mrdev->flags) ||
+			    test_bit(In_sync, &mrdev->flags)))
+				mrdev = NULL;
 			if (mreplace && test_bit(Faulty, &mreplace->flags))
 				mreplace = NULL;
 
-			if (!need_recover && !mreplace) {
+			if (!mrdev && !mreplace) {
 				rcu_read_unlock();
 				continue;
 			}
@@ -3486,7 +3484,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				rcu_read_unlock();
 				continue;
 			}
-			atomic_inc(&mrdev->nr_pending);
+			if (mrdev)
+				atomic_inc(&mrdev->nr_pending);
 			if (mreplace)
 				atomic_inc(&mreplace->nr_pending);
 			rcu_read_unlock();
@@ -3573,7 +3572,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				r10_bio->devs[1].devnum = i;
 				r10_bio->devs[1].addr = to_addr;
 
-				if (need_recover) {
+				if (mrdev) {
 					bio = r10_bio->devs[1].bio;
 					bio->bi_next = biolist;
 					biolist = bio;
@@ -3618,7 +3617,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 					for (k = 0; k < conf->copies; k++)
 						if (r10_bio->devs[k].devnum == i)
 							break;
-					if (!test_bit(In_sync,
+					if (mrdev && !test_bit(In_sync,
 						      &mrdev->flags)
 					    && !rdev_set_badblocks(
 						    mrdev,
@@ -3644,12 +3643,14 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				if (rb2)
 					atomic_dec(&rb2->remaining);
 				r10_bio = rb2;
-				rdev_dec_pending(mrdev, mddev);
+				if (mrdev)
+					rdev_dec_pending(mrdev, mddev);
 				if (mreplace)
 					rdev_dec_pending(mreplace, mddev);
 				break;
 			}
-			rdev_dec_pending(mrdev, mddev);
+			if (mrdev)
+				rdev_dec_pending(mrdev, mddev);
 			if (mreplace)
 				rdev_dec_pending(mreplace, mddev);
 			if (r10_bio->devs[0].bio->bi_opf & MD_FAILFAST) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3 3/4] md/raid10: fix incorrect done of recovery
  2023-05-27  7:22 [PATCH v3 0/4] raid10 bugfix linan666
  2023-05-27  7:22 ` [PATCH v3 1/4] md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request linan666
  2023-05-27  7:22 ` [PATCH v3 2/4] md/raid10: improve code of mrdev " linan666
@ 2023-05-27  7:22 ` linan666
  2023-05-30 21:55   ` Song Liu
  2023-05-27  7:22 ` [PATCH v3 4/4] md/raid10: fix io loss while replacement replace rdev linan666
  3 siblings, 1 reply; 7+ messages in thread
From: linan666 @ 2023-05-27  7:22 UTC (permalink / raw)
  To: song, bingjingc, allenpeng, shli, alexwu, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

Recovery will go to giveup and let chunks_skipped++ in
raid10_sync_request() if there are some bad_blocks, and it will return
max_sector when chunks_skipped >= geo.raid_disks. Now, recovery fail and
data is inconsistent but user think recovery is done, it is wrong.

Fix it by set mirror's recovery_disabled, spare device will not  be added
to here. The same issue alos exists on resync, it will be fixd in future.

Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/raid10.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d93d8cb2b620..3ba1516ea160 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 	int chunks_skipped = 0;
 	sector_t chunk_mask = conf->geo.chunk_mask;
 	int page_idx = 0;
+	int error_disk = -1;
 
 	/*
 	 * Allow skipping a full rebuild for incremental assembly
@@ -3386,7 +3387,20 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 		return reshape_request(mddev, sector_nr, skipped);
 
 	if (chunks_skipped >= conf->geo.raid_disks) {
-		/* if there has been nothing to do on any drive,
+		pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
+			test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?  "resync" : "recovery");
+		if (error_disk >= 0 &&
+		    !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+			/*
+			 * recovery fail, set mirrors.recovory_disabled,
+			 * device shouldn't be added to there.
+			 */
+			conf->mirrors[error_disk].recovery_disabled =
+						mddev->recovery_disabled;
+			return 0;
+		}
+		/*
+		 * if there has been nothing to do on any drive,
 		 * then there is nothing to do at all..
 		 */
 		*skipped = 1;
@@ -3638,6 +3652,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 						       mdname(mddev));
 					mirror->recovery_disabled
 						= mddev->recovery_disabled;
+				} else {
+					error_disk = i;
 				}
 				put_buf(r10_bio);
 				if (rb2)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3 4/4] md/raid10: fix io loss while replacement replace rdev
  2023-05-27  7:22 [PATCH v3 0/4] raid10 bugfix linan666
                   ` (2 preceding siblings ...)
  2023-05-27  7:22 ` [PATCH v3 3/4] md/raid10: fix incorrect done of recovery linan666
@ 2023-05-27  7:22 ` linan666
  3 siblings, 0 replies; 7+ messages in thread
From: linan666 @ 2023-05-27  7:22 UTC (permalink / raw)
  To: song, bingjingc, allenpeng, shli, alexwu, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

When we remove a disk which has replacement, first set rdev to NULL
and then set replacement to rdev, finally set replacement to NULL (see
raid10_remove_disk()). If io is submitted during the same time, it might
read both rdev and replacement as NULL, and io will not be submitted.

  rdev -> NULL
			read rdev
  replacement -> NULL
			read replacement

Fix it by reading replacement first and rdev later, meanwhile, use smp_mb()
to prevent memory reordering.

Fixes: 475b0321a4df ("md/raid10: writes should get directed to replacement as well as original.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid10.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3ba1516ea160..a7830d2c7477 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -779,8 +779,16 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		disk = r10_bio->devs[slot].devnum;
 		rdev = rcu_dereference(conf->mirrors[disk].replacement);
 		if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
-		    r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
+		    r10_bio->devs[slot].addr + sectors >
+		    rdev->recovery_offset) {
+			/*
+			 * Read replacement first to prevent reading both rdev
+			 * and replacement as NULL during replacement replace
+			 * rdev.
+			 */
+			smp_mb();
 			rdev = rcu_dereference(conf->mirrors[disk].rdev);
+		}
 		if (rdev == NULL ||
 		    test_bit(Faulty, &rdev->flags))
 			continue;
@@ -1479,9 +1487,15 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 
 	for (i = 0;  i < conf->copies; i++) {
 		int d = r10_bio->devs[i].devnum;
-		struct md_rdev *rdev = rcu_dereference(conf->mirrors[d].rdev);
-		struct md_rdev *rrdev = rcu_dereference(
-			conf->mirrors[d].replacement);
+		struct md_rdev *rdev, *rrdev;
+
+		rrdev = rcu_dereference(conf->mirrors[d].replacement);
+		/*
+		 * Read replacement first to Prevent reading both rdev and
+		 * replacement as NULL during replacement replace rdev.
+		 */
+		smp_mb();
+		rdev = rcu_dereference(conf->mirrors[d].rdev);
 		if (rdev == rrdev)
 			rrdev = NULL;
 		if (rdev && (test_bit(Faulty, &rdev->flags)))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 3/4] md/raid10: fix incorrect done of recovery
  2023-05-27  7:22 ` [PATCH v3 3/4] md/raid10: fix incorrect done of recovery linan666
@ 2023-05-30 21:55   ` Song Liu
  2023-05-31  9:31     ` Li Nan
  0 siblings, 1 reply; 7+ messages in thread
From: Song Liu @ 2023-05-30 21:55 UTC (permalink / raw)
  To: linan666
  Cc: bingjingc, allenpeng, shli, alexwu, neilb, linux-raid,
	linux-kernel, linan122, yukuai3, yi.zhang, houtao1, yangerkun

On Sat, May 27, 2023 at 12:24 AM <linan666@huaweicloud.com> wrote:
>
> From: Li Nan <linan122@huawei.com>
>
> Recovery will go to giveup and let chunks_skipped++ in
> raid10_sync_request() if there are some bad_blocks, and it will return
> max_sector when chunks_skipped >= geo.raid_disks. Now, recovery fail and
> data is inconsistent but user think recovery is done, it is wrong.
>
> Fix it by set mirror's recovery_disabled, spare device will not  be added
> to here. The same issue alos exists on resync, it will be fixd in future.
>
> Signed-off-by: Li Nan <linan122@huawei.com>

I applied 1/4 and 2/4 of the set to md-next.

For 3/4 and 4/4, please improve the commit log (rephrase confusing statements,
fix typo's, etc.). Please also add a mdadm test for 3/4.

Thanks,
Song

> ---
>  drivers/md/raid10.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index d93d8cb2b620..3ba1516ea160 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>         int chunks_skipped = 0;
>         sector_t chunk_mask = conf->geo.chunk_mask;
>         int page_idx = 0;
> +       int error_disk = -1;
>
>         /*
>          * Allow skipping a full rebuild for incremental assembly
> @@ -3386,7 +3387,20 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>                 return reshape_request(mddev, sector_nr, skipped);
>
>         if (chunks_skipped >= conf->geo.raid_disks) {
> -               /* if there has been nothing to do on any drive,
> +               pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
> +                       test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?  "resync" : "recovery");
> +               if (error_disk >= 0 &&
> +                   !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
> +                       /*
> +                        * recovery fail, set mirrors.recovory_disabled,
> +                        * device shouldn't be added to there.
> +                        */
> +                       conf->mirrors[error_disk].recovery_disabled =
> +                                               mddev->recovery_disabled;
> +                       return 0;
> +               }
> +               /*
> +                * if there has been nothing to do on any drive,
>                  * then there is nothing to do at all..
>                  */
>                 *skipped = 1;
> @@ -3638,6 +3652,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>                                                        mdname(mddev));
>                                         mirror->recovery_disabled
>                                                 = mddev->recovery_disabled;
> +                               } else {
> +                                       error_disk = i;
>                                 }
>                                 put_buf(r10_bio);
>                                 if (rb2)
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 3/4] md/raid10: fix incorrect done of recovery
  2023-05-30 21:55   ` Song Liu
@ 2023-05-31  9:31     ` Li Nan
  0 siblings, 0 replies; 7+ messages in thread
From: Li Nan @ 2023-05-31  9:31 UTC (permalink / raw)
  To: Song Liu, linan666
  Cc: bingjingc, allenpeng, shli, alexwu, neilb, linux-raid,
	linux-kernel, yukuai3, yi.zhang, houtao1, yangerkun



在 2023/5/31 5:55, Song Liu 写道:
> On Sat, May 27, 2023 at 12:24 AM <linan666@huaweicloud.com> wrote:
>>
>> From: Li Nan <linan122@huawei.com>
>>
>> Recovery will go to giveup and let chunks_skipped++ in
>> raid10_sync_request() if there are some bad_blocks, and it will return
>> max_sector when chunks_skipped >= geo.raid_disks. Now, recovery fail and
>> data is inconsistent but user think recovery is done, it is wrong.
>>
>> Fix it by set mirror's recovery_disabled, spare device will not  be added
>> to here. The same issue alos exists on resync, it will be fixd in future.
>>
>> Signed-off-by: Li Nan <linan122@huawei.com>
> 
> I applied 1/4 and 2/4 of the set to md-next.
> 
> For 3/4 and 4/4, please improve the commit log (rephrase confusing statements,
> fix typo's, etc.). Please also add a mdadm test for 3/4.
> 
> Thanks,
> Song
> 

I will add a test later. Thanks for review.

-- 
Thanks,
Nan


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-05-31  9:32 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-27  7:22 [PATCH v3 0/4] raid10 bugfix linan666
2023-05-27  7:22 ` [PATCH v3 1/4] md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request linan666
2023-05-27  7:22 ` [PATCH v3 2/4] md/raid10: improve code of mrdev " linan666
2023-05-27  7:22 ` [PATCH v3 3/4] md/raid10: fix incorrect done of recovery linan666
2023-05-30 21:55   ` Song Liu
2023-05-31  9:31     ` Li Nan
2023-05-27  7:22 ` [PATCH v3 4/4] md/raid10: fix io loss while replacement replace rdev linan666

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.