linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/4] raid10 bugfix
@ 2023-05-27  7:22 linan666
  2023-05-27  7:22 ` [PATCH v3 1/4] md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request linan666
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: linan666 @ 2023-05-27  7:22 UTC (permalink / raw)
  To: song, bingjingc, allenpeng, shli, alexwu, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

Changes in v3:
 - in patch 1, change check 'mreplace != NULL' to 'mreplace', improve log.
 - in patch 2, change check 'mrdev != NULL' to 'mrdev'.

Changes in v2:
 - add patch "md/raid10: improve code of mrdev in raid10_sync_request".
 - in patch 3, break lines longer than 80 columns and improve log.

Li Nan (4):
  md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request
  md/raid10: improve code of mrdev in raid10_sync_request
  md/raid10: fix incorrect done of recovery
  md/raid10: fix io loss while replacement replace rdev

 drivers/md/raid10.c | 75 ++++++++++++++++++++++++++++++---------------
 1 file changed, 51 insertions(+), 24 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v3 1/4] md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request
  2023-05-27  7:22 [PATCH v3 0/4] raid10 bugfix linan666
@ 2023-05-27  7:22 ` linan666
  2023-05-27  7:22 ` [PATCH v3 2/4] md/raid10: improve code of mrdev " linan666
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: linan666 @ 2023-05-27  7:22 UTC (permalink / raw)
  To: song, bingjingc, allenpeng, shli, alexwu, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

There are two check of 'mreplace' in raid10_sync_request(). In the first
check, 'need_replace' will be set and 'mreplace' will be used later if
no-Faulty 'mreplace' exists, In the second check, 'mreplace' will be
set to NULL if it is Faulty, but 'need_replace' will not be changed
accordingly. null-ptr-deref occurs if Faulty is set between two check.

Fix it by merging two checks into one. And replace 'need_replace' with
'mreplace' because their values are always the same.

Fixes: ee37d7314a32 ("md/raid10: Fix raid10 replace hang when new added disk faulty")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid10.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 4fcfcb350d2b..0acb4c103c10 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3438,7 +3438,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			int must_sync;
 			int any_working;
 			int need_recover = 0;
-			int need_replace = 0;
 			struct raid10_info *mirror = &conf->mirrors[i];
 			struct md_rdev *mrdev, *mreplace;
 
@@ -3450,11 +3449,10 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			    !test_bit(Faulty, &mrdev->flags) &&
 			    !test_bit(In_sync, &mrdev->flags))
 				need_recover = 1;
-			if (mreplace != NULL &&
-			    !test_bit(Faulty, &mreplace->flags))
-				need_replace = 1;
+			if (mreplace && test_bit(Faulty, &mreplace->flags))
+				mreplace = NULL;
 
-			if (!need_recover && !need_replace) {
+			if (!need_recover && !mreplace) {
 				rcu_read_unlock();
 				continue;
 			}
@@ -3470,8 +3468,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				rcu_read_unlock();
 				continue;
 			}
-			if (mreplace && test_bit(Faulty, &mreplace->flags))
-				mreplace = NULL;
 			/* Unless we are doing a full sync, or a replacement
 			 * we only need to recover the block if it is set in
 			 * the bitmap
@@ -3594,11 +3590,11 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				bio = r10_bio->devs[1].repl_bio;
 				if (bio)
 					bio->bi_end_io = NULL;
-				/* Note: if need_replace, then bio
+				/* Note: if replace is not NULL, then bio
 				 * cannot be NULL as r10buf_pool_alloc will
 				 * have allocated it.
 				 */
-				if (!need_replace)
+				if (!mreplace)
 					break;
 				bio->bi_next = biolist;
 				biolist = bio;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3 2/4] md/raid10: improve code of mrdev in raid10_sync_request
  2023-05-27  7:22 [PATCH v3 0/4] raid10 bugfix linan666
  2023-05-27  7:22 ` [PATCH v3 1/4] md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request linan666
@ 2023-05-27  7:22 ` linan666
  2023-05-27  7:22 ` [PATCH v3 3/4] md/raid10: fix incorrect done of recovery linan666
  2023-05-27  7:22 ` [PATCH v3 4/4] md/raid10: fix io loss while replacement replace rdev linan666
  3 siblings, 0 replies; 7+ messages in thread
From: linan666 @ 2023-05-27  7:22 UTC (permalink / raw)
  To: song, bingjingc, allenpeng, shli, alexwu, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

'need_recover' and 'mrdev' are equivalent in raid10_sync_request(), and
inc mrdev->nr_pending is unreasonable if don't need recovery. Replace
'need_recover' with 'mrdev', and only inc nr_pending when needed.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid10.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0acb4c103c10..d93d8cb2b620 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3437,7 +3437,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			sector_t sect;
 			int must_sync;
 			int any_working;
-			int need_recover = 0;
 			struct raid10_info *mirror = &conf->mirrors[i];
 			struct md_rdev *mrdev, *mreplace;
 
@@ -3445,14 +3444,13 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			mrdev = rcu_dereference(mirror->rdev);
 			mreplace = rcu_dereference(mirror->replacement);
 
-			if (mrdev != NULL &&
-			    !test_bit(Faulty, &mrdev->flags) &&
-			    !test_bit(In_sync, &mrdev->flags))
-				need_recover = 1;
+			if (mrdev && (test_bit(Faulty, &mrdev->flags) ||
+			    test_bit(In_sync, &mrdev->flags)))
+				mrdev = NULL;
 			if (mreplace && test_bit(Faulty, &mreplace->flags))
 				mreplace = NULL;
 
-			if (!need_recover && !mreplace) {
+			if (!mrdev && !mreplace) {
 				rcu_read_unlock();
 				continue;
 			}
@@ -3486,7 +3484,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				rcu_read_unlock();
 				continue;
 			}
-			atomic_inc(&mrdev->nr_pending);
+			if (mrdev)
+				atomic_inc(&mrdev->nr_pending);
 			if (mreplace)
 				atomic_inc(&mreplace->nr_pending);
 			rcu_read_unlock();
@@ -3573,7 +3572,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				r10_bio->devs[1].devnum = i;
 				r10_bio->devs[1].addr = to_addr;
 
-				if (need_recover) {
+				if (mrdev) {
 					bio = r10_bio->devs[1].bio;
 					bio->bi_next = biolist;
 					biolist = bio;
@@ -3618,7 +3617,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 					for (k = 0; k < conf->copies; k++)
 						if (r10_bio->devs[k].devnum == i)
 							break;
-					if (!test_bit(In_sync,
+					if (mrdev && !test_bit(In_sync,
 						      &mrdev->flags)
 					    && !rdev_set_badblocks(
 						    mrdev,
@@ -3644,12 +3643,14 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				if (rb2)
 					atomic_dec(&rb2->remaining);
 				r10_bio = rb2;
-				rdev_dec_pending(mrdev, mddev);
+				if (mrdev)
+					rdev_dec_pending(mrdev, mddev);
 				if (mreplace)
 					rdev_dec_pending(mreplace, mddev);
 				break;
 			}
-			rdev_dec_pending(mrdev, mddev);
+			if (mrdev)
+				rdev_dec_pending(mrdev, mddev);
 			if (mreplace)
 				rdev_dec_pending(mreplace, mddev);
 			if (r10_bio->devs[0].bio->bi_opf & MD_FAILFAST) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3 3/4] md/raid10: fix incorrect done of recovery
  2023-05-27  7:22 [PATCH v3 0/4] raid10 bugfix linan666
  2023-05-27  7:22 ` [PATCH v3 1/4] md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request linan666
  2023-05-27  7:22 ` [PATCH v3 2/4] md/raid10: improve code of mrdev " linan666
@ 2023-05-27  7:22 ` linan666
  2023-05-30 21:55   ` Song Liu
  2023-05-27  7:22 ` [PATCH v3 4/4] md/raid10: fix io loss while replacement replace rdev linan666
  3 siblings, 1 reply; 7+ messages in thread
From: linan666 @ 2023-05-27  7:22 UTC (permalink / raw)
  To: song, bingjingc, allenpeng, shli, alexwu, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

Recovery will go to giveup and let chunks_skipped++ in
raid10_sync_request() if there are some bad_blocks, and it will return
max_sector when chunks_skipped >= geo.raid_disks. Now, recovery fail and
data is inconsistent but user think recovery is done, it is wrong.

Fix it by set mirror's recovery_disabled, spare device will not  be added
to here. The same issue alos exists on resync, it will be fixd in future.

Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/raid10.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d93d8cb2b620..3ba1516ea160 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 	int chunks_skipped = 0;
 	sector_t chunk_mask = conf->geo.chunk_mask;
 	int page_idx = 0;
+	int error_disk = -1;
 
 	/*
 	 * Allow skipping a full rebuild for incremental assembly
@@ -3386,7 +3387,20 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 		return reshape_request(mddev, sector_nr, skipped);
 
 	if (chunks_skipped >= conf->geo.raid_disks) {
-		/* if there has been nothing to do on any drive,
+		pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
+			test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?  "resync" : "recovery");
+		if (error_disk >= 0 &&
+		    !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+			/*
+			 * recovery fail, set mirrors.recovory_disabled,
+			 * device shouldn't be added to there.
+			 */
+			conf->mirrors[error_disk].recovery_disabled =
+						mddev->recovery_disabled;
+			return 0;
+		}
+		/*
+		 * if there has been nothing to do on any drive,
 		 * then there is nothing to do at all..
 		 */
 		*skipped = 1;
@@ -3638,6 +3652,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 						       mdname(mddev));
 					mirror->recovery_disabled
 						= mddev->recovery_disabled;
+				} else {
+					error_disk = i;
 				}
 				put_buf(r10_bio);
 				if (rb2)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3 4/4] md/raid10: fix io loss while replacement replace rdev
  2023-05-27  7:22 [PATCH v3 0/4] raid10 bugfix linan666
                   ` (2 preceding siblings ...)
  2023-05-27  7:22 ` [PATCH v3 3/4] md/raid10: fix incorrect done of recovery linan666
@ 2023-05-27  7:22 ` linan666
  3 siblings, 0 replies; 7+ messages in thread
From: linan666 @ 2023-05-27  7:22 UTC (permalink / raw)
  To: song, bingjingc, allenpeng, shli, alexwu, neilb
  Cc: linux-raid, linux-kernel, linan122, yukuai3, yi.zhang, houtao1,
	yangerkun

From: Li Nan <linan122@huawei.com>

When we remove a disk which has replacement, first set rdev to NULL
and then set replacement to rdev, finally set replacement to NULL (see
raid10_remove_disk()). If io is submitted during the same time, it might
read both rdev and replacement as NULL, and io will not be submitted.

  rdev -> NULL
			read rdev
  replacement -> NULL
			read replacement

Fix it by reading replacement first and rdev later, meanwhile, use smp_mb()
to prevent memory reordering.

Fixes: 475b0321a4df ("md/raid10: writes should get directed to replacement as well as original.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid10.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3ba1516ea160..a7830d2c7477 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -779,8 +779,16 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		disk = r10_bio->devs[slot].devnum;
 		rdev = rcu_dereference(conf->mirrors[disk].replacement);
 		if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
-		    r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
+		    r10_bio->devs[slot].addr + sectors >
+		    rdev->recovery_offset) {
+			/*
+			 * Read replacement first to prevent reading both rdev
+			 * and replacement as NULL during replacement replace
+			 * rdev.
+			 */
+			smp_mb();
 			rdev = rcu_dereference(conf->mirrors[disk].rdev);
+		}
 		if (rdev == NULL ||
 		    test_bit(Faulty, &rdev->flags))
 			continue;
@@ -1479,9 +1487,15 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 
 	for (i = 0;  i < conf->copies; i++) {
 		int d = r10_bio->devs[i].devnum;
-		struct md_rdev *rdev = rcu_dereference(conf->mirrors[d].rdev);
-		struct md_rdev *rrdev = rcu_dereference(
-			conf->mirrors[d].replacement);
+		struct md_rdev *rdev, *rrdev;
+
+		rrdev = rcu_dereference(conf->mirrors[d].replacement);
+		/*
+		 * Read replacement first to Prevent reading both rdev and
+		 * replacement as NULL during replacement replace rdev.
+		 */
+		smp_mb();
+		rdev = rcu_dereference(conf->mirrors[d].rdev);
 		if (rdev == rrdev)
 			rrdev = NULL;
 		if (rdev && (test_bit(Faulty, &rdev->flags)))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 3/4] md/raid10: fix incorrect done of recovery
  2023-05-27  7:22 ` [PATCH v3 3/4] md/raid10: fix incorrect done of recovery linan666
@ 2023-05-30 21:55   ` Song Liu
  2023-05-31  9:31     ` Li Nan
  0 siblings, 1 reply; 7+ messages in thread
From: Song Liu @ 2023-05-30 21:55 UTC (permalink / raw)
  To: linan666
  Cc: bingjingc, allenpeng, shli, alexwu, neilb, linux-raid,
	linux-kernel, linan122, yukuai3, yi.zhang, houtao1, yangerkun

On Sat, May 27, 2023 at 12:24 AM <linan666@huaweicloud.com> wrote:
>
> From: Li Nan <linan122@huawei.com>
>
> Recovery will go to giveup and let chunks_skipped++ in
> raid10_sync_request() if there are some bad_blocks, and it will return
> max_sector when chunks_skipped >= geo.raid_disks. Now, recovery fail and
> data is inconsistent but user think recovery is done, it is wrong.
>
> Fix it by set mirror's recovery_disabled, spare device will not  be added
> to here. The same issue alos exists on resync, it will be fixd in future.
>
> Signed-off-by: Li Nan <linan122@huawei.com>

I applied 1/4 and 2/4 of the set to md-next.

For 3/4 and 4/4, please improve the commit log (rephrase confusing statements,
fix typo's, etc.). Please also add a mdadm test for 3/4.

Thanks,
Song

> ---
>  drivers/md/raid10.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index d93d8cb2b620..3ba1516ea160 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>         int chunks_skipped = 0;
>         sector_t chunk_mask = conf->geo.chunk_mask;
>         int page_idx = 0;
> +       int error_disk = -1;
>
>         /*
>          * Allow skipping a full rebuild for incremental assembly
> @@ -3386,7 +3387,20 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>                 return reshape_request(mddev, sector_nr, skipped);
>
>         if (chunks_skipped >= conf->geo.raid_disks) {
> -               /* if there has been nothing to do on any drive,
> +               pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
> +                       test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?  "resync" : "recovery");
> +               if (error_disk >= 0 &&
> +                   !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
> +                       /*
> +                        * recovery fail, set mirrors.recovory_disabled,
> +                        * device shouldn't be added to there.
> +                        */
> +                       conf->mirrors[error_disk].recovery_disabled =
> +                                               mddev->recovery_disabled;
> +                       return 0;
> +               }
> +               /*
> +                * if there has been nothing to do on any drive,
>                  * then there is nothing to do at all..
>                  */
>                 *skipped = 1;
> @@ -3638,6 +3652,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>                                                        mdname(mddev));
>                                         mirror->recovery_disabled
>                                                 = mddev->recovery_disabled;
> +                               } else {
> +                                       error_disk = i;
>                                 }
>                                 put_buf(r10_bio);
>                                 if (rb2)
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 3/4] md/raid10: fix incorrect done of recovery
  2023-05-30 21:55   ` Song Liu
@ 2023-05-31  9:31     ` Li Nan
  0 siblings, 0 replies; 7+ messages in thread
From: Li Nan @ 2023-05-31  9:31 UTC (permalink / raw)
  To: Song Liu, linan666
  Cc: bingjingc, allenpeng, shli, alexwu, neilb, linux-raid,
	linux-kernel, yukuai3, yi.zhang, houtao1, yangerkun



在 2023/5/31 5:55, Song Liu 写道:
> On Sat, May 27, 2023 at 12:24 AM <linan666@huaweicloud.com> wrote:
>>
>> From: Li Nan <linan122@huawei.com>
>>
>> Recovery will go to giveup and let chunks_skipped++ in
>> raid10_sync_request() if there are some bad_blocks, and it will return
>> max_sector when chunks_skipped >= geo.raid_disks. Now, recovery fail and
>> data is inconsistent but user think recovery is done, it is wrong.
>>
>> Fix it by set mirror's recovery_disabled, spare device will not  be added
>> to here. The same issue alos exists on resync, it will be fixd in future.
>>
>> Signed-off-by: Li Nan <linan122@huawei.com>
> 
> I applied 1/4 and 2/4 of the set to md-next.
> 
> For 3/4 and 4/4, please improve the commit log (rephrase confusing statements,
> fix typo's, etc.). Please also add a mdadm test for 3/4.
> 
> Thanks,
> Song
> 

I will add a test later. Thanks for review.

-- 
Thanks,
Nan


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-05-31  9:32 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-27  7:22 [PATCH v3 0/4] raid10 bugfix linan666
2023-05-27  7:22 ` [PATCH v3 1/4] md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request linan666
2023-05-27  7:22 ` [PATCH v3 2/4] md/raid10: improve code of mrdev " linan666
2023-05-27  7:22 ` [PATCH v3 3/4] md/raid10: fix incorrect done of recovery linan666
2023-05-30 21:55   ` Song Liu
2023-05-31  9:31     ` Li Nan
2023-05-27  7:22 ` [PATCH v3 4/4] md/raid10: fix io loss while replacement replace rdev linan666

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).