All of lore.kernel.org
* [md PATCH 0/2] Fix problems with reshape on degraded raid5/6
@ 2017-10-17  5:18 NeilBrown
  2017-10-17  5:18 ` [md PATCH 1/2] raid5: Set R5_Expanded on parity devices as well as data NeilBrown
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: NeilBrown @ 2017-10-17  5:18 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

Reshape on a degraded array is supposed to work, but it
obviously hasn't been tested.
These two patches make at least one test case (reported by
Curt <lightspd@gmail.com>) work.

Thanks,
NeilBrown


---

NeilBrown (2):
      raid5: Set R5_Expanded on parity devices as well as data.
      md: be cautious about using ->curr_resync_completed for ->recovery_offset


 drivers/md/md.c    |   32 +++++++++++++++++++++-----------
 drivers/md/raid5.c |   23 ++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 12 deletions(-)

--
Signature



* [md PATCH 1/2] raid5: Set R5_Expanded on parity devices as well as data.
  2017-10-17  5:18 [md PATCH 0/2] Fix problems with reshape on degraded raid5/6 NeilBrown
@ 2017-10-17  5:18 ` NeilBrown
  2017-10-17  5:18 ` [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset NeilBrown
  2017-10-19  3:06 ` [md PATCH 0/2] Fix problems with reshape on degraded raid5/6 Shaohua Li
  2 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2017-10-17  5:18 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

When reshaping a fully degraded raid5/raid6 to a larger
number of devices, the new device(s) are not in-sync
and so can make the newly grown stripe appear to be
"failed".
To avoid this, we set the R5_Expanded flag to say "Even though
this device is not fully in-sync, this block is safe so
don't treat the device as failed for this stripe".
This flag is set for data devices, but not for parity devices.

Consequently, if you have a RAID6 with two devices that are partly
recovered and a spare, and start a reshape to include the spare,
then when the reshape gets past the point where the recovery was
up to, it will think the stripes are failed and will get into
an infinite loop, failing to make progress.

So when constructing parity on an EXPAND_READY stripe,
set R5_Expanded.

Reported-by: Curt <lightspd@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e070e5c68801..a8df52130f8a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1818,8 +1818,11 @@ static void ops_complete_reconstruct(void *stripe_head_ref)
 		struct r5dev *dev = &sh->dev[i];
 
 		if (dev->written || i == pd_idx || i == qd_idx) {
-			if (!discard && !test_bit(R5_SkipCopy, &dev->flags))
+			if (!discard && !test_bit(R5_SkipCopy, &dev->flags)) {
 				set_bit(R5_UPTODATE, &dev->flags);
+				if (test_bit(STRIPE_EXPAND_READY, &sh->state))
+					set_bit(R5_Expanded, &dev->flags);
+			}
 			if (fua)
 				set_bit(R5_WantFUA, &dev->flags);
 			if (sync)




* [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset
  2017-10-17  5:18 [md PATCH 0/2] Fix problems with reshape on degraded raid5/6 NeilBrown
  2017-10-17  5:18 ` [md PATCH 1/2] raid5: Set R5_Expanded on parity devices as well as data NeilBrown
@ 2017-10-17  5:18 ` NeilBrown
  2017-10-31 14:42   ` Tomasz Majchrzak
  2017-10-19  3:06 ` [md PATCH 0/2] Fix problems with reshape on degraded raid5/6 Shaohua Li
  2 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2017-10-17  5:18 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

The ->recovery_offset shows how much of a non-InSync device is actually
in sync - how much has been recovered.

When performing a recovery, ->curr_resync and ->curr_resync_completed
follow the device address being recovered and so can be used to update
->recovery_offset.

When performing a reshape, ->curr_resync* might follow the device
addresses (raid5) or might follow array addresses (raid10), so cannot
in general be used to set ->recovery_offset.  When reshaping backwards,
->curr_resync* measures from the *end* of the array-or-device, so is
particularly unhelpful.

So change the common code in md.c to only use ->curr_resync_completed
for the simple recovery case, and add code to raid5.c to update
->recovery_offset during a forwards reshape.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c    |   32 +++++++++++++++++++++-----------
 drivers/md/raid5.c |   18 ++++++++++++++++++
 2 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a9f1352b3849..d1dfc9879368 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2453,10 +2453,17 @@ void md_update_sb(struct mddev *mddev, int force_change)
 		}
 	}
 
-	/* First make sure individual recovery_offsets are correct */
+	/* First make sure individual recovery_offsets are correct
+	 * curr_resync_completed can only be used during recovery.
+	 * During reshape/resync it might use array-addresses rather
+	 * that device addresses.
+	 */
 	rdev_for_each(rdev, mddev) {
 		if (rdev->raid_disk >= 0 &&
 		    mddev->delta_disks >= 0 &&
+		    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+		    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
+		    !test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
 		    !test_bit(Journal, &rdev->flags) &&
 		    !test_bit(In_sync, &rdev->flags) &&
 		    mddev->curr_resync_completed > rdev->recovery_offset)
@@ -8507,16 +8514,19 @@ void md_do_sync(struct md_thread *thread)
 		} else {
 			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 				mddev->curr_resync = MaxSector;
-			rcu_read_lock();
-			rdev_for_each_rcu(rdev, mddev)
-				if (rdev->raid_disk >= 0 &&
-				    mddev->delta_disks >= 0 &&
-				    !test_bit(Journal, &rdev->flags) &&
-				    !test_bit(Faulty, &rdev->flags) &&
-				    !test_bit(In_sync, &rdev->flags) &&
-				    rdev->recovery_offset < mddev->curr_resync)
-					rdev->recovery_offset = mddev->curr_resync;
-			rcu_read_unlock();
+			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
+				rcu_read_lock();
+				rdev_for_each_rcu(rdev, mddev)
+					if (rdev->raid_disk >= 0 &&
+					    mddev->delta_disks >= 0 &&
+					    !test_bit(Journal, &rdev->flags) &&
+					    !test_bit(Faulty, &rdev->flags) &&
+					    !test_bit(In_sync, &rdev->flags) &&
+					    rdev->recovery_offset < mddev->curr_resync)
+						rdev->recovery_offset = mddev->curr_resync;
+				rcu_read_unlock();
+			}
 		}
 	}
  skip:
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a8df52130f8a..89ad79614309 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5739,6 +5739,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	 */
 	struct r5conf *conf = mddev->private;
 	struct stripe_head *sh;
+	struct md_rdev *rdev;
 	sector_t first_sector, last_sector;
 	int raid_disks = conf->previous_raid_disks;
 	int data_disks = raid_disks - conf->max_degraded;
@@ -5861,6 +5862,15 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 			return 0;
 		mddev->reshape_position = conf->reshape_progress;
 		mddev->curr_resync_completed = sector_nr;
+		if (!mddev->reshape_backwards)
+			/* Can update recovery_offset */
+			rdev_for_each(rdev, mddev)
+				if (rdev->raid_disk >= 0 &&
+				    !test_bit(Journal, &rdev->flags) &&
+				    !test_bit(In_sync, &rdev->flags) &&
+				    rdev->recovery_offset < sector_nr)
+					rdev->recovery_offset = sector_nr;
+
 		conf->reshape_checkpoint = jiffies;
 		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		md_wakeup_thread(mddev->thread);
@@ -5959,6 +5969,14 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 			goto ret;
 		mddev->reshape_position = conf->reshape_progress;
 		mddev->curr_resync_completed = sector_nr;
+		if (!mddev->reshape_backwards)
+			/* Can update recovery_offset */
+			rdev_for_each(rdev, mddev)
+				if (rdev->raid_disk >= 0 &&
+				    !test_bit(Journal, &rdev->flags) &&
+				    !test_bit(In_sync, &rdev->flags) &&
+				    rdev->recovery_offset < sector_nr)
+					rdev->recovery_offset = sector_nr;
 		conf->reshape_checkpoint = jiffies;
 		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		md_wakeup_thread(mddev->thread);




* Re: [md PATCH 0/2] Fix problems with reshape on degraded raid5/6
  2017-10-17  5:18 [md PATCH 0/2] Fix problems with reshape on degraded raid5/6 NeilBrown
  2017-10-17  5:18 ` [md PATCH 1/2] raid5: Set R5_Expanded on parity devices as well as data NeilBrown
  2017-10-17  5:18 ` [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset NeilBrown
@ 2017-10-19  3:06 ` Shaohua Li
  2 siblings, 0 replies; 10+ messages in thread
From: Shaohua Li @ 2017-10-19  3:06 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, Oct 17, 2017 at 04:18:36PM +1100, Neil Brown wrote:
> Reshape on a degraded array is supposed to work, but it
> obviously hasn't been tested.
> These two patches make at least one test case (reported by
> Curt <lightspd@gmail.com>) work.

Applied, thanks!


* Re: [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset
  2017-10-17  5:18 ` [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset NeilBrown
@ 2017-10-31 14:42   ` Tomasz Majchrzak
  2017-11-01  1:13     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Tomasz Majchrzak @ 2017-10-31 14:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid

On Tue, Oct 17, 2017 at 04:18:36PM +1100, NeilBrown wrote:
> The ->recovery_offset shows how much of a non-InSync device is actually
> in sync - how much has been recovered.
> 
> When performing a recovery, ->curr_resync and ->curr_resync_completed
> follow the device address being recovered and so can be used to update
> ->recovery_offset.
> 
> When performing a reshape, ->curr_resync* might follow the device
> addresses (raid5) or might follow array addresses (raid10), so cannot
> in general be used to set ->recovery_offset.  When reshaping backwards,
> ->curr_resync* measures from the *end* of the array-or-device, so is
> particularly unhelpful.
> 
> So change the common code in md.c to only use ->curr_resync_completed
> for the simple recovery case, and add code to raid5.c to update
> ->recovery_offset during a forwards reshape.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/md.c    |   32 +++++++++++++++++++++-----------
>  drivers/md/raid5.c |   18 ++++++++++++++++++
>  2 files changed, 39 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index a9f1352b3849..d1dfc9879368 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2453,10 +2453,17 @@ void md_update_sb(struct mddev *mddev, int force_change)
>  		}
>  	}
>  
> -	/* First make sure individual recovery_offsets are correct */
> +	/* First make sure individual recovery_offsets are correct
> +	 * curr_resync_completed can only be used during recovery.
> +	 * During reshape/resync it might use array-addresses rather
> +	 * that device addresses.
> +	 */
>  	rdev_for_each(rdev, mddev) {
>  		if (rdev->raid_disk >= 0 &&
>  		    mddev->delta_disks >= 0 &&
> +		    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> +		    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
> +		    !test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>  		    !test_bit(Journal, &rdev->flags) &&
>  		    !test_bit(In_sync, &rdev->flags) &&
>  		    mddev->curr_resync_completed > rdev->recovery_offset)
> @@ -8507,16 +8514,19 @@ void md_do_sync(struct md_thread *thread)
>  		} else {
>  			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
>  				mddev->curr_resync = MaxSector;
> -			rcu_read_lock();
> -			rdev_for_each_rcu(rdev, mddev)
> -				if (rdev->raid_disk >= 0 &&
> -				    mddev->delta_disks >= 0 &&
> -				    !test_bit(Journal, &rdev->flags) &&
> -				    !test_bit(Faulty, &rdev->flags) &&
> -				    !test_bit(In_sync, &rdev->flags) &&
> -				    rdev->recovery_offset < mddev->curr_resync)
> -					rdev->recovery_offset = mddev->curr_resync;
> -			rcu_read_unlock();
> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
> +				rcu_read_lock();
> +				rdev_for_each_rcu(rdev, mddev)
> +					if (rdev->raid_disk >= 0 &&
> +					    mddev->delta_disks >= 0 &&
> +					    !test_bit(Journal, &rdev->flags) &&
> +					    !test_bit(Faulty, &rdev->flags) &&
> +					    !test_bit(In_sync, &rdev->flags) &&
> +					    rdev->recovery_offset < mddev->curr_resync)
> +						rdev->recovery_offset = mddev->curr_resync;
> +				rcu_read_unlock();
> +			}
>  		}
>  	}
>   skip:
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index a8df52130f8a..89ad79614309 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5739,6 +5739,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>  	 */
>  	struct r5conf *conf = mddev->private;
>  	struct stripe_head *sh;
> +	struct md_rdev *rdev;
>  	sector_t first_sector, last_sector;
>  	int raid_disks = conf->previous_raid_disks;
>  	int data_disks = raid_disks - conf->max_degraded;
> @@ -5861,6 +5862,15 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>  			return 0;
>  		mddev->reshape_position = conf->reshape_progress;
>  		mddev->curr_resync_completed = sector_nr;
> +		if (!mddev->reshape_backwards)
> +			/* Can update recovery_offset */
> +			rdev_for_each(rdev, mddev)
> +				if (rdev->raid_disk >= 0 &&
> +				    !test_bit(Journal, &rdev->flags) &&
> +				    !test_bit(In_sync, &rdev->flags) &&
> +				    rdev->recovery_offset < sector_nr)
> +					rdev->recovery_offset = sector_nr;
> +
>  		conf->reshape_checkpoint = jiffies;
>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
>  		md_wakeup_thread(mddev->thread);
> @@ -5959,6 +5969,14 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>  			goto ret;
>  		mddev->reshape_position = conf->reshape_progress;
>  		mddev->curr_resync_completed = sector_nr;
> +		if (!mddev->reshape_backwards)
> +			/* Can update recovery_offset */
> +			rdev_for_each(rdev, mddev)
> +				if (rdev->raid_disk >= 0 &&
> +				    !test_bit(Journal, &rdev->flags) &&
> +				    !test_bit(In_sync, &rdev->flags) &&
> +				    rdev->recovery_offset < sector_nr)
> +					rdev->recovery_offset = sector_nr;
>  		conf->reshape_checkpoint = jiffies;
>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
>  		md_wakeup_thread(mddev->thread);
> 
> 
> --

Hi,

This patch has caused a regression in the following IMSM RAID scenario:

mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=4 /dev/nvme[0-3]n1 --run
mdadm --create /dev/md/r10 --level=10 --size=100M --raid-devices=4 /dev/nvme[0-3]n1 --run --assume-clean

MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/r10 --level=0
MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/imsm0 --raid-devices=4

cat /proc/mdstat 
Personalities : [raid10] [raid0] [raid6] [raid5] [raid4] 
md126 : active raid4 nvme0n1[3] nvme2n1[2] nvme1n1[1]
      409600 blocks super external:/md127/0 level 4, 128k chunk, algorithm 0 [5/3] [UU_U_]
      
md127 : inactive nvme3n1[3](S) nvme2n1[2](S) nvme1n1[1](S) nvme0n1[0](S)
      4420 blocks super external:imsm

The array should have been reshaped to a 4-disk RAID0 but it has failed
in the middle of the reshape as RAID4.

The following condition is not true for the above scenario:

> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {

Tomek


* Re: [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset
  2017-10-31 14:42   ` Tomasz Majchrzak
@ 2017-11-01  1:13     ` NeilBrown
  2017-11-02  8:40       ` Tomasz Majchrzak
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2017-11-01  1:13 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: Shaohua Li, linux-raid


On Tue, Oct 31 2017, Tomasz Majchrzak wrote:

> On Tue, Oct 17, 2017 at 04:18:36PM +1100, NeilBrown wrote:
>> The ->recovery_offset shows how much of a non-InSync device is actually
>> in sync - how much has been recovered.
>> 
>> When performing a recovery, ->curr_resync and ->curr_resync_completed
>> follow the device address being recovered and so can be used to update
>> ->recovery_offset.
>> 
>> When performing a reshape, ->curr_resync* might follow the device
>> addresses (raid5) or might follow array addresses (raid10), so cannot
>> in general be used to set ->recovery_offset.  When reshaping backwards,
>> ->curr_resync* measures from the *end* of the array-or-device, so is
>> particularly unhelpful.
>> 
>> So change the common code in md.c to only use ->curr_resync_completed
>> for the simple recovery case, and add code to raid5.c to update
>> ->recovery_offset during a forwards reshape.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/md/md.c    |   32 +++++++++++++++++++++-----------
>>  drivers/md/raid5.c |   18 ++++++++++++++++++
>>  2 files changed, 39 insertions(+), 11 deletions(-)
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index a9f1352b3849..d1dfc9879368 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -2453,10 +2453,17 @@ void md_update_sb(struct mddev *mddev, int force_change)
>>  		}
>>  	}
>>  
>> -	/* First make sure individual recovery_offsets are correct */
>> +	/* First make sure individual recovery_offsets are correct
>> +	 * curr_resync_completed can only be used during recovery.
>> +	 * During reshape/resync it might use array-addresses rather
>> +	 * that device addresses.
>> +	 */
>>  	rdev_for_each(rdev, mddev) {
>>  		if (rdev->raid_disk >= 0 &&
>>  		    mddev->delta_disks >= 0 &&
>> +		    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>> +		    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
>> +		    !test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>>  		    !test_bit(Journal, &rdev->flags) &&
>>  		    !test_bit(In_sync, &rdev->flags) &&
>>  		    mddev->curr_resync_completed > rdev->recovery_offset)
>> @@ -8507,16 +8514,19 @@ void md_do_sync(struct md_thread *thread)
>>  		} else {
>>  			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
>>  				mddev->curr_resync = MaxSector;
>> -			rcu_read_lock();
>> -			rdev_for_each_rcu(rdev, mddev)
>> -				if (rdev->raid_disk >= 0 &&
>> -				    mddev->delta_disks >= 0 &&
>> -				    !test_bit(Journal, &rdev->flags) &&
>> -				    !test_bit(Faulty, &rdev->flags) &&
>> -				    !test_bit(In_sync, &rdev->flags) &&
>> -				    rdev->recovery_offset < mddev->curr_resync)
>> -					rdev->recovery_offset = mddev->curr_resync;
>> -			rcu_read_unlock();
>> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
>> +				rcu_read_lock();
>> +				rdev_for_each_rcu(rdev, mddev)
>> +					if (rdev->raid_disk >= 0 &&
>> +					    mddev->delta_disks >= 0 &&
>> +					    !test_bit(Journal, &rdev->flags) &&
>> +					    !test_bit(Faulty, &rdev->flags) &&
>> +					    !test_bit(In_sync, &rdev->flags) &&
>> +					    rdev->recovery_offset < mddev->curr_resync)
>> +						rdev->recovery_offset = mddev->curr_resync;
>> +				rcu_read_unlock();
>> +			}
>>  		}
>>  	}
>>   skip:
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index a8df52130f8a..89ad79614309 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -5739,6 +5739,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>>  	 */
>>  	struct r5conf *conf = mddev->private;
>>  	struct stripe_head *sh;
>> +	struct md_rdev *rdev;
>>  	sector_t first_sector, last_sector;
>>  	int raid_disks = conf->previous_raid_disks;
>>  	int data_disks = raid_disks - conf->max_degraded;
>> @@ -5861,6 +5862,15 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>>  			return 0;
>>  		mddev->reshape_position = conf->reshape_progress;
>>  		mddev->curr_resync_completed = sector_nr;
>> +		if (!mddev->reshape_backwards)
>> +			/* Can update recovery_offset */
>> +			rdev_for_each(rdev, mddev)
>> +				if (rdev->raid_disk >= 0 &&
>> +				    !test_bit(Journal, &rdev->flags) &&
>> +				    !test_bit(In_sync, &rdev->flags) &&
>> +				    rdev->recovery_offset < sector_nr)
>> +					rdev->recovery_offset = sector_nr;
>> +
>>  		conf->reshape_checkpoint = jiffies;
>>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
>>  		md_wakeup_thread(mddev->thread);
>> @@ -5959,6 +5969,14 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>>  			goto ret;
>>  		mddev->reshape_position = conf->reshape_progress;
>>  		mddev->curr_resync_completed = sector_nr;
>> +		if (!mddev->reshape_backwards)
>> +			/* Can update recovery_offset */
>> +			rdev_for_each(rdev, mddev)
>> +				if (rdev->raid_disk >= 0 &&
>> +				    !test_bit(Journal, &rdev->flags) &&
>> +				    !test_bit(In_sync, &rdev->flags) &&
>> +				    rdev->recovery_offset < sector_nr)
>> +					rdev->recovery_offset = sector_nr;
>>  		conf->reshape_checkpoint = jiffies;
>>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
>>  		md_wakeup_thread(mddev->thread);
>> 
>> 
>> --
>
> Hi,
>
> This patch has caused a regression in following IMSM RAID scenario:

That's unfortunate.

>
> mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=4 /dev/nvme[0-3]n1 --run
> mdadm --create /dev/md/r10 --level=10 --size=100M --raid-devices=4 /dev/nvme[0-3]n1 --run --assume-clean
>
> MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/r10 --level=0
> MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/imsm0 --raid-devices=4
>
> cat /proc/mdstat 
> Personalities : [raid10] [raid0] [raid6] [raid5] [raid4] 
> md126 : active raid4 nvme0n1[3] nvme2n1[2] nvme1n1[1]
>       409600 blocks super external:/md127/0 level 4, 128k chunk, algorithm 0 [5/3] [UU_U_]
>       
> md127 : inactive nvme3n1[3](S) nvme2n1[2](S) nvme1n1[1](S) nvme0n1[0](S)
>       4420 blocks super external:imsm

Why isn't nvme3n1 in md126?  Did it not get added, or did it get removed
for some reason?
If it wasn't added, raid5_start_reshape() would have failed, so I guess
it must have been removed??


>
> The array should have been reshaped to 4-disk RAID0 but it has failed in
> the middle of reshape as RAID4.
>
> Following condition is not true for above scenario:
>
>> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {

No, it isn't meant to be.  This code doesn't work reliably in the
reshape case. When reshaping raid5 (or raid4), reshape_request updates
rdev->recovery_offset. ... maybe end_reshape needs to update it too.

If you add this patch, does it help?

Thanks,
NeilBrown

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 89ad79614309..a99ae1be9175 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7964,6 +7964,7 @@ static void end_reshape(struct r5conf *conf)
 {
 
 	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
+		struct md_rdev *rdev;
 
 		spin_lock_irq(&conf->device_lock);
 		conf->previous_raid_disks = conf->raid_disks;
@@ -7971,6 +7972,12 @@ static void end_reshape(struct r5conf *conf)
 		smp_wmb();
 		conf->reshape_progress = MaxSector;
 		conf->mddev->reshape_position = MaxSector;
+		rdev_for_each(rdev, conf->mddev)
+			if (rdev->raid_disk >= 0 &&
+			    !test_bit(Journal, &rdev->flags) &&
+			    !test_bit(In_sync, &rdev->flags))
+				rdev->recovery_offset = MaxSector;
+
 		spin_unlock_irq(&conf->device_lock);
 		wake_up(&conf->wait_for_overlap);
 



* Re: [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset
  2017-11-01  1:13     ` NeilBrown
@ 2017-11-02  8:40       ` Tomasz Majchrzak
  2017-11-09  0:08         ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: Tomasz Majchrzak @ 2017-11-02  8:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid

On Wed, Nov 01, 2017 at 12:13:22PM +1100, NeilBrown wrote:
> On Tue, Oct 31 2017, Tomasz Majchrzak wrote:
> 
> > On Tue, Oct 17, 2017 at 04:18:36PM +1100, NeilBrown wrote:
> >> The ->recovery_offset shows how much of a non-InSync device is actually
> >> in sync - how much has been recovered.
> >> 
> >> When performing a recovery, ->curr_resync and ->curr_resync_completed
> >> follow the device address being recovered and so can be used to update
> >> ->recovery_offset.
> >> 
> >> When performing a reshape, ->curr_resync* might follow the device
> >> addresses (raid5) or might follow array addresses (raid10), so cannot
> >> in general be used to set ->recovery_offset.  When reshaping backwards,
> >> ->curr_resync* measures from the *end* of the array-or-device, so is
> >> particularly unhelpful.
> >> 
> >> So change the common code in md.c to only use ->curr_resync_completed
> >> for the simple recovery case, and add code to raid5.c to update
> >> ->recovery_offset during a forwards reshape.
> >> 
> >> Signed-off-by: NeilBrown <neilb@suse.com>
> >> ---
> >>  drivers/md/md.c    |   32 +++++++++++++++++++++-----------
> >>  drivers/md/raid5.c |   18 ++++++++++++++++++
> >>  2 files changed, 39 insertions(+), 11 deletions(-)
> >> 
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index a9f1352b3849..d1dfc9879368 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -2453,10 +2453,17 @@ void md_update_sb(struct mddev *mddev, int force_change)
> >>  		}
> >>  	}
> >>  
> >> -	/* First make sure individual recovery_offsets are correct */
> >> +	/* First make sure individual recovery_offsets are correct
> >> +	 * curr_resync_completed can only be used during recovery.
> >> +	 * During reshape/resync it might use array-addresses rather
> >> +	 * that device addresses.
> >> +	 */
> >>  	rdev_for_each(rdev, mddev) {
> >>  		if (rdev->raid_disk >= 0 &&
> >>  		    mddev->delta_disks >= 0 &&
> >> +		    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> >> +		    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
> >> +		    !test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> >>  		    !test_bit(Journal, &rdev->flags) &&
> >>  		    !test_bit(In_sync, &rdev->flags) &&
> >>  		    mddev->curr_resync_completed > rdev->recovery_offset)
> >> @@ -8507,16 +8514,19 @@ void md_do_sync(struct md_thread *thread)
> >>  		} else {
> >>  			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
> >>  				mddev->curr_resync = MaxSector;
> >> -			rcu_read_lock();
> >> -			rdev_for_each_rcu(rdev, mddev)
> >> -				if (rdev->raid_disk >= 0 &&
> >> -				    mddev->delta_disks >= 0 &&
> >> -				    !test_bit(Journal, &rdev->flags) &&
> >> -				    !test_bit(Faulty, &rdev->flags) &&
> >> -				    !test_bit(In_sync, &rdev->flags) &&
> >> -				    rdev->recovery_offset < mddev->curr_resync)
> >> -					rdev->recovery_offset = mddev->curr_resync;
> >> -			rcu_read_unlock();
> >> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> >> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
> >> +				rcu_read_lock();
> >> +				rdev_for_each_rcu(rdev, mddev)
> >> +					if (rdev->raid_disk >= 0 &&
> >> +					    mddev->delta_disks >= 0 &&
> >> +					    !test_bit(Journal, &rdev->flags) &&
> >> +					    !test_bit(Faulty, &rdev->flags) &&
> >> +					    !test_bit(In_sync, &rdev->flags) &&
> >> +					    rdev->recovery_offset < mddev->curr_resync)
> >> +						rdev->recovery_offset = mddev->curr_resync;
> >> +				rcu_read_unlock();
> >> +			}
> >>  		}
> >>  	}
> >>   skip:
> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> index a8df52130f8a..89ad79614309 100644
> >> --- a/drivers/md/raid5.c
> >> +++ b/drivers/md/raid5.c
> >> @@ -5739,6 +5739,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> >>  	 */
> >>  	struct r5conf *conf = mddev->private;
> >>  	struct stripe_head *sh;
> >> +	struct md_rdev *rdev;
> >>  	sector_t first_sector, last_sector;
> >>  	int raid_disks = conf->previous_raid_disks;
> >>  	int data_disks = raid_disks - conf->max_degraded;
> >> @@ -5861,6 +5862,15 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> >>  			return 0;
> >>  		mddev->reshape_position = conf->reshape_progress;
> >>  		mddev->curr_resync_completed = sector_nr;
> >> +		if (!mddev->reshape_backwards)
> >> +			/* Can update recovery_offset */
> >> +			rdev_for_each(rdev, mddev)
> >> +				if (rdev->raid_disk >= 0 &&
> >> +				    !test_bit(Journal, &rdev->flags) &&
> >> +				    !test_bit(In_sync, &rdev->flags) &&
> >> +				    rdev->recovery_offset < sector_nr)
> >> +					rdev->recovery_offset = sector_nr;
> >> +
> >>  		conf->reshape_checkpoint = jiffies;
> >>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> >>  		md_wakeup_thread(mddev->thread);
> >> @@ -5959,6 +5969,14 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> >>  			goto ret;
> >>  		mddev->reshape_position = conf->reshape_progress;
> >>  		mddev->curr_resync_completed = sector_nr;
> >> +		if (!mddev->reshape_backwards)
> >> +			/* Can update recovery_offset */
> >> +			rdev_for_each(rdev, mddev)
> >> +				if (rdev->raid_disk >= 0 &&
> >> +				    !test_bit(Journal, &rdev->flags) &&
> >> +				    !test_bit(In_sync, &rdev->flags) &&
> >> +				    rdev->recovery_offset < sector_nr)
> >> +					rdev->recovery_offset = sector_nr;
> >>  		conf->reshape_checkpoint = jiffies;
> >>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> >>  		md_wakeup_thread(mddev->thread);
> >> 
> >> 
> >> --
> >
> > Hi,
> >
> > This patch has caused a regression in following IMSM RAID scenario:
> 
> That's unfortunate.
> 
> >
> > mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=4 /dev/nvme[0-3]n1 --run
> > mdadm --create /dev/md/r10 --level=10 --size=100M --raid-devices=4 /dev/nvme[0-3]n1 --run --assume-clean
> >
> > MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/r10 --level=0
> > MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/imsm0 --raid-devices=4
> >
> > cat /proc/mdstat 
> > Personalities : [raid10] [raid0] [raid6] [raid5] [raid4] 
> > md126 : active raid4 nvme0n1[3] nvme2n1[2] nvme1n1[1]
> >       409600 blocks super external:/md127/0 level 4, 128k chunk, algorithm 0 [5/3] [UU_U_]
> >       
> > md127 : inactive nvme3n1[3](S) nvme2n1[2](S) nvme1n1[1](S) nvme0n1[0](S)
> >       4420 blocks super external:imsm
> 
> Why isn't nvme3n1 in md126?  Did it not get added, or did it get removed
> for some reason?
> If it wasn't added, raid5_start_reshape() would have failed, so I guess
> it must have been removed??

It got removed. It's in the array during the reshape:

cat /proc/mdstat 
Personalities : [raid10] [raid0] [raid6] [raid5] [raid4] [raid1] 
md126 : active raid4 nvme3n1[4] nvme0n1[3] nvme2n1[2] nvme1n1[1]
      4194304 blocks super external:-md127/0 level 4, 128k chunk, algorithm 5 [5/4] [UU_U_]
      [===>.................]  reshape = 16.6% (350476/2097152) finish=0.1min speed=175238K/sec
      
md127 : inactive nvme3n1[3](S) nvme2n1[2](S) nvme1n1[1](S) nvme0n1[0](S)
      4420 blocks super external:imsm

> 
> >
> > The array should have been reshaped to 4-disk RAID0 but it has failed in
> > the middle of reshape as RAID4.
> >
> > Following condition is not true for above scenario:
> >
> >> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> >> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
> 
> No, it isn't meant to be.  This code doesn't work reliably in the
> reshape case. When reshaping raid5 (or raid4), reshape_request updates
> rdev->recovery_offset. ... maybe end_reshape needs to update it too.
> 
> If you add this patch, does it help?
>
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 89ad79614309..a99ae1be9175 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7964,6 +7964,7 @@ static void end_reshape(struct r5conf *conf)
>  {
>  
>  	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
> +		struct md_rdev *rdev;
>  
>  		spin_lock_irq(&conf->device_lock);
>  		conf->previous_raid_disks = conf->raid_disks;
> @@ -7971,6 +7972,12 @@ static void end_reshape(struct r5conf *conf)
>  		smp_wmb();
>  		conf->reshape_progress = MaxSector;
>  		conf->mddev->reshape_position = MaxSector;
> +		rdev_for_each(rdev, conf->mddev)
> +			if (rdev->raid_disk >= 0 &&
> +			    !test_bit(Journal, &rdev->flags) &&
> +			    !test_bit(In_sync, &rdev->flags))
> +				rdev->recovery_offset = MaxSector;
> +
>  		spin_unlock_irq(&conf->device_lock);
>  		wake_up(&conf->wait_for_overlap);
>  

Yes, this patch fixes the problem. I have also checked that the reshape
position is really stored and that a reshape interrupted in the middle
restarts from the last checkpoint.

Thank you,

Tomek

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset
  2017-11-02  8:40       ` Tomasz Majchrzak
@ 2017-11-09  0:08         ` Shaohua Li
  2017-11-09  7:55           ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2017-11-09  0:08 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: NeilBrown, linux-raid

On Thu, Nov 02, 2017 at 09:40:50AM +0100, Tomasz Majchrzak wrote:
> On Wed, Nov 01, 2017 at 12:13:22PM +1100, NeilBrown wrote:
> > On Tue, Oct 31 2017, Tomasz Majchrzak wrote:
> > 
> > > On Tue, Oct 17, 2017 at 04:18:36PM +1100, NeilBrown wrote:
> > >> The ->recovery_offset shows how much of a non-InSync device is actually
> > >> in sync - how much has been recovered.
> > >> 
> > >> When performing a recovery, ->curr_resync and ->curr_resync_completed
> > >> follow the device address being recovered and so can be used to update
> > >> ->recovery_offset.
> > >> 
> > >> When performing a reshape, ->curr_resync* might follow the device
> > >> addresses (raid5) or might follow array addresses (raid10), so cannot
> > >> in general be used to set ->recovery_offset.  When reshaping backwards,
> > >> ->curr_resync* measures from the *end* of the array-or-device, so is
> > >> particularly unhelpful.
> > >> 
> > >> So change the common code in md.c to only use ->curr_resync_completed
> > >> for the simple recovery case, and add code to raid5.c to update
> > >> ->recovery_offset during a forwards reshape.
> > >> 
> > >> Signed-off-by: NeilBrown <neilb@suse.com>
> > >> ---
> > >>  drivers/md/md.c    |   32 +++++++++++++++++++++-----------
> > >>  drivers/md/raid5.c |   18 ++++++++++++++++++
> > >>  2 files changed, 39 insertions(+), 11 deletions(-)
> > >> 
> > >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> > >> index a9f1352b3849..d1dfc9879368 100644
> > >> --- a/drivers/md/md.c
> > >> +++ b/drivers/md/md.c
> > >> @@ -2453,10 +2453,17 @@ void md_update_sb(struct mddev *mddev, int force_change)
> > >>  		}
> > >>  	}
> > >>  
> > >> -	/* First make sure individual recovery_offsets are correct */
> > >> +	/* First make sure individual recovery_offsets are correct
> > >> +	 * curr_resync_completed can only be used during recovery.
> > >> +	 * During reshape/resync it might use array-addresses rather
> > >> +	 * that device addresses.
> > >> +	 */
> > >>  	rdev_for_each(rdev, mddev) {
> > >>  		if (rdev->raid_disk >= 0 &&
> > >>  		    mddev->delta_disks >= 0 &&
> > >> +		    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> > >> +		    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
> > >> +		    !test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> > >>  		    !test_bit(Journal, &rdev->flags) &&
> > >>  		    !test_bit(In_sync, &rdev->flags) &&
> > >>  		    mddev->curr_resync_completed > rdev->recovery_offset)
> > >> @@ -8507,16 +8514,19 @@ void md_do_sync(struct md_thread *thread)
> > >>  		} else {
> > >>  			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
> > >>  				mddev->curr_resync = MaxSector;
> > >> -			rcu_read_lock();
> > >> -			rdev_for_each_rcu(rdev, mddev)
> > >> -				if (rdev->raid_disk >= 0 &&
> > >> -				    mddev->delta_disks >= 0 &&
> > >> -				    !test_bit(Journal, &rdev->flags) &&
> > >> -				    !test_bit(Faulty, &rdev->flags) &&
> > >> -				    !test_bit(In_sync, &rdev->flags) &&
> > >> -				    rdev->recovery_offset < mddev->curr_resync)
> > >> -					rdev->recovery_offset = mddev->curr_resync;
> > >> -			rcu_read_unlock();
> > >> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> > >> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
> > >> +				rcu_read_lock();
> > >> +				rdev_for_each_rcu(rdev, mddev)
> > >> +					if (rdev->raid_disk >= 0 &&
> > >> +					    mddev->delta_disks >= 0 &&
> > >> +					    !test_bit(Journal, &rdev->flags) &&
> > >> +					    !test_bit(Faulty, &rdev->flags) &&
> > >> +					    !test_bit(In_sync, &rdev->flags) &&
> > >> +					    rdev->recovery_offset < mddev->curr_resync)
> > >> +						rdev->recovery_offset = mddev->curr_resync;
> > >> +				rcu_read_unlock();
> > >> +			}
> > >>  		}
> > >>  	}
> > >>   skip:
> > >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > >> index a8df52130f8a..89ad79614309 100644
> > >> --- a/drivers/md/raid5.c
> > >> +++ b/drivers/md/raid5.c
> > >> @@ -5739,6 +5739,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> > >>  	 */
> > >>  	struct r5conf *conf = mddev->private;
> > >>  	struct stripe_head *sh;
> > >> +	struct md_rdev *rdev;
> > >>  	sector_t first_sector, last_sector;
> > >>  	int raid_disks = conf->previous_raid_disks;
> > >>  	int data_disks = raid_disks - conf->max_degraded;
> > >> @@ -5861,6 +5862,15 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> > >>  			return 0;
> > >>  		mddev->reshape_position = conf->reshape_progress;
> > >>  		mddev->curr_resync_completed = sector_nr;
> > >> +		if (!mddev->reshape_backwards)
> > >> +			/* Can update recovery_offset */
> > >> +			rdev_for_each(rdev, mddev)
> > >> +				if (rdev->raid_disk >= 0 &&
> > >> +				    !test_bit(Journal, &rdev->flags) &&
> > >> +				    !test_bit(In_sync, &rdev->flags) &&
> > >> +				    rdev->recovery_offset < sector_nr)
> > >> +					rdev->recovery_offset = sector_nr;
> > >> +
> > >>  		conf->reshape_checkpoint = jiffies;
> > >>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> > >>  		md_wakeup_thread(mddev->thread);
> > >> @@ -5959,6 +5969,14 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> > >>  			goto ret;
> > >>  		mddev->reshape_position = conf->reshape_progress;
> > >>  		mddev->curr_resync_completed = sector_nr;
> > >> +		if (!mddev->reshape_backwards)
> > >> +			/* Can update recovery_offset */
> > >> +			rdev_for_each(rdev, mddev)
> > >> +				if (rdev->raid_disk >= 0 &&
> > >> +				    !test_bit(Journal, &rdev->flags) &&
> > >> +				    !test_bit(In_sync, &rdev->flags) &&
> > >> +				    rdev->recovery_offset < sector_nr)
> > >> +					rdev->recovery_offset = sector_nr;
> > >>  		conf->reshape_checkpoint = jiffies;
> > >>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> > >>  		md_wakeup_thread(mddev->thread);
> > >> 
> > >> 
> > >> --
> > >
> > > Hi,
> > >
> > > This patch has caused a regression in the following IMSM RAID scenario:
> > 
> > That's unfortunate.
> > 
> > >
> > > mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=4 /dev/nvme[0-3]n1 --run
> > > mdadm --create /dev/md/r10 --level=10 --size=100M --raid-devices=4 /dev/nvme[0-3]n1 --run --assume-clean
> > >
> > > MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/r10 --level=0
> > > MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/imsm0 --raid-devices=4
> > >
> > > cat /proc/mdstat 
> > > Personalities : [raid10] [raid0] [raid6] [raid5] [raid4] 
> > > md126 : active raid4 nvme0n1[3] nvme2n1[2] nvme1n1[1]
> > >       409600 blocks super external:/md127/0 level 4, 128k chunk, algorithm 0 [5/3] [UU_U_]
> > >       
> > > md127 : inactive nvme3n1[3](S) nvme2n1[2](S) nvme1n1[1](S) nvme0n1[0](S)
> > >       4420 blocks super external:imsm
> > 
> > Why isn't nvme3n1 in md126?  Did it not get added, or did it get removed
> > for some reason?
> > If it wasn't added, raid5_start_reshape() would have failed, so I guess
> > it must have been removed??
> 
> It got removed. It's in the array during the reshape:
> 
> cat /proc/mdstat 
> Personalities : [raid10] [raid0] [raid6] [raid5] [raid4] [raid1] 
> md126 : active raid4 nvme3n1[4] nvme0n1[3] nvme2n1[2] nvme1n1[1]
>       4194304 blocks super external:-md127/0 level 4, 128k chunk, algorithm 5 [5/4] [UU_U_]
>       [===>.................]  reshape = 16.6% (350476/2097152) finish=0.1min speed=175238K/sec
>       
> md127 : inactive nvme3n1[3](S) nvme2n1[2](S) nvme1n1[1](S) nvme0n1[0](S)
>       4420 blocks super external:imsm
> 
> > 
> > >
> > > The array should have been reshaped to 4-disk RAID0 but it has failed in
> > > the middle of reshape as RAID4.
> > >
> > > The following condition is not true for the above scenario:
> > >
> > >> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> > >> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
> > 
> > No, it isn't meant to be.  This code doesn't work reliably in the
> > reshape case. When reshaping raid5 (or raid4), reshape_request updates
> > rdev->recovery_offset. ... maybe end_reshape needs to update it too.
> > 
> > If you add this patch, does it help?
> >
> > 
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 89ad79614309..a99ae1be9175 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -7964,6 +7964,7 @@ static void end_reshape(struct r5conf *conf)
> >  {
> >  
> >  	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
> > +		struct md_rdev *rdev;
> >  
> >  		spin_lock_irq(&conf->device_lock);
> >  		conf->previous_raid_disks = conf->raid_disks;
> > @@ -7971,6 +7972,12 @@ static void end_reshape(struct r5conf *conf)
> >  		smp_wmb();
> >  		conf->reshape_progress = MaxSector;
> >  		conf->mddev->reshape_position = MaxSector;
> > +		rdev_for_each(rdev, conf->mddev)
> > +			if (rdev->raid_disk >= 0 &&
> > +			    !test_bit(Journal, &rdev->flags) &&
> > +			    !test_bit(In_sync, &rdev->flags))
> > +				rdev->recovery_offset = MaxSector;
> > +
> >  		spin_unlock_irq(&conf->device_lock);
> >  		wake_up(&conf->wait_for_overlap);
> >  
> 
> Yes, this patch fixes the problem. I have also checked that the reshape
> position is really stored and that a reshape interrupted in the middle
> restarts from the last checkpoint.

Neil,
can you send me a formal patch for this? Or I can just fold this into your
original patch.

Thanks,
Shaohua


* Re: [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset
  2017-11-09  0:08         ` Shaohua Li
@ 2017-11-09  7:55           ` NeilBrown
  2017-11-09 15:31             ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2017-11-09  7:55 UTC (permalink / raw)
  To: Shaohua Li, Tomasz Majchrzak; +Cc: linux-raid


On Wed, Nov 08 2017, Shaohua Li wrote:

> On Thu, Nov 02, 2017 at 09:40:50AM +0100, Tomasz Majchrzak wrote:
>> On Wed, Nov 01, 2017 at 12:13:22PM +1100, NeilBrown wrote:
>> > On Tue, Oct 31 2017, Tomasz Majchrzak wrote:
>> > 
>> > > On Tue, Oct 17, 2017 at 04:18:36PM +1100, NeilBrown wrote:
>> > >> The ->recovery_offset shows how much of a non-InSync device is actually
>> > >> in sync - how much has been recovered.
>> > >> 
>> > >> When performing a recovery, ->curr_resync and ->curr_resync_completed
>> > >> follow the device address being recovered and so can be used to update
>> > >> ->recovery_offset.
>> > >> 
>> > >> When performing a reshape, ->curr_resync* might follow the device
>> > >> addresses (raid5) or might follow array addresses (raid10), so cannot
>> > >> in general be used to set ->recovery_offset.  When reshaping backwards,
>> > >> ->curr_resync* measures from the *end* of the array-or-device, so is
>> > >> particularly unhelpful.
>> > >> 
>> > >> So change the common code in md.c to only use ->curr_resync_completed
>> > >> for the simple recovery case, and add code to raid5.c to update
>> > >> ->recovery_offset during a forwards reshape.
>> > >> 
>> > >> Signed-off-by: NeilBrown <neilb@suse.com>
>> > >> ---
>> > >>  drivers/md/md.c    |   32 +++++++++++++++++++++-----------
>> > >>  drivers/md/raid5.c |   18 ++++++++++++++++++
>> > >>  2 files changed, 39 insertions(+), 11 deletions(-)
>> > >> 
>> > >> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > >> index a9f1352b3849..d1dfc9879368 100644
>> > >> --- a/drivers/md/md.c
>> > >> +++ b/drivers/md/md.c
>> > >> @@ -2453,10 +2453,17 @@ void md_update_sb(struct mddev *mddev, int force_change)
>> > >>  		}
>> > >>  	}
>> > >>  
>> > >> -	/* First make sure individual recovery_offsets are correct */
>> > >> +	/* First make sure individual recovery_offsets are correct
>> > >> +	 * curr_resync_completed can only be used during recovery.
>> > >> +	 * During reshape/resync it might use array-addresses rather
>> > >> +	 * that device addresses.
>> > >> +	 */
>> > >>  	rdev_for_each(rdev, mddev) {
>> > >>  		if (rdev->raid_disk >= 0 &&
>> > >>  		    mddev->delta_disks >= 0 &&
>> > >> +		    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>> > >> +		    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
>> > >> +		    !test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>> > >>  		    !test_bit(Journal, &rdev->flags) &&
>> > >>  		    !test_bit(In_sync, &rdev->flags) &&
>> > >>  		    mddev->curr_resync_completed > rdev->recovery_offset)
>> > >> @@ -8507,16 +8514,19 @@ void md_do_sync(struct md_thread *thread)
>> > >>  		} else {
>> > >>  			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
>> > >>  				mddev->curr_resync = MaxSector;
>> > >> -			rcu_read_lock();
>> > >> -			rdev_for_each_rcu(rdev, mddev)
>> > >> -				if (rdev->raid_disk >= 0 &&
>> > >> -				    mddev->delta_disks >= 0 &&
>> > >> -				    !test_bit(Journal, &rdev->flags) &&
>> > >> -				    !test_bit(Faulty, &rdev->flags) &&
>> > >> -				    !test_bit(In_sync, &rdev->flags) &&
>> > >> -				    rdev->recovery_offset < mddev->curr_resync)
>> > >> -					rdev->recovery_offset = mddev->curr_resync;
>> > >> -			rcu_read_unlock();
>> > >> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>> > >> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
>> > >> +				rcu_read_lock();
>> > >> +				rdev_for_each_rcu(rdev, mddev)
>> > >> +					if (rdev->raid_disk >= 0 &&
>> > >> +					    mddev->delta_disks >= 0 &&
>> > >> +					    !test_bit(Journal, &rdev->flags) &&
>> > >> +					    !test_bit(Faulty, &rdev->flags) &&
>> > >> +					    !test_bit(In_sync, &rdev->flags) &&
>> > >> +					    rdev->recovery_offset < mddev->curr_resync)
>> > >> +						rdev->recovery_offset = mddev->curr_resync;
>> > >> +				rcu_read_unlock();
>> > >> +			}
>> > >>  		}
>> > >>  	}
>> > >>   skip:
>> > >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> > >> index a8df52130f8a..89ad79614309 100644
>> > >> --- a/drivers/md/raid5.c
>> > >> +++ b/drivers/md/raid5.c
>> > >> @@ -5739,6 +5739,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>> > >>  	 */
>> > >>  	struct r5conf *conf = mddev->private;
>> > >>  	struct stripe_head *sh;
>> > >> +	struct md_rdev *rdev;
>> > >>  	sector_t first_sector, last_sector;
>> > >>  	int raid_disks = conf->previous_raid_disks;
>> > >>  	int data_disks = raid_disks - conf->max_degraded;
>> > >> @@ -5861,6 +5862,15 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>> > >>  			return 0;
>> > >>  		mddev->reshape_position = conf->reshape_progress;
>> > >>  		mddev->curr_resync_completed = sector_nr;
>> > >> +		if (!mddev->reshape_backwards)
>> > >> +			/* Can update recovery_offset */
>> > >> +			rdev_for_each(rdev, mddev)
>> > >> +				if (rdev->raid_disk >= 0 &&
>> > >> +				    !test_bit(Journal, &rdev->flags) &&
>> > >> +				    !test_bit(In_sync, &rdev->flags) &&
>> > >> +				    rdev->recovery_offset < sector_nr)
>> > >> +					rdev->recovery_offset = sector_nr;
>> > >> +
>> > >>  		conf->reshape_checkpoint = jiffies;
>> > >>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
>> > >>  		md_wakeup_thread(mddev->thread);
>> > >> @@ -5959,6 +5969,14 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>> > >>  			goto ret;
>> > >>  		mddev->reshape_position = conf->reshape_progress;
>> > >>  		mddev->curr_resync_completed = sector_nr;
>> > >> +		if (!mddev->reshape_backwards)
>> > >> +			/* Can update recovery_offset */
>> > >> +			rdev_for_each(rdev, mddev)
>> > >> +				if (rdev->raid_disk >= 0 &&
>> > >> +				    !test_bit(Journal, &rdev->flags) &&
>> > >> +				    !test_bit(In_sync, &rdev->flags) &&
>> > >> +				    rdev->recovery_offset < sector_nr)
>> > >> +					rdev->recovery_offset = sector_nr;
>> > >>  		conf->reshape_checkpoint = jiffies;
>> > >>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
>> > >>  		md_wakeup_thread(mddev->thread);
>> > >> 
>> > >> 
>> > >> --
>> > >
>> > > Hi,
>> > >
>> > > This patch has caused a regression in the following IMSM RAID scenario:
>> > 
>> > That's unfortunate.
>> > 
>> > >
>> > > mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=4 /dev/nvme[0-3]n1 --run
>> > > mdadm --create /dev/md/r10 --level=10 --size=100M --raid-devices=4 /dev/nvme[0-3]n1 --run --assume-clean
>> > >
>> > > MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/r10 --level=0
>> > > MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/imsm0 --raid-devices=4
>> > >
>> > > cat /proc/mdstat 
>> > > Personalities : [raid10] [raid0] [raid6] [raid5] [raid4] 
>> > > md126 : active raid4 nvme0n1[3] nvme2n1[2] nvme1n1[1]
>> > >       409600 blocks super external:/md127/0 level 4, 128k chunk, algorithm 0 [5/3] [UU_U_]
>> > >       
>> > > md127 : inactive nvme3n1[3](S) nvme2n1[2](S) nvme1n1[1](S) nvme0n1[0](S)
>> > >       4420 blocks super external:imsm
>> > 
>> > Why isn't nvme3n1 in md126?  Did it not get added, or did it get removed
>> > for some reason?
>> > If it wasn't added, raid5_start_reshape() would have failed, so I guess
>> > it must have been removed??
>> 
>> It got removed. It's in the array during the reshape:
>> 
>> cat /proc/mdstat 
>> Personalities : [raid10] [raid0] [raid6] [raid5] [raid4] [raid1] 
>> md126 : active raid4 nvme3n1[4] nvme0n1[3] nvme2n1[2] nvme1n1[1]
>>       4194304 blocks super external:-md127/0 level 4, 128k chunk, algorithm 5 [5/4] [UU_U_]
>>       [===>.................]  reshape = 16.6% (350476/2097152) finish=0.1min speed=175238K/sec
>>       
>> md127 : inactive nvme3n1[3](S) nvme2n1[2](S) nvme1n1[1](S) nvme0n1[0](S)
>>       4420 blocks super external:imsm
>> 
>> > 
>> > >
>> > > The array should have been reshaped to 4-disk RAID0 but it has failed in
>> > > the middle of reshape as RAID4.
>> > >
>> > > The following condition is not true for the above scenario:
>> > >
>> > >> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>> > >> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
>> > 
>> > No, it isn't meant to be.  This code doesn't work reliably in the
>> > reshape case. When reshaping raid5 (or raid4), reshape_request updates
>> > rdev->recovery_offset. ... maybe end_reshape needs to update it too.
>> > 
>> > If you add this patch, does it help?
>> >
>> > 
>> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> > index 89ad79614309..a99ae1be9175 100644
>> > --- a/drivers/md/raid5.c
>> > +++ b/drivers/md/raid5.c
>> > @@ -7964,6 +7964,7 @@ static void end_reshape(struct r5conf *conf)
>> >  {
>> >  
>> >  	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
>> > +		struct md_rdev *rdev;
>> >  
>> >  		spin_lock_irq(&conf->device_lock);
>> >  		conf->previous_raid_disks = conf->raid_disks;
>> > @@ -7971,6 +7972,12 @@ static void end_reshape(struct r5conf *conf)
>> >  		smp_wmb();
>> >  		conf->reshape_progress = MaxSector;
>> >  		conf->mddev->reshape_position = MaxSector;
>> > +		rdev_for_each(rdev, conf->mddev)
>> > +			if (rdev->raid_disk >= 0 &&
>> > +			    !test_bit(Journal, &rdev->flags) &&
>> > +			    !test_bit(In_sync, &rdev->flags))
>> > +				rdev->recovery_offset = MaxSector;
>> > +
>> >  		spin_unlock_irq(&conf->device_lock);
>> >  		wake_up(&conf->wait_for_overlap);
>> >  
>> 
>> Yes, this patch fixes the problem. I have also checked that the reshape
>> position is really stored and that a reshape interrupted in the middle
>> restarts from the last checkpoint.
>
> Neil,
> can you send me a formal patch for this? Or I can just fold this into your
> original patch.

If you can fold it in, that would be great.  Otherwise it may be a while
before I get to it.

Thanks,
NeilBrown



* Re: [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset
  2017-11-09  7:55           ` NeilBrown
@ 2017-11-09 15:31             ` Shaohua Li
  0 siblings, 0 replies; 10+ messages in thread
From: Shaohua Li @ 2017-11-09 15:31 UTC (permalink / raw)
  To: NeilBrown; +Cc: Tomasz Majchrzak, linux-raid

On Thu, Nov 09, 2017 at 06:55:57PM +1100, Neil Brown wrote:
> On Wed, Nov 08 2017, Shaohua Li wrote:
> 
> > On Thu, Nov 02, 2017 at 09:40:50AM +0100, Tomasz Majchrzak wrote:
> >> On Wed, Nov 01, 2017 at 12:13:22PM +1100, NeilBrown wrote:
> >> > On Tue, Oct 31 2017, Tomasz Majchrzak wrote:
> >> > 
> >> > > On Tue, Oct 17, 2017 at 04:18:36PM +1100, NeilBrown wrote:
> >> > >> The ->recovery_offset shows how much of a non-InSync device is actually
> >> > >> in sync - how much has been recovered.
> >> > >> 
> >> > >> When performing a recovery, ->curr_resync and ->curr_resync_completed
> >> > >> follow the device address being recovered and so can be used to update
> >> > >> ->recovery_offset.
> >> > >> 
> >> > >> When performing a reshape, ->curr_resync* might follow the device
> >> > >> addresses (raid5) or might follow array addresses (raid10), so cannot
> >> > >> in general be used to set ->recovery_offset.  When reshaping backwards,
> >> > >> ->curr_resync* measures from the *end* of the array-or-device, so is
> >> > >> particularly unhelpful.
> >> > >> 
> >> > >> So change the common code in md.c to only use ->curr_resync_completed
> >> > >> for the simple recovery case, and add code to raid5.c to update
> >> > >> ->recovery_offset during a forwards reshape.
> >> > >> 
> >> > >> Signed-off-by: NeilBrown <neilb@suse.com>
> >> > >> ---
> >> > >>  drivers/md/md.c    |   32 +++++++++++++++++++++-----------
> >> > >>  drivers/md/raid5.c |   18 ++++++++++++++++++
> >> > >>  2 files changed, 39 insertions(+), 11 deletions(-)
> >> > >> 
> >> > >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> > >> index a9f1352b3849..d1dfc9879368 100644
> >> > >> --- a/drivers/md/md.c
> >> > >> +++ b/drivers/md/md.c
> >> > >> @@ -2453,10 +2453,17 @@ void md_update_sb(struct mddev *mddev, int force_change)
> >> > >>  		}
> >> > >>  	}
> >> > >>  
> >> > >> -	/* First make sure individual recovery_offsets are correct */
> >> > >> +	/* First make sure individual recovery_offsets are correct
> >> > >> +	 * curr_resync_completed can only be used during recovery.
> >> > >> +	 * During reshape/resync it might use array-addresses rather
> >> > >> +	 * that device addresses.
> >> > >> +	 */
> >> > >>  	rdev_for_each(rdev, mddev) {
> >> > >>  		if (rdev->raid_disk >= 0 &&
> >> > >>  		    mddev->delta_disks >= 0 &&
> >> > >> +		    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> >> > >> +		    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
> >> > >> +		    !test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> >> > >>  		    !test_bit(Journal, &rdev->flags) &&
> >> > >>  		    !test_bit(In_sync, &rdev->flags) &&
> >> > >>  		    mddev->curr_resync_completed > rdev->recovery_offset)
> >> > >> @@ -8507,16 +8514,19 @@ void md_do_sync(struct md_thread *thread)
> >> > >>  		} else {
> >> > >>  			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
> >> > >>  				mddev->curr_resync = MaxSector;
> >> > >> -			rcu_read_lock();
> >> > >> -			rdev_for_each_rcu(rdev, mddev)
> >> > >> -				if (rdev->raid_disk >= 0 &&
> >> > >> -				    mddev->delta_disks >= 0 &&
> >> > >> -				    !test_bit(Journal, &rdev->flags) &&
> >> > >> -				    !test_bit(Faulty, &rdev->flags) &&
> >> > >> -				    !test_bit(In_sync, &rdev->flags) &&
> >> > >> -				    rdev->recovery_offset < mddev->curr_resync)
> >> > >> -					rdev->recovery_offset = mddev->curr_resync;
> >> > >> -			rcu_read_unlock();
> >> > >> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> >> > >> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
> >> > >> +				rcu_read_lock();
> >> > >> +				rdev_for_each_rcu(rdev, mddev)
> >> > >> +					if (rdev->raid_disk >= 0 &&
> >> > >> +					    mddev->delta_disks >= 0 &&
> >> > >> +					    !test_bit(Journal, &rdev->flags) &&
> >> > >> +					    !test_bit(Faulty, &rdev->flags) &&
> >> > >> +					    !test_bit(In_sync, &rdev->flags) &&
> >> > >> +					    rdev->recovery_offset < mddev->curr_resync)
> >> > >> +						rdev->recovery_offset = mddev->curr_resync;
> >> > >> +				rcu_read_unlock();
> >> > >> +			}
> >> > >>  		}
> >> > >>  	}
> >> > >>   skip:
> >> > >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> > >> index a8df52130f8a..89ad79614309 100644
> >> > >> --- a/drivers/md/raid5.c
> >> > >> +++ b/drivers/md/raid5.c
> >> > >> @@ -5739,6 +5739,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> >> > >>  	 */
> >> > >>  	struct r5conf *conf = mddev->private;
> >> > >>  	struct stripe_head *sh;
> >> > >> +	struct md_rdev *rdev;
> >> > >>  	sector_t first_sector, last_sector;
> >> > >>  	int raid_disks = conf->previous_raid_disks;
> >> > >>  	int data_disks = raid_disks - conf->max_degraded;
> >> > >> @@ -5861,6 +5862,15 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> >> > >>  			return 0;
> >> > >>  		mddev->reshape_position = conf->reshape_progress;
> >> > >>  		mddev->curr_resync_completed = sector_nr;
> >> > >> +		if (!mddev->reshape_backwards)
> >> > >> +			/* Can update recovery_offset */
> >> > >> +			rdev_for_each(rdev, mddev)
> >> > >> +				if (rdev->raid_disk >= 0 &&
> >> > >> +				    !test_bit(Journal, &rdev->flags) &&
> >> > >> +				    !test_bit(In_sync, &rdev->flags) &&
> >> > >> +				    rdev->recovery_offset < sector_nr)
> >> > >> +					rdev->recovery_offset = sector_nr;
> >> > >> +
> >> > >>  		conf->reshape_checkpoint = jiffies;
> >> > >>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> >> > >>  		md_wakeup_thread(mddev->thread);
> >> > >> @@ -5959,6 +5969,14 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> >> > >>  			goto ret;
> >> > >>  		mddev->reshape_position = conf->reshape_progress;
> >> > >>  		mddev->curr_resync_completed = sector_nr;
> >> > >> +		if (!mddev->reshape_backwards)
> >> > >> +			/* Can update recovery_offset */
> >> > >> +			rdev_for_each(rdev, mddev)
> >> > >> +				if (rdev->raid_disk >= 0 &&
> >> > >> +				    !test_bit(Journal, &rdev->flags) &&
> >> > >> +				    !test_bit(In_sync, &rdev->flags) &&
> >> > >> +				    rdev->recovery_offset < sector_nr)
> >> > >> +					rdev->recovery_offset = sector_nr;
> >> > >>  		conf->reshape_checkpoint = jiffies;
> >> > >>  		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> >> > >>  		md_wakeup_thread(mddev->thread);
> >> > >> 
> >> > >> 
> >> > >> --
> >> > >
> >> > > Hi,
> >> > >
> >> > > This patch has caused a regression in the following IMSM RAID scenario:
> >> > 
> >> > That's unfortunate.
> >> > 
> >> > >
> >> > > mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=4 /dev/nvme[0-3]n1 --run
> >> > > mdadm --create /dev/md/r10 --level=10 --size=100M --raid-devices=4 /dev/nvme[0-3]n1 --run --assume-clean
> >> > >
> >> > > MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/r10 --level=0
> >> > > MDADM_EXPERIMENTAL=1 mdadm --grow /dev/md/imsm0 --raid-devices=4
> >> > >
> >> > > cat /proc/mdstat 
> >> > > Personalities : [raid10] [raid0] [raid6] [raid5] [raid4] 
> >> > > md126 : active raid4 nvme0n1[3] nvme2n1[2] nvme1n1[1]
> >> > >       409600 blocks super external:/md127/0 level 4, 128k chunk, algorithm 0 [5/3] [UU_U_]
> >> > >       
> >> > > md127 : inactive nvme3n1[3](S) nvme2n1[2](S) nvme1n1[1](S) nvme0n1[0](S)
> >> > >       4420 blocks super external:imsm
> >> > 
> >> > Why isn't nvme3n1 in md126?  Did it not get added, or did it get removed
> >> > for some reason?
> >> > If it wasn't added, raid5_start_reshape() would have failed, so I guess
> >> > it must have been removed??
> >> 
> >> It got removed. It's in the array during the reshape:
> >> 
> >> cat /proc/mdstat 
> >> Personalities : [raid10] [raid0] [raid6] [raid5] [raid4] [raid1] 
> >> md126 : active raid4 nvme3n1[4] nvme0n1[3] nvme2n1[2] nvme1n1[1]
> >>       4194304 blocks super external:-md127/0 level 4, 128k chunk, algorithm 5 [5/4] [UU_U_]
> >>       [===>.................]  reshape = 16.6% (350476/2097152) finish=0.1min speed=175238K/sec
> >>       
> >> md127 : inactive nvme3n1[3](S) nvme2n1[2](S) nvme1n1[1](S) nvme0n1[0](S)
> >>       4420 blocks super external:imsm
> >> 
> >> > 
> >> > >
> >> > > The array should have been reshaped to 4-disk RAID0 but it has failed in
> >> > > the middle of reshape as RAID4.
> >> > >
> >> > > The following condition is not true for the above scenario:
> >> > >
> >> > >> +			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> >> > >> +			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
> >> > 
> >> > No, it isn't meant to be.  This code doesn't work reliably in the
> >> > reshape case. When reshaping raid5 (or raid4), reshape_request updates
> >> > rdev->recovery_offset. ... maybe end_reshape needs to update it too.
> >> > 
> >> > If you add this patch, does it help?
> >> >
> >> > 
> >> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> > index 89ad79614309..a99ae1be9175 100644
> >> > --- a/drivers/md/raid5.c
> >> > +++ b/drivers/md/raid5.c
> >> > @@ -7964,6 +7964,7 @@ static void end_reshape(struct r5conf *conf)
> >> >  {
> >> >  
> >> >  	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
> >> > +		struct md_rdev *rdev;
> >> >  
> >> >  		spin_lock_irq(&conf->device_lock);
> >> >  		conf->previous_raid_disks = conf->raid_disks;
> >> > @@ -7971,6 +7972,12 @@ static void end_reshape(struct r5conf *conf)
> >> >  		smp_wmb();
> >> >  		conf->reshape_progress = MaxSector;
> >> >  		conf->mddev->reshape_position = MaxSector;
> >> > +		rdev_for_each(rdev, conf->mddev)
> >> > +			if (rdev->raid_disk >= 0 &&
> >> > +			    !test_bit(Journal, &rdev->flags) &&
> >> > +			    !test_bit(In_sync, &rdev->flags))
> >> > +				rdev->recovery_offset = MaxSector;
> >> > +
> >> >  		spin_unlock_irq(&conf->device_lock);
> >> >  		wake_up(&conf->wait_for_overlap);
> >> >  
> >> 
> >> Yes, this patch fixes the problem. I have also checked that the reshape
> >> position is really stored and that a reshape interrupted in the middle
> >> restarts from the last checkpoint.
> >
> > Neil,
> > can you send me a formal patch for this? Or I can just fold this into your
> > original patch.
> 
> If you can fold it in, that would be great.  Otherwise it may be a while
> before I get to it.

Ok, done.


end of thread, other threads:[~2017-11-09 15:31 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-17  5:18 [md PATCH 0/2] Fix problems with reshape on degraded raid5/6 NeilBrown
2017-10-17  5:18 ` [md PATCH 1/2] raid5: Set R5_Expanded on parity devices as well as data NeilBrown
2017-10-17  5:18 ` [md PATCH 2/2] md: be cautious about using ->curr_resync_completed for ->recovery_offset NeilBrown
2017-10-31 14:42   ` Tomasz Majchrzak
2017-11-01  1:13     ` NeilBrown
2017-11-02  8:40       ` Tomasz Majchrzak
2017-11-09  0:08         ` Shaohua Li
2017-11-09  7:55           ` NeilBrown
2017-11-09 15:31             ` Shaohua Li
2017-10-19  3:06 ` [md PATCH 0/2] Fix problems with reshape on degraded raid5/6 Shaohua Li
