linux-raid.vger.kernel.org archive mirror
* [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
@ 2020-11-05 13:11 Zhao Heming
  2020-11-05 13:11 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
  2020-11-07  0:17 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Song Liu
  0 siblings, 2 replies; 8+ messages in thread
From: Zhao Heming @ 2020-11-05 13:11 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang
  Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli

Test script (reproducible steps):
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
mdadm --zero-superblock /dev/sd{g,h,i}
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
 #mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

sdg/sdh/sdi are 1GB iSCSI LUNs. The larger the shared disks, the more likely
the issue is to trigger (a longer resync leaves a wider race window).

There is a workaround: adding "mdadm --wait" before the second --grow (the
commented-out line in the script above) makes this bug disappear.

Running the script can end with several different results
(output of "mdadm -D /dev/md0"):
<case 1>: normal (expected) result.
```
    Number   Major   Minor   RaidDevice State
       1       8      112        0      active sync   /dev/sdh
       2       8      128        1      active sync   /dev/sdi
```

<case 2>: the "--fail"ed device is still present in the on-disk metadata.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi

       0       8       96        -      faulty   /dev/sdg
```

<case 3>: the "--remove"d slot is still present in the on-disk metadata.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi
```

Root cause:
In an md-cluster environment, there is no guarantee that the resync triggered
by a reshape (--grow) runs on the node that issued the command. Any node in
the cluster can start the resync, including when another node issued the
--grow; md-cluster only uses resync_lockres to ensure that a single node does
the resync job at a time.

The key code (with this patch applied) is:
```
     if (mddev->sync_thread ||
         test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+        test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
         mddev->reshape_position != MaxSector)
         return -EBUSY;
```
Without the MD_RESYNCING_REMOTE test, this 'if' block only covers local
recovery/resync events.
In this bug the resync runs on another node (call it node2). The initiator
(call it node1) issues the "--grow" command, which calls raid1_reshape and
returns successfully; note that node1 itself does not perform the resync.
On node2, which does perform the resync, handling the METADATA_UPDATED
message sent by node1 goes through the following code path:
```
process_metadata_update
 md_reload_sb
  check_sb_changes
   update_raid_disks
```
update_raid_disks() returns -EBUSY (node2 is resyncing), but check_sb_changes()
ignores the return value, so node2 never carries out the reshape. In the end
node2 updates the on-disk metadata with stale data (e.g. rdev->raid_disks).
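
For reference, the pre-patch call site in check_sb_changes() (visible as the
removed lines in the diff below) drops the return value, which is why the
failure on node2 is completely silent:
```
	/* pre-patch: an -EBUSY from update_raid_disks() is silently ignored */
	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
```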

How to fix:
The simple and clear solution is to block the reshape on the initiator side:
when a node executing "--grow" detects an ongoing resync (local or remote),
it should immediately return an error to user space.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+		test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-05 13:11 [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
@ 2020-11-05 13:11 ` Zhao Heming
  2020-11-07  0:17 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Song Liu
  1 sibling, 0 replies; 8+ messages in thread
From: Zhao Heming @ 2020-11-05 13:11 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang
  Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli

The hang is easy to trigger with the following script.

Test script (reproducible steps):
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
mdadm --zero-superblock /dev/sd{g,h,i}
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

sdg/sdh/sdi are 1GB iSCSI LUNs. The larger the disks, the more likely the
issue is to trigger (more resync time, more opportunities).

Hang info:
```
node1 # ps axj | grep mdadm
1  5423  5227  2231 ?    -1 D   0   0:00 mdadm /dev/md0 --remove /dev/sdg

node1 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdh[1] sdg[0](F)
      1046528 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (1/1046528)
finish=354.0min speed=47K/sec
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

node2 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdg[0](F) sdh[1]
      1046528 blocks super 1.2 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>
```

Analysis (timing order):

     node A                          node  B
-----------------------------------------------
                                    <1>
                                    sendmsg: METADATA_UPDATED
                                    - token_lockres:EX
 <2>
 sendmsg: RESYNCING
 - set MD_CLUSTER_SEND_LOCK:1
 - wait for holding token_lockres:EX

 <3.1>
 mdadm /dev/md0 --remove /dev/sdg
 mddev_lock(mddev)
 sendmsg: REMOVE
 - wait for MD_CLUSTER_SEND_LOCK:0

 <3.2>
 receive METADATA_UPDATED from B
 wait for mddev_trylock(mddev) to return 1

Explanation:
1>
node B sends the METADATA_UPDATED msg. While node B holds token_lockres:EX,
all other nodes in the cluster are blocked from sending messages.

2>
node A is doing the resync job, so it sends RESYNCING msgs at intervals.
The send blocks while waiting for the token_lockres:EX lock (held by B):

md_do_sync
 raid1_sync_request
  resync_info_update
   sendmsg //with mddev_locked: false
    lock_comm
     + wait_event(cinfo->wait,
     |    !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
     + lock_token //for step<1>, block holding EX lock
        + dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); // blocking

3.1>
node A executes the "--remove" action and tries to send a REMOVE msg.
This is blocked by step <2>: MD_CLUSTER_SEND_LOCK is 1.

md_ioctl
+ mddev_lock(mddev) //holding mddev lock, it influences <3.2>
+ hot_remove_disk
   remove_disk
    sendmsg
     lock_comm
       wait_event(cinfo->wait,
         !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));//blocking

3.2>
node A receives the METADATA_UPDATED msg sent by node B in step <1>.
Processing it is blocked by step <3.1>: that path already holds the mddev
lock, so the wait_event below can never take it. (Note:
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this scenario.)

process_metadata_update
  wait_event(mddev->thread->wqueue,
        (got_lock = mddev_trylock(mddev)) ||
        test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));

Steps 1, 2, 3.1 and 3.2 together form a deadlock.
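
To see the shape of the cycle outside the kernel, here is a minimal userspace
analogue (a hypothetical illustration only, not md-cluster code): three
threads stand in for the resync sender, the "--remove" ioctl and the message
receiver, a pthread mutex plays mddev_lock, and two flags play
MD_CLUSTER_SEND_LOCK and the remotely-held token_lockres:EX. In the
deadlocked ordering every wait depends on another thread; giving one wait a
timeout, as the patch does with wait_event_timeout(), collapses the cycle.
```
/* Hypothetical userspace sketch of the md-cluster deadlock.
 * Build with: gcc -pthread sketch.c -o sketch */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t mddev = PTHREAD_MUTEX_INITIALIZER; /* mddev_lock */
static pthread_mutex_t st = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static bool send_lock;   /* MD_CLUSTER_SEND_LOCK */
static bool token_free;  /* token_lockres:EX released by the remote node */

/* wait until *flag == want, giving up after 'secs' seconds (0 = success) */
static int timed_wait(bool *flag, bool want, int secs)
{
	struct timespec ts;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += secs;
	pthread_mutex_lock(&st);
	while (*flag != want && rc == 0)
		rc = pthread_cond_timedwait(&cv, &st, &ts);
	pthread_mutex_unlock(&st);
	return rc;
}

static void set_flag(bool *flag, bool val)
{
	pthread_mutex_lock(&st);
	*flag = val;
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&st);
}

static void *resync(void *arg)        /* step <2>: RESYNCING sender */
{
	set_flag(&send_lock, true);        /* test_and_set_bit(SEND_LOCK) */
	timed_wait(&token_free, true, 30); /* lock_token(): wait for remote EX */
	set_flag(&send_lock, false);       /* unlock_comm() */
	return NULL;
}

static void *remove_ioctl(void *arg)  /* step <3.1>: mdadm --remove */
{
	usleep(100 * 1000);                /* let resync grab send_lock first */
	pthread_mutex_lock(&mddev);        /* mddev_lock() held across the ioctl */
	/* lock_comm(): without a timeout this wait could never end, because
	 * clearing send_lock needs token_free, which needs mddev... */
	if (timed_wait(&send_lock, false, 5) != 0)
		fprintf(stderr, "REMOVE: lock_comm timed out, return error\n");
	pthread_mutex_unlock(&mddev);      /* unblocks the receiver */
	return NULL;
}

static void *receiver(void *arg)      /* step <3.2>: METADATA_UPDATED */
{
	usleep(200 * 1000);                /* arrive after the ioctl holds mddev */
	pthread_mutex_lock(&mddev);        /* the mddev_trylock() wait */
	/* md_reload_sb() would run here */
	pthread_mutex_unlock(&mddev);
	set_flag(&token_free, true);       /* node B can now drop token_lockres */
	return NULL;
}

int main(void)
{
	pthread_t t[3];
	void *(*fn[3])(void *) = { resync, remove_ioctl, receiver };

	for (int i = 0; i < 3; i++)
		pthread_create(&t[i], NULL, fn[i], NULL);
	for (int i = 0; i < 3; i++)
		pthread_join(t[i], NULL);
	printf("finished: the timed wait broke the cycle\n");
	return 0;
}
```
Without the 5-second timeout in remove_ioctl() the three threads would wait
on each other forever, which mirrors the in-kernel hang shown above.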

How to fix:
There are two sides on which the cycle can be broken:
1. On the sending side, modify lock_comm so that it returns success/failure.
   The mdadm command then gets an error when lock_comm times out.
2. On the receiving side, add error handling to process_metadata_update.
   Currently the other msg types either cannot fail or do not need to report
   an error back to the sender, so only process_metadata_update needs to
   change.

Either 1 or 2 alone fixes the hang, but fixing both sides is preferred.
(Note: wait_event_timeout() returns 0 on timeout and non-zero when the
condition became true, which is why the "if (rv)" checks in the diff below
treat non-zero as success.)

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 4aaf4820b6f6..d59a033e7589 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -523,19 +523,24 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
 }
 
 
-static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
+static int process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
 {
-	int got_lock = 0;
+	int got_lock = 0, rv;
 	struct md_cluster_info *cinfo = mddev->cluster_info;
 	mddev->good_device_nr = le32_to_cpu(msg->raid_slot);
 
 	dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
-	wait_event(mddev->thread->wqueue,
-		   (got_lock = mddev_trylock(mddev)) ||
-		    test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
-	md_reload_sb(mddev, mddev->good_device_nr);
-	if (got_lock)
-		mddev_unlock(mddev);
+	rv = wait_event_timeout(mddev->thread->wqueue,
+			   (got_lock = mddev_trylock(mddev)) ||
+			   test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state),
+			   msecs_to_jiffies(5000));
+	if (rv) {
+		md_reload_sb(mddev, mddev->good_device_nr);
+		if (got_lock)
+			mddev_unlock(mddev);
+		return 0;
+	}
+	return -1;
 }
 
 static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg)
@@ -578,7 +583,7 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg)
 		return -1;
 	switch (le32_to_cpu(msg->type)) {
 	case METADATA_UPDATED:
-		process_metadata_update(mddev, msg);
+		ret = process_metadata_update(mddev, msg);
 		break;
 	case CHANGE_CAPACITY:
 		set_capacity(mddev->gendisk, mddev->array_sectors);
@@ -701,10 +706,15 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
  */
 static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
 {
-	wait_event(cinfo->wait,
-		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
+	int rv;
 
-	return lock_token(cinfo, mddev_locked);
+	rv = wait_event_timeout(cinfo->wait,
+			   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state),
+			   msecs_to_jiffies(5000));
+	if (rv)
+		return lock_token(cinfo, mddev_locked);
+	else
+		return -1;
 }
 
 static void unlock_comm(struct md_cluster_info *cinfo)
@@ -784,9 +794,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
 {
 	int ret;
 
-	lock_comm(cinfo, mddev_locked);
-	ret = __sendmsg(cinfo, cmsg);
-	unlock_comm(cinfo);
+	ret = lock_comm(cinfo, mddev_locked);
+	if (!ret) {
+		ret = __sendmsg(cinfo, cmsg);
+		unlock_comm(cinfo);
+	}
 	return ret;
 }
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-05 13:11 [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
  2020-11-05 13:11 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
@ 2020-11-07  0:17 ` Song Liu
  2020-11-07  3:53   ` heming.zhao
  1 sibling, 1 reply; 8+ messages in thread
From: Song Liu @ 2020-11-07  0:17 UTC (permalink / raw)
  To: Zhao Heming
  Cc: linux-raid, Guoqing Jiang, lidong.zhong, Xiao Ni, NeilBrown, Coly Li

On Thu, Nov 5, 2020 at 5:11 AM Zhao Heming <heming.zhao@suse.com> wrote:
>
> Test script (reproducible steps):
> ```
> ssh root@node2 "mdadm -S --scan"
> mdadm -S --scan
> mdadm --zero-superblock /dev/sd{g,h,i}
> for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
> count=20; done
>
> echo "mdadm create array"
> mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
> echo "set up array on node2"
> ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
>
> sleep 5
>
> mdadm --manage --add /dev/md0 /dev/sdi
> mdadm --wait /dev/md0
> mdadm --grow --raid-devices=3 /dev/md0
>
> mdadm /dev/md0 --fail /dev/sdg
> mdadm /dev/md0 --remove /dev/sdg
>  #mdadm --wait /dev/md0
> mdadm --grow --raid-devices=2 /dev/md0
> ```

I found it was hard for me to follow this set. IIUC, the two patches try to
address one issue. Please add a cover letter and reorganize the descriptions
like:

  cover-letter: error behavior, repro steps, analysis, and maybe describe the
             relationship of the two patches.
  1/2 and 2/2: what is being fixed.

Thanks,
Song

[...]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-07  0:17 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Song Liu
@ 2020-11-07  3:53   ` heming.zhao
  0 siblings, 0 replies; 8+ messages in thread
From: heming.zhao @ 2020-11-07  3:53 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, Guoqing Jiang, lidong.zhong, Xiao Ni, NeilBrown, Coly Li

Hello Song,

OK, I will add a cover letter with more descriptions & resend these patches.

Though the test scripts are almost the same, there are two different bugs.
patch 1/2 fixes wrong --grow behaviour (the bug happens after the second --grow cmd is executed).
patch 2/2 fixes an md-cluster deadlock (the deadlock happens before the second --grow cmd).

The patch 1/2 bug was reported by a SUSE customer. When I finished the bugfix
and ran the test script to verify it, I triggered the patch 2/2 bug.

The test script of patch 2/2 adds "--bitmap-chunk=1M" to the mdadm create
command and runs mkfs.xfs after the array is set up. These two steps make the
array do more resync work; more resync time gives a larger time window (more
opportunities) to trigger the deadlock.

Thanks.

On 11/7/20 8:17 AM, Song Liu wrote:
> On Thu, Nov 5, 2020 at 5:11 AM Zhao Heming <heming.zhao@suse.com> wrote:
>>
>> Test script (reproducible steps):
>> ```
>> ssh root@node2 "mdadm -S --scan"
>> mdadm -S --scan
>> mdadm --zero-superblock /dev/sd{g,h,i}
>> for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
>> count=20; done
>>
>> echo "mdadm create array"
>> mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
>> echo "set up array on node2"
>> ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
>>
>> sleep 5
>>
>> mdadm --manage --add /dev/md0 /dev/sdi
>> mdadm --wait /dev/md0
>> mdadm --grow --raid-devices=3 /dev/md0
>>
>> mdadm /dev/md0 --fail /dev/sdg
>> mdadm /dev/md0 --remove /dev/sdg
>>   #mdadm --wait /dev/md0
>> mdadm --grow --raid-devices=2 /dev/md0
>> ```
> 
> I found it was hard for me to follow this set. IIUC, the two patches try to
> address one issue. Please add a cover letter and reorganize the descriptions
> like:
> 
>    cover-letter: error behavior, repro steps, analysis, and maybe describe the
>               relationship of the two patches.
>    1/2 and 2/2: what is being fixed.
> 
> Thanks,
> Song
> 
> [...]
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-10  6:38   ` Guoqing Jiang
@ 2020-11-10  6:59     ` heming.zhao
  0 siblings, 0 replies; 8+ messages in thread
From: heming.zhao @ 2020-11-10  6:59 UTC (permalink / raw)
  To: Guoqing Jiang, linux-raid, song; +Cc: lidong.zhong, xni, neilb, colyli

On 11/10/20 2:38 PM, Guoqing Jiang wrote:
> 
> 
> On 11/8/20 15:53, Zhao Heming wrote:
>> Test script (reproducible steps):
>> ```
>> ssh root@node2 "mdadm -S --scan"
>> mdadm -S --scan
>> mdadm --zero-superblock /dev/sd{g,h,i}
>> for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
>> count=20; done
>>
>> echo "mdadm create array"
>> mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
>> echo "set up array on node2"
>> ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
>>
>> sleep 5
>>
>> mdadm --manage --add /dev/md0 /dev/sdi
>> mdadm --wait /dev/md0
>> mdadm --grow --raid-devices=3 /dev/md0
>>
>> mdadm /dev/md0 --fail /dev/sdg
>> mdadm /dev/md0 --remove /dev/sdg
>>   #mdadm --wait /dev/md0
>> mdadm --grow --raid-devices=2 /dev/md0
>> ```
>>
> 
> What is the result after the above steps? Deadlock or something else.

The result was written in the cover letter, in the "*** error behavior ***" section.
I will add the result as comments in the V2 patch.

> 
>> node A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB, and
>> the disk size is more large the issue is more likely to trigger. (more
>> resync time, more easily trigger issues)
>>
>> There is a workaround:
>> when adding the --wait before second --grow, the issue 1 will disappear.
>>
>> ... ...
>> +        if (ret)
>> +            pr_warn("md: updating array disks failed. %d\n", ret);
>> +    }
>>       /*
>>        * Since mddev->delta_disks has already updated in update_raid_disks,
>>
> 
> Generally, I think it is good.
> 
> Thanks,
> Guoqing
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
  2020-11-09 18:01   ` Song Liu
@ 2020-11-10  6:38   ` Guoqing Jiang
  2020-11-10  6:59     ` heming.zhao
  1 sibling, 1 reply; 8+ messages in thread
From: Guoqing Jiang @ 2020-11-10  6:38 UTC (permalink / raw)
  To: Zhao Heming, linux-raid, song; +Cc: lidong.zhong, xni, neilb, colyli



On 11/8/20 15:53, Zhao Heming wrote:
> Test script (reproducible steps):
> ```
> ssh root@node2 "mdadm -S --scan"
> mdadm -S --scan
> mdadm --zero-superblock /dev/sd{g,h,i}
> for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
> count=20; done
> 
> echo "mdadm create array"
> mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
> echo "set up array on node2"
> ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
> 
> sleep 5
> 
> mdadm --manage --add /dev/md0 /dev/sdi
> mdadm --wait /dev/md0
> mdadm --grow --raid-devices=3 /dev/md0
> 
> mdadm /dev/md0 --fail /dev/sdg
> mdadm /dev/md0 --remove /dev/sdg
>   #mdadm --wait /dev/md0
> mdadm --grow --raid-devices=2 /dev/md0
> ```
> 

What is the result after the above steps? Deadlock, or something else?

> node A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB, and
> the disk size is more large the issue is more likely to trigger. (more
> resync time, more easily trigger issues)
> 
> There is a workaround:
> when adding the --wait before second --grow, the issue 1 will disappear.
> 
> Rootcause:
> 
> In cluster env, every node can start resync job even if the resync cmd
> doesn't execute on it.
> e.g.
> There are two node A & B. User executes "mdadm --grow" on A, sometime B
> will start resync job not A.
> 
> Current update_raid_disks() only check local recovery status, it's
> incomplete.
> 
> How to fix:
> 
> The simple & clear solution is block the reshape action in initiator
> side. When node is executing "--grow" and detecting there is ongoing
> resyncing, it should immediately return & report error to user space.
> 
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>
> ---
>   drivers/md/md.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..74280e353b8f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
>   		return -EINVAL;
>   	if (mddev->sync_thread ||
>   	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
> +		test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
>   	    mddev->reshape_position != MaxSector)
>   		return -EBUSY;
>   
> @@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
>   		}
>   	}
>   
> -	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
> -		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
> +	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
> +		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
> +		if (ret)
> +			pr_warn("md: updating array disks failed. %d\n", ret);
> +	}
>   
>   	/*
>   	 * Since mddev->delta_disks has already updated in update_raid_disks,
> 

Generally, I think it is good.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
@ 2020-11-09 18:01   ` Song Liu
  2020-11-10  6:38   ` Guoqing Jiang
  1 sibling, 0 replies; 8+ messages in thread
From: Song Liu @ 2020-11-09 18:01 UTC (permalink / raw)
  To: Zhao Heming
  Cc: linux-raid, Guoqing Jiang, lidong.zhong, Xiao Ni, NeilBrown, Coly Li

On Sun, Nov 8, 2020 at 6:53 AM Zhao Heming <heming.zhao@suse.com> wrote:
>
[...]

> How to fix:
>
> The simple & clear solution is block the reshape action in initiator
> side. When node is executing "--grow" and detecting there is ongoing
> resyncing, it should immediately return & report error to user space.
>
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>

The code looks good to me. But please revise the commit log to something
similar to the following.

========================== 8< ==========================
md/cluster: block reshape requests with resync job initiated from remote node

In cluster env, a node can start resync job when the resync cmd was executed
on a different node. Reshape requests should be blocked for resync job initiated
by any node. Current code only <condition to block reshape requests>.
This results
in a dead lock in <condition> (see repro steps below). Fix this by <adding the
extra check>.

Repro steps:
...
========================== 8< ==========================

In this way, whoever reads the commit log, which could be yourself in 2021,
will understand the primary goal of this change quickly.

Does this make sense?

Thanks,
Song

> ---
>  drivers/md/md.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..74280e353b8f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
>                 return -EINVAL;
>         if (mddev->sync_thread ||
>             test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
> +               test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
>             mddev->reshape_position != MaxSector)
>                 return -EBUSY;
>
> @@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
>                 }
>         }
>
> -       if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
> -               update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
> +       if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
> +               ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
> +               if (ret)
> +                       pr_warn("md: updating array disks failed. %d\n", ret);
> +       }
>
>         /*
>          * Since mddev->delta_disks has already updated in update_raid_disks,
> --
> 2.27.0
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-08 14:52 [PATCH 0/2] md-cluster bugs fix Zhao Heming
@ 2020-11-08 14:53 ` Zhao Heming
  2020-11-09 18:01   ` Song Liu
  2020-11-10  6:38   ` Guoqing Jiang
  0 siblings, 2 replies; 8+ messages in thread
From: Zhao Heming @ 2020-11-08 14:53 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang
  Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli

Test script (reproducible steps):
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
mdadm --zero-superblock /dev/sd{g,h,i}
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
 #mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

Nodes A & B share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB; the larger the
disks, the more likely the issue is to trigger (more resync time means more
opportunities to hit it).

There is a workaround:
adding "mdadm --wait" before the second --grow makes issue 1 disappear.

Root cause:

In a cluster environment, any node can start the resync job, even if the
resync command was not executed on it.
e.g.
There are two nodes, A & B. The user executes "mdadm --grow" on A, but
sometimes B starts the resync job instead of A.

Current update_raid_disks() only checks the local recovery status, which is
incomplete.

How to fix:

The simple and clear solution is to block the reshape on the initiator side:
when a node executing "--grow" detects an ongoing resync (local or remote),
it should immediately return an error to user space.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+		test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread

Thread overview: 8+ messages
2020-11-05 13:11 [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
2020-11-05 13:11 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
2020-11-07  0:17 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Song Liu
2020-11-07  3:53   ` heming.zhao
2020-11-08 14:52 [PATCH 0/2] md-cluster bugs fix Zhao Heming
2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
2020-11-09 18:01   ` Song Liu
2020-11-10  6:38   ` Guoqing Jiang
2020-11-10  6:59     ` heming.zhao
