* [PATCH 0/2] md-cluster bugs fix
@ 2020-11-08 14:52 Zhao Heming
  2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Zhao Heming @ 2020-11-08 14:52 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang
  Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli

Hello List,

I filed two patches to fix two different md-cluster bugs.
For easy understanding, let us call them issue 1 (related to patch 1)
and issue 2 (related to patch 2).

*** Test env ***

Nodes A & B share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB. The larger
the disks, the more likely the issues are to trigger
(longer resync time makes the issues easier to hit).

*** Test script ***

Issues 1 & 2 can be triggered with the same test script:

```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
 #mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

There is a workaround:
adding --wait before the second --grow makes issue 1 disappear.
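
For reference, this is the tail of the test script with the commented-out
--wait re-enabled (the workaround is just un-commenting that line):

```
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --wait /dev/md0    # workaround: wait for the recovery to finish first
mdadm --grow --raid-devices=2 /dev/md0
```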


*** error behavior ***

issue 1: the test script finishes every command, but the array status is wrong.
issue 2: the test script hangs when executing "mdadm --remove".

array status of issue 1:
(partial output of "mdadm -D /dev/md0")
<case 1> : normal test result.
```
    Number   Major   Minor   RaidDevice State
       1       8      112        0      active sync   /dev/sdh
       2       8      128        1      active sync   /dev/sdi
```

<case 2> : "--faulty" data still exists in the on-disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi

       0       8       96        -      faulty   /dev/sdg
```

<case 3> : "--remove" data still exists in the on-disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi
```

array status of issue 2:
Hang info:
```
node1 # ps axj | grep mdadm
1  5423  5227  2231 ?    -1 D   0   0:00 mdadm /dev/md0 --remove /dev/sdg

node1 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdh[1] sdg[0](F)
      1046528 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (1/1046528) finish=354.0min speed=47K/sec
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

node2 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdg[0](F) sdh[1]
      1046528 blocks super 1.2 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

echo t > /proc/sysrq-trigger
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

*** analysis ***

<issue 1>

In a cluster env, any node can start the resync job even if the resync cmd
wasn't executed on it.
e.g.
There are two nodes A & B. The user executes "mdadm --grow" on A, but
sometimes B starts the resync job instead of A.

Problem:
The current update_raid_disks() only checks the local recovery status,
which is incomplete.

Issue scenario:
```
  node A                           node B
------------------------------   ---------------
mdadm --grow -n 3 md0
 + raid1_reshape
    mddev->raid_disks: 2=>3
                                 start resync job, it will block A resync job
                                 mddev->raid_disks: 2=>3

mdadm md0 --fail sdg
 + update disk: array sb & bitmap sb
 + send METADATA_UPDATE
 (resync job blocking)
                                 (B continue doing "--grow -n 3" resync job)
                                 recv METADATA_UPDATE
                                  + read disk metadata
                                  + raid1_error
                                  + set MD_RECOVERY_INTR to break resync
                                 ... ...
                                 md_check_recovery
                                  + remove_and_add_spares return 1
                                  + set MD_RECOVERY_RECOVER, later restart resync

mdadm md0 --remove sdg
 + md_cluster_ops->remove_disk
 |  + send REMOVE
 + md_kick_rdev_from_array
 + update disk: array sb & bitmap sb
 (resync job blocking)
                                 (B continue doing "--grow -n 3" resync job)
                                 recv REMOVE
                                  + process_remove_disk doesn't set mddev->sb_flags, 
                                     so it doesn't update disk sb & bitmap sb.
                                 ......
                                 md_check_recovery
                                  + md_kick_rdev_from_array 

mdadm --grow -n 2 md0
 + raid1_reshape
 |  mddev->raid_disks: 3=>2
 + send METADATA_UPDATED

                                 (B continue doing "--grow -n 3" resync job)
                                 recv METADATA_UPDATE
                                  + check_sb_changes
                                     update_raid_disks return -EBUSY
                                     update failed for mddev->raid_disks: 3=>2


                                (B never successfully update mddev->raid_disks: 3=>2)
                                when B finish "--grow -n 3" resync job
                                 + use mddev->raid_disks:3 to update array sb & bitmap sb
                                 + send METADATA_UPDATED

recv METADATA_UPDATED
 + read wrong raid_disks to update
   kernel data.
```
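
For reference, the heart of the fix in patch 1 (full diff below) is one
extra check in update_raid_disks(), so the node initiating the reshape
bails out while a remote node is resyncing; a trimmed sketch:

```
static int update_raid_disks(struct mddev *mddev, int raid_disks)
{
	...
	if (mddev->sync_thread ||
	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
	    /* new: a remote node is doing the resync job */
	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
	    mddev->reshape_position != MaxSector)
		return -EBUSY;
	...
}
```

check_sb_changes() is also changed to log a warning when update_raid_disks()
fails (see the diff).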

<issue 2>

First, there is a similar deadlock that was already fixed by commit 0ba959774e93911caff596de6391f085fb640ac4
("md-cluster: use sync way to handle METADATA_UPDATED msg").

Let me explain commit 0ba959774e first.
```
<origin scenario>
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX
                         
                         c.
                         update sb
                          + held reconfig_mutex
                          + metadata_update_start
                             + wait_event(MD_CLUSTER_SEND_LOCK) //blocking from <b>

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + mddev_trylock(mddev) return false //blocking from <c>


<after introducing MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD>
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX
                         
                         c.
                         update sb
                          + held reconfig_mutex
                          + metadata_update_start wait for 
                             + set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD //for breaking <d>
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                              //this will non-block & break deadlock.
```

Issue 2 is very similar to 0ba959774e, except for step <c>.
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever. 
```

Commit 0ba959774e9391 uses MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD to break the deadlock, but in issue 2 it doesn't help.
md_cluster_ops->remove_disk is called from:
- state_store()      which doesn't hold reconfig_mutex
- hot_remove_disk()  which must hold reconfig_mutex

There are two methods to fix it:
1. like commit 0ba959774e, hold reconfig_mutex in state_store() and set
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in remove_disk().
2. like patch 2, change wait_event to wait_event_timeout in lock_comm &
   process_metadata_update (a sketch follows this list).

There are some reasons I prefer method 2:
- I am not very familiar with every scenario in state_store().
  I am not sure whether holding reconfig_mutex there could introduce a new
  bug/issue.
- It looks like all sendmsg cases could trigger issue 2.
  Currently we have found two cases (there may be others):
  - update sb (see commit 0ba959774)
  - mdadm --remove (issue 2)
  so we should break the deadlock in the key code (wait_event =>
  wait_event_timeout).
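
For reference, the sending-side half of method 2 (this is essentially what
patch 2 below does; process_metadata_update gets the same
wait_event_timeout treatment on the receiving side) looks roughly like:

```
static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
{
	/* give up after 5s instead of waiting forever */
	if (!wait_event_timeout(cinfo->wait,
			!test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state),
			msecs_to_jiffies(5000)))
		return -1;

	return lock_token(cinfo, mddev_locked);
}
```

The 5 second timeout value is discussed further down the thread.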

-------
v1:
- create patch

-------
Zhao Heming (2):
  md/cluster: reshape should returns error when remote doing resyncing
    job
  md/cluster: fix deadlock when doing reshape job

 drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
 drivers/md/md.c         |  8 ++++++--
 2 files changed, 33 insertions(+), 17 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-08 14:52 [PATCH 0/2] md-cluster bugs fix Zhao Heming
@ 2020-11-08 14:53 ` Zhao Heming
  2020-11-09 18:01   ` Song Liu
  2020-11-10  6:38   ` Guoqing Jiang
  2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
  2020-11-09 17:43 ` [PATCH 0/2] md-cluster bugs fix Song Liu
  2 siblings, 2 replies; 12+ messages in thread
From: Zhao Heming @ 2020-11-08 14:53 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang
  Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli

Test script (reproducible steps):
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
mdadm --zero-superblock /dev/sd{g,h,i}
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
 #mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

Nodes A & B share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB, and the
larger the disks, the more likely the issue is to trigger (longer resync
time makes the issue easier to hit).

There is a workaround:
adding --wait before the second --grow makes issue 1 disappear.

Root cause:

In a cluster env, any node can start the resync job even if the resync cmd
wasn't executed on it.
e.g.
There are two nodes A & B. The user executes "mdadm --grow" on A, but
sometimes B starts the resync job instead of A.

The current update_raid_disks() only checks the local recovery status,
which is incomplete.

How to fix:

The simple & clear solution is to block the reshape action on the
initiator side. When a node executing "--grow" detects an ongoing resync,
it should immediately return & report an error to user space.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+		test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-08 14:52 [PATCH 0/2] md-cluster bugs fix Zhao Heming
  2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
@ 2020-11-08 14:53 ` Zhao Heming
  2020-11-09  2:02   ` heming.zhao
  2020-11-10  6:36   ` Guoqing Jiang
  2020-11-09 17:43 ` [PATCH 0/2] md-cluster bugs fix Song Liu
  2 siblings, 2 replies; 12+ messages in thread
From: Zhao Heming @ 2020-11-08 14:53 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang
  Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli

There is a similar deadlock to the one fixed by commit 0ba959774e93
("md-cluster: use sync way to handle METADATA_UPDATED msg").

This issue is very similar to 0ba959774e93, except for step <c>:
in 0ba959774e93 step <c> is "update sb", here it is "mdadm --remove".

```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) ||
                              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever.
```

Explanation:

a>
A sends METADATA_UPDATED.
This blocks all other nodes from sending msgs in the cluster.

b>
B is doing the sync job, so it sends RESYNCING at intervals.
This gets blocked waiting to hold the token_lockres:EX lock.
```
md_do_sync
 raid1_sync_request
  resync_info_update
   sendmsg //with mddev_locked: false
    lock_comm
     + wait_event(cinfo->wait,
     |    !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
     + lock_token //for step<a>, block holding EX lock
        + dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); // blocking
```
c>
B does the "--remove" action and sends REMOVE.
This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
```
md_ioctl
+ mddev_lock(mddev) //holding reconfig_mutex, it influences <d>
+ hot_remove_disk
   remove_disk
    sendmsg
     lock_comm
       wait_event(cinfo->wait,
         !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));//blocking
```
d>
B receives the METADATA_UPDATED msg sent from A in step <a>.
This is blocked by step <c>: <c> already holds the mddev lock, so the
wait_event can't take it. (btw, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays
ZERO in this scenario.)
```
process_metadata_update
  wait_event(mddev->thread->wqueue,
        (got_lock = mddev_trylock(mddev)) ||
        test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
```

How to fix:

There are two sides to fix (to break the deadlock):
1. on the sending side, modify lock_comm to return success/failure.
   This makes the mdadm cmd return an error when lock_comm times out.
2. on the receiving side, process_metadata_update needs error handling.
   Currently, other msg types either can't fail or don't need to report an
   error back to the sender, so only process_metadata_update needs to be
   modified.

Either 1 or 2 alone can fix the hang, but I prefer fixing both sides.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 4aaf4820b6f6..d59a033e7589 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -523,19 +523,24 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
 }
 
 
-static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
+static int process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
 {
-	int got_lock = 0;
+	int got_lock = 0, rv;
 	struct md_cluster_info *cinfo = mddev->cluster_info;
 	mddev->good_device_nr = le32_to_cpu(msg->raid_slot);
 
 	dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
-	wait_event(mddev->thread->wqueue,
-		   (got_lock = mddev_trylock(mddev)) ||
-		    test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
-	md_reload_sb(mddev, mddev->good_device_nr);
-	if (got_lock)
-		mddev_unlock(mddev);
+	rv = wait_event_timeout(mddev->thread->wqueue,
+			   (got_lock = mddev_trylock(mddev)) ||
+			   test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state),
+			   msecs_to_jiffies(5000));
+	if (rv) {
+		md_reload_sb(mddev, mddev->good_device_nr);
+		if (got_lock)
+			mddev_unlock(mddev);
+		return 0;
+	}
+	return -1;
 }
 
 static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg)
@@ -578,7 +583,7 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg)
 		return -1;
 	switch (le32_to_cpu(msg->type)) {
 	case METADATA_UPDATED:
-		process_metadata_update(mddev, msg);
+		ret = process_metadata_update(mddev, msg);
 		break;
 	case CHANGE_CAPACITY:
 		set_capacity(mddev->gendisk, mddev->array_sectors);
@@ -701,10 +706,15 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
  */
 static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
 {
-	wait_event(cinfo->wait,
-		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
+	int rv;
 
-	return lock_token(cinfo, mddev_locked);
+	rv = wait_event_timeout(cinfo->wait,
+			   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state),
+			   msecs_to_jiffies(5000));
+	if (rv)
+		return lock_token(cinfo, mddev_locked);
+	else
+		return -1;
 }
 
 static void unlock_comm(struct md_cluster_info *cinfo)
@@ -784,9 +794,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
 {
 	int ret;
 
-	lock_comm(cinfo, mddev_locked);
-	ret = __sendmsg(cinfo, cmsg);
-	unlock_comm(cinfo);
+	ret = lock_comm(cinfo, mddev_locked);
+	if (!ret) {
+		ret = __sendmsg(cinfo, cmsg);
+		unlock_comm(cinfo);
+	}
 	return ret;
 }
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
@ 2020-11-09  2:02   ` heming.zhao
  2020-11-09 18:06     ` Song Liu
  2020-11-10  6:36   ` Guoqing Jiang
  1 sibling, 1 reply; 12+ messages in thread
From: heming.zhao @ 2020-11-09  2:02 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang; +Cc: lidong.zhong, xni, neilb, colyli

Please note, I gave two solutions for this bug in the cover letter.
This patch uses solution 2. For details, please check the cover letter.

Thank you.

On 11/8/20 10:53 PM, Zhao Heming wrote:
> There is a similar deadlock in commit 0ba959774e93
> ("md-cluster: use sync way to handle METADATA_UPDATED msg")
> 
> This issue is very like 0ba959774e93, except <c>.
> 0ba959774e93 step c is "update sb", this issue is "mdadm --remove"
> 
> ```
> nodeA                       nodeB
> --------------------     --------------------
> a.
> send METADATA_UPDATED
> held token_lockres:EX
>                           b.
>                           md_do_sync
>                            resync_info_update
>                              send RESYNCING
>                               + set MD_CLUSTER_SEND_LOCK
>                               + wait for holding token_lockres:EX
> 
>                           c.
>                           mdadm /dev/md0 --remove /dev/sdg
>                            + held reconfig_mutex
>                            + send REMOVE
>                               + wait_event(MD_CLUSTER_SEND_LOCK)
> 
>                           d.
>                           recv_daemon //METADATA_UPDATED from A
>                            process_metadata_update
>                             + (mddev_trylock(mddev) ||
>                                MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
>                               //this time, both return false forever.
> ```
> 
> Explaination:
> 
> a>
> A send METADATA_UPDATED
> this will block all other nodes to send msg in cluster.
> 
> b>
> B does sync jobs, so it will send RESYNCING at intervals.
> this will be block for holding token_lockres:EX lock.
> ```
> md_do_sync
>   raid1_sync_request
>    resync_info_update
>     sendmsg //with mddev_locked: false
>      lock_comm
>       + wait_event(cinfo->wait,
>       |    !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
>       + lock_token //for step<a>, block holding EX lock
>          + dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); // blocking
> ```
> c>
> B do "--remove" action and will send REMOVE.
> this will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
> ```
> md_ioctl
> + mddev_lock(mddev) //holding reconfig_mutex, it influnces <d>
> + hot_remove_disk
>     remove_disk
>      sendmsg
>       lock_comm
>         wait_event(cinfo->wait,
>           !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));//blocking
> ```
> d>
> B recv METADATA_UPDATED msg which send from A in step <a>.
> this will be blocked by step <c>: holding mddev lock, it makes
> wait_event can't hold mddev lock. (btw,
> MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)
> ```
> process_metadata_update
>    wait_event(mddev->thread->wqueue,
>          (got_lock = mddev_trylock(mddev)) ||
>          test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
> ```
> 
> How to fix:
> 
> There are two sides to fix (or break the dead loop):
> 1. on sending msg side, modify lock_comm, change it to return
>     success/failed.
>     This will make mdadm cmd return error when lock_comm is timeout.
> 2. on receiving msg side, process_metadata_update need to add error
>     handling.
>     currently, other msg types won't trigger error or error doesn't need
>     to return sender. So only process_metadata_update need to modify.
> 
> Ether of 1 & 2 can fix the hunging issue, but I prefer fix on both side.
> 
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>
> ---
>   drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
>   1 file changed, 27 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 4aaf4820b6f6..d59a033e7589 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -523,19 +523,24 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
>   }
>   
>   
> -static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
> +static int process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
>   {
> -	int got_lock = 0;
> +	int got_lock = 0, rv;
>   	struct md_cluster_info *cinfo = mddev->cluster_info;
>   	mddev->good_device_nr = le32_to_cpu(msg->raid_slot);
>   
>   	dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
> -	wait_event(mddev->thread->wqueue,
> -		   (got_lock = mddev_trylock(mddev)) ||
> -		    test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
> -	md_reload_sb(mddev, mddev->good_device_nr);
> -	if (got_lock)
> -		mddev_unlock(mddev);
> +	rv = wait_event_timeout(mddev->thread->wqueue,
> +			   (got_lock = mddev_trylock(mddev)) ||
> +			   test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state),
> +			   msecs_to_jiffies(5000));
> +	if (rv) {
> +		md_reload_sb(mddev, mddev->good_device_nr);
> +		if (got_lock)
> +			mddev_unlock(mddev);
> +		return 0;
> +	}
> +	return -1;
>   }
>   
>   static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg)
> @@ -578,7 +583,7 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg)
>   		return -1;
>   	switch (le32_to_cpu(msg->type)) {
>   	case METADATA_UPDATED:
> -		process_metadata_update(mddev, msg);
> +		ret = process_metadata_update(mddev, msg);
>   		break;
>   	case CHANGE_CAPACITY:
>   		set_capacity(mddev->gendisk, mddev->array_sectors);
> @@ -701,10 +706,15 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
>    */
>   static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
>   {
> -	wait_event(cinfo->wait,
> -		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
> +	int rv;
>   
> -	return lock_token(cinfo, mddev_locked);
> +	rv = wait_event_timeout(cinfo->wait,
> +			   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state),
> +			   msecs_to_jiffies(5000));
> +	if (rv)
> +		return lock_token(cinfo, mddev_locked);
> +	else
> +		return -1;
>   }
>   
>   static void unlock_comm(struct md_cluster_info *cinfo)
> @@ -784,9 +794,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
>   {
>   	int ret;
>   
> -	lock_comm(cinfo, mddev_locked);
> -	ret = __sendmsg(cinfo, cmsg);
> -	unlock_comm(cinfo);
> +	ret = lock_comm(cinfo, mddev_locked);
> +	if (!ret) {
> +		ret = __sendmsg(cinfo, cmsg);
> +		unlock_comm(cinfo);
> +	}
>   	return ret;
>   }
>   
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] md-cluster bugs fix
  2020-11-08 14:52 [PATCH 0/2] md-cluster bugs fix Zhao Heming
  2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
  2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
@ 2020-11-09 17:43 ` Song Liu
  2 siblings, 0 replies; 12+ messages in thread
From: Song Liu @ 2020-11-09 17:43 UTC (permalink / raw)
  To: Zhao Heming, Guoqing Jiang
  Cc: linux-raid, lidong.zhong, Xiao Ni, NeilBrown, Coly Li

Hi Guoqing,
On Sun, Nov 8, 2020 at 6:53 AM Zhao Heming <heming.zhao@suse.com> wrote:
>
[...]

Could you please help review the set? It looks good to me. But it will
be great to have your review.

Thanks,
Song

>   - update sb (see commit 0ba959774)
>   - mdadm --remove (issue 2)
>   we should break the deadlock in key code (wait_event => wait_event_timeout).
>
> -------
> v1:
> - create patch
>
> -------
> Zhao Heming (2):
>   md/cluster: reshape should returns error when remote doing resyncing
>     job
>   md/cluster: fix deadlock when doing reshape job
>
>  drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
>  drivers/md/md.c         |  8 ++++++--
>  2 files changed, 33 insertions(+), 17 deletions(-)
>
> --
> 2.27.0
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
@ 2020-11-09 18:01   ` Song Liu
  2020-11-10  6:38   ` Guoqing Jiang
  1 sibling, 0 replies; 12+ messages in thread
From: Song Liu @ 2020-11-09 18:01 UTC (permalink / raw)
  To: Zhao Heming
  Cc: linux-raid, Guoqing Jiang, lidong.zhong, Xiao Ni, NeilBrown, Coly Li

On Sun, Nov 8, 2020 at 6:53 AM Zhao Heming <heming.zhao@suse.com> wrote:
>
[...]

> How to fix:
>
> The simple & clear solution is block the reshape action in initiator
> side. When node is executing "--grow" and detecting there is ongoing
> resyncing, it should immediately return & report error to user space.
>
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>

The code looks good to me. But please revise the commit log as something
similar to the following.

========================== 8< ==========================
md/cluster: block reshape requests with resync job initiated from remote node

In cluster env, a node can start resync job when the resync cmd was executed
on a different node. Reshape requests should be blocked for resync job initiated
by any node. Current code only <condition to block reshape requests>.
This results
in a dead lock in <condition> (see repro steps below). Fix this by <adding the
extra check>.

Repro steps:
...
========================== 8< ==========================

In this way, whoever reads the commit log, which could be yourself in 2021,
will understand the primary goal of this change quickly.

Does this make sense?

Thanks,
Song

> ---
>  drivers/md/md.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..74280e353b8f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
>                 return -EINVAL;
>         if (mddev->sync_thread ||
>             test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
> +               test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
>             mddev->reshape_position != MaxSector)
>                 return -EBUSY;
>
> @@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
>                 }
>         }
>
> -       if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
> -               update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
> +       if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
> +               ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
> +               if (ret)
> +                       pr_warn("md: updating array disks failed. %d\n", ret);
> +       }
>
>         /*
>          * Since mddev->delta_disks has already updated in update_raid_disks,
> --
> 2.27.0
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-09  2:02   ` heming.zhao
@ 2020-11-09 18:06     ` Song Liu
  2020-11-10  2:24       ` heming.zhao
  0 siblings, 1 reply; 12+ messages in thread
From: Song Liu @ 2020-11-09 18:06 UTC (permalink / raw)
  To: heming.zhao
  Cc: linux-raid, Guoqing Jiang, lidong.zhong, Xiao Ni, NeilBrown, Coly Li

On Sun, Nov 8, 2020 at 6:02 PM heming.zhao@suse.com
<heming.zhao@suse.com> wrote:
>
> Please note, I gave two solutions for this bug in cover-letter.
> This patch uses solution 2. For detail, please check cover-letter.
>
> Thank you.
>

[...]

> >
> > How to fix:
> >
> > There are two sides to fix (or break the dead loop):
> > 1. on sending msg side, modify lock_comm, change it to return
> >     success/failed.
> >     This will make mdadm cmd return error when lock_comm is timeout.
> > 2. on receiving msg side, process_metadata_update need to add error
> >     handling.
> >     currently, other msg types won't trigger error or error doesn't need
> >     to return sender. So only process_metadata_update need to modify.
> >
> > Ether of 1 & 2 can fix the hunging issue, but I prefer fix on both side.
> >

Similar comments on how to make the commit log easy to understand.
Besides that, please split the change into two commits, for fix #1 and #2
respectively.

Thanks,
Song

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-09 18:06     ` Song Liu
@ 2020-11-10  2:24       ` heming.zhao
  0 siblings, 0 replies; 12+ messages in thread
From: heming.zhao @ 2020-11-10  2:24 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, Guoqing Jiang, lidong.zhong, Xiao Ni, NeilBrown, Coly Li

On 11/10/20 2:06 AM, Song Liu wrote:
> On Sun, Nov 8, 2020 at 6:02 PM heming.zhao@suse.com
> <heming.zhao@suse.com> wrote:
>>
>> Please note, I gave two solutions for this bug in cover-letter.
>> This patch uses solution 2. For detail, please check cover-letter.
>>
>> Thank you.
>>
> 
> [...]
> 
>>>
>>> How to fix:
>>>
>>> There are two sides to fix (or break the dead loop):
>>> 1. on sending msg side, modify lock_comm, change it to return
>>>      success/failed.
>>>      This will make mdadm cmd return error when lock_comm is timeout.
>>> 2. on receiving msg side, process_metadata_update need to add error
>>>      handling.
>>>      currently, other msg types won't trigger error or error doesn't need
>>>      to return sender. So only process_metadata_update need to modify.
>>>
>>> Ether of 1 & 2 can fix the hunging issue, but I prefer fix on both side.
>>>
> 
> Similar comments on how to make the commit log easy to understand.
> Besides that, please split the change into two commits, for fix #1 and #2
> respectively.
> 

What I meant is that solution 2 also has two sub-solutions: sending side or receiving side.
(but in fact, there are 3 sub-solutions: sending, receiving & both sides)

sending side, related to patch 2 functions: sendmsg & lock_comm
 (code flow: sendmsg => lock_comm)

receiving side, related to patch 2 functions: process_recvd_msg & process_metadata_update
 (code flow: process_recvd_msg => process_metadata_update)

Breaking the wait on either side breaks the deadlock. In patch 2, my fix covers both sides.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
  2020-11-09  2:02   ` heming.zhao
@ 2020-11-10  6:36   ` Guoqing Jiang
  2020-11-11 12:13     ` heming.zhao
  1 sibling, 1 reply; 12+ messages in thread
From: Guoqing Jiang @ 2020-11-10  6:36 UTC (permalink / raw)
  To: Zhao Heming, linux-raid, song; +Cc: lidong.zhong, xni, neilb, colyli



On 11/8/20 15:53, Zhao Heming wrote:
> There is a similar deadlock in commit 0ba959774e93
> ("md-cluster: use sync way to handle METADATA_UPDATED msg")
> 
> This issue is very like 0ba959774e93, except <c>.
> 0ba959774e93 step c is "update sb", this issue is "mdadm --remove"
> 
> ```
> nodeA                       nodeB
> --------------------     --------------------
> a.
> send METADATA_UPDATED
> held token_lockres:EX
>                           b.
>                           md_do_sync
>                            resync_info_update
>                              send RESYNCING
>                               + set MD_CLUSTER_SEND_LOCK
>                               + wait for holding token_lockres:EX
> 
>                           c.
>                           mdadm /dev/md0 --remove /dev/sdg
>                            + held reconfig_mutex
>                            + send REMOVE
>                               + wait_event(MD_CLUSTER_SEND_LOCK)
> 
>                           d.
>                           recv_daemon //METADATA_UPDATED from A
>                            process_metadata_update
>                             + (mddev_trylock(mddev) ||
>                                MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
>                               //this time, both return false forever.
> ```
> 
> Explaination:
> 
> a>
> A send METADATA_UPDATED
> this will block all other nodes to send msg in cluster.
> 
> b>
> B does sync jobs, so it will send RESYNCING at intervals.
> this will be block for holding token_lockres:EX lock.
> ```
> md_do_sync
>   raid1_sync_request
>    resync_info_update
>     sendmsg //with mddev_locked: false
>      lock_comm
>       + wait_event(cinfo->wait,
>       |    !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
>       + lock_token //for step<a>, block holding EX lock
>          + dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); // blocking
> ```
> c>
> B do "--remove" action and will send REMOVE.
> this will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
> ```
> md_ioctl
> + mddev_lock(mddev) //holding reconfig_mutex, it influnces <d>
> + hot_remove_disk
>     remove_disk
>      sendmsg
>       lock_comm
>         wait_event(cinfo->wait,
>           !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));//blocking
> ```
> d>
> B recv METADATA_UPDATED msg which send from A in step <a>.
> this will be blocked by step <c>: holding mddev lock, it makes
> wait_event can't hold mddev lock. (btw,
> MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)
> ```
> process_metadata_update
>    wait_event(mddev->thread->wqueue,
>          (got_lock = mddev_trylock(mddev)) ||
>          test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
> ```
> 
> How to fix:
> 
> There are two sides to fix (or break the dead loop):
> 1. on sending msg side, modify lock_comm, change it to return
>     success/failed.
>     This will make mdadm cmd return error when lock_comm is timeout.
> 2. on receiving msg side, process_metadata_update need to add error
>     handling.
>     currently, other msg types won't trigger error or error doesn't need
>     to return sender. So only process_metadata_update need to modify.
> 
> Ether of 1 & 2 can fix the hunging issue, but I prefer fix on both side.

It's better if you put the deadlock information (such as the 'D' task
stack) in the commit header.

> 
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>
> ---
>   drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
>   1 file changed, 27 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 4aaf4820b6f6..d59a033e7589 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -523,19 +523,24 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
>   }
>   
>   
> -static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
> +static int process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
>   {
> -	int got_lock = 0;
> +	int got_lock = 0, rv;
>   	struct md_cluster_info *cinfo = mddev->cluster_info;
>   	mddev->good_device_nr = le32_to_cpu(msg->raid_slot);
>   
>   	dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
> -	wait_event(mddev->thread->wqueue,
> -		   (got_lock = mddev_trylock(mddev)) ||
> -		    test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
> -	md_reload_sb(mddev, mddev->good_device_nr);
> -	if (got_lock)
> -		mddev_unlock(mddev);
> +	rv = wait_event_timeout(mddev->thread->wqueue,
> +			   (got_lock = mddev_trylock(mddev)) ||
> +			   test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state),
> +			   msecs_to_jiffies(5000));
> +	if (rv) {
> +		md_reload_sb(mddev, mddev->good_device_nr);
> +		if (got_lock)
> +			mddev_unlock(mddev);
> +		return 0;
> +	}
> +	return -1;
>   }
>   
>   static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg)
> @@ -578,7 +583,7 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg)
>   		return -1;
>   	switch (le32_to_cpu(msg->type)) {
>   	case METADATA_UPDATED:
> -		process_metadata_update(mddev, msg);
> +		ret = process_metadata_update(mddev, msg);
>   		break;
>   	case CHANGE_CAPACITY:
>   		set_capacity(mddev->gendisk, mddev->array_sectors);
> @@ -701,10 +706,15 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
>    */
>   static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
>   {
> -	wait_event(cinfo->wait,
> -		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
> +	int rv;
>   
> -	return lock_token(cinfo, mddev_locked);
> +	rv = wait_event_timeout(cinfo->wait,
> +			   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state),
> +			   msecs_to_jiffies(5000));
> +	if (rv)
> +		return lock_token(cinfo, mddev_locked);
> +	else
> +		return -1;
>   }
>   
>   static void unlock_comm(struct md_cluster_info *cinfo)
> @@ -784,9 +794,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
>   {
>   	int ret;
>   
> -	lock_comm(cinfo, mddev_locked);
> -	ret = __sendmsg(cinfo, cmsg);
> -	unlock_comm(cinfo);
> +	ret = lock_comm(cinfo, mddev_locked);
> +	if (!ret) {
> +		ret = __sendmsg(cinfo, cmsg);
> +		unlock_comm(cinfo);
> +	}
>   	return ret;
>   }
>   

What happens if the cluster has latency issues? Let's say a node normally
can't get the lock within the 5s window. IOW, is the timeout value
well justified?

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
  2020-11-09 18:01   ` Song Liu
@ 2020-11-10  6:38   ` Guoqing Jiang
  2020-11-10  6:59     ` heming.zhao
  1 sibling, 1 reply; 12+ messages in thread
From: Guoqing Jiang @ 2020-11-10  6:38 UTC (permalink / raw)
  To: Zhao Heming, linux-raid, song; +Cc: lidong.zhong, xni, neilb, colyli



On 11/8/20 15:53, Zhao Heming wrote:
> Test script (reproducible steps):
> ```
> ssh root@node2 "mdadm -S --scan"
> mdadm -S --scan
> mdadm --zero-superblock /dev/sd{g,h,i}
> for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
> count=20; done
> 
> echo "mdadm create array"
> mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
> echo "set up array on node2"
> ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
> 
> sleep 5
> 
> mdadm --manage --add /dev/md0 /dev/sdi
> mdadm --wait /dev/md0
> mdadm --grow --raid-devices=3 /dev/md0
> 
> mdadm /dev/md0 --fail /dev/sdg
> mdadm /dev/md0 --remove /dev/sdg
>   #mdadm --wait /dev/md0
> mdadm --grow --raid-devices=2 /dev/md0
> ```
> 

What is the result after the above steps? A deadlock, or something else?

> node A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB, and
> the disk size is more large the issue is more likely to trigger. (more
> resync time, more easily trigger issues)
> 
> There is a workaround:
> when adding the --wait before second --grow, the issue 1 will disappear.
> 
> Rootcause:
> 
> In cluster env, every node can start resync job even if the resync cmd
> doesn't execute on it.
> e.g.
> There are two node A & B. User executes "mdadm --grow" on A, sometime B
> will start resync job not A.
> 
> Current update_raid_disks() only check local recovery status, it's
> incomplete.
> 
> How to fix:
> 
> The simple & clear solution is block the reshape action in initiator
> side. When node is executing "--grow" and detecting there is ongoing
> resyncing, it should immediately return & report error to user space.
> 
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>
> ---
>   drivers/md/md.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..74280e353b8f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
>   		return -EINVAL;
>   	if (mddev->sync_thread ||
>   	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
> +		test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
>   	    mddev->reshape_position != MaxSector)
>   		return -EBUSY;
>   
> @@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
>   		}
>   	}
>   
> -	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
> -		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
> +	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
> +		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
> +		if (ret)
> +			pr_warn("md: updating array disks failed. %d\n", ret);
> +	}
>   
>   	/*
>   	 * Since mddev->delta_disks has already updated in update_raid_disks,
> 

Generally, I think it is good.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-10  6:38   ` Guoqing Jiang
@ 2020-11-10  6:59     ` heming.zhao
  0 siblings, 0 replies; 12+ messages in thread
From: heming.zhao @ 2020-11-10  6:59 UTC (permalink / raw)
  To: Guoqing Jiang, linux-raid, song; +Cc: lidong.zhong, xni, neilb, colyli

On 11/10/20 2:38 PM, Guoqing Jiang wrote:
> 
> 
> On 11/8/20 15:53, Zhao Heming wrote:
>> Test script (reproducible steps):
>> ```
>> ssh root@node2 "mdadm -S --scan"
>> mdadm -S --scan
>> mdadm --zero-superblock /dev/sd{g,h,i}
>> for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
>> count=20; done
>>
>> echo "mdadm create array"
>> mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
>> echo "set up array on node2"
>> ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
>>
>> sleep 5
>>
>> mdadm --manage --add /dev/md0 /dev/sdi
>> mdadm --wait /dev/md0
>> mdadm --grow --raid-devices=3 /dev/md0
>>
>> mdadm /dev/md0 --fail /dev/sdg
>> mdadm /dev/md0 --remove /dev/sdg
>>   #mdadm --wait /dev/md0
>> mdadm --grow --raid-devices=2 /dev/md0
>> ```
>>
> 
> What is the result after the above steps? Deadlock or something else.

The result was written in the cover letter, in the "*** error behavior ***"
section. I will add the result to the V2 patch as comments.

> 
>> node A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB, and
>> the disk size is more large the issue is more likely to trigger. (more
>> resync time, more easily trigger issues)
>>
>> There is a workaround:
>> when adding the --wait before second --grow, the issue 1 will disappear.
>>
>> ... ...
>> +        if (ret)
>> +            pr_warn("md: updating array disks failed. %d\n", ret);
>> +    }
>>       /*
>>        * Since mddev->delta_disks has already updated in update_raid_disks,
>>
> 
> Generally, I think it is good.
> 
> Thanks,
> Guoqing
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-10  6:36   ` Guoqing Jiang
@ 2020-11-11 12:13     ` heming.zhao
  0 siblings, 0 replies; 12+ messages in thread
From: heming.zhao @ 2020-11-11 12:13 UTC (permalink / raw)
  To: Guoqing Jiang, linux-raid, song; +Cc: lidong.zhong, xni, neilb, colyli

On 11/10/20 2:36 PM, Guoqing Jiang wrote:
> 
> 
> On 11/8/20 15:53, Zhao Heming wrote:
>> There is a similar deadlock in commit 0ba959774e93
>> ("md-cluster: use sync way to handle METADATA_UPDATED msg")
>> ... ...
>>   static void unlock_comm(struct md_cluster_info *cinfo)
>> @@ -784,9 +794,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
>>   {
>>       int ret;
>> -    lock_comm(cinfo, mddev_locked);
>> -    ret = __sendmsg(cinfo, cmsg);
>> -    unlock_comm(cinfo);
>> +    ret = lock_comm(cinfo, mddev_locked);
>> +    if (!ret) {
>> +        ret = __sendmsg(cinfo, cmsg);
>> +        unlock_comm(cinfo);
>> +    }
>>       return ret;
>>   }
> 
> What happens if the cluster has latency issue? Let's say, node normally can't get lock during the 5s window. IOW, is the timeout value justified well?

5 seconds is an arbitrary number that came to mind while I was coding.

Below I give more background on the patch.

The key of patch 2 is the change from wait_event to wait_event_timeout.
After applying the patch, the logic of some functions also changes.
I will analyze the two sides: send & receive.

*** send side ***

Before applying the patch, sendmsg only fails when __sendmsg returns an
error, which means the dlm layer failed to convert the lock. But the code
really should also check the return value of lock_comm.

After applying the patch, there is a new failure case: the 5s timeout.

All related functions:
resync_bitmap
- returns void; only prints error info on failure.

update_bitmap_size
- returns err; the caller already handles the error case.

resync_info_update
- returns err; callers ignore it. Because this is an informational msg, it
  can safely be ignored.

remove_disk
- returns err; some callers handle the error case.
  I forgot to modify hot_remove_disk(); I will add it in the v2 patch, so
  in v2 all callers will handle the error case.

gather_bitmaps
- returns err; the caller already handles the error case.

We can see there is only one function that doesn't care about the return
value: resync_bitmap. This function is used in the leave path. If the msg
isn't sent out (5s timeout), other nodes won't learn of the failure event
via BITMAP_NEEDS_SYNC. But if my understanding is correct, even without
BITMAP_NEEDS_SYNC there is another API, recover_slot(), which is triggered
by dlm and does the same job.

So all the sending-side functions are safe.

*** receive side ***

process_metadata_update
With the patch it won't do the metadata update job when the 5s timeout hits.
And I have changed my mind: a 5s timeout here is wrong. The receive side
should do as much as possible to handle incoming msgs. If there is a 5s
timeout code branch, it will trigger an inconsistency issue.
e.g.
A does --faulty and sends METADATA_UPDATED to B.
B receives the msg but hits the 5s timeout, so it doesn't update the kernel
md info, and there is now a gap between the kernel md info and the on-disk
metadata.


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-11-11 12:15 UTC | newest]

Thread overview: 12+ messages
2020-11-08 14:52 [PATCH 0/2] md-cluster bugs fix Zhao Heming
2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
2020-11-09 18:01   ` Song Liu
2020-11-10  6:38   ` Guoqing Jiang
2020-11-10  6:59     ` heming.zhao
2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
2020-11-09  2:02   ` heming.zhao
2020-11-09 18:06     ` Song Liu
2020-11-10  2:24       ` heming.zhao
2020-11-10  6:36   ` Guoqing Jiang
2020-11-11 12:13     ` heming.zhao
2020-11-09 17:43 ` [PATCH 0/2] md-cluster bugs fix Song Liu
