* [PATCH v3 0/2] md/cluster bugs fix
@ 2020-11-15  4:30 Zhao Heming
  2020-11-15  4:30 ` [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
  2020-11-15  4:30 ` [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
  0 siblings, 2 replies; 6+ messages in thread
From: Zhao Heming @ 2020-11-15  4:30 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni
  Cc: Zhao Heming, lidong.zhong, neilb, colyli

Hello List,

Here are two patches to fix md-cluster bugs. Both bugs can be triggered
with the same test script:
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
 count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
#mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

For details, please check each patch's commit log.
-------
v3:
- patch 1/2
  - no change
- patch 2/2
  - use Xiao's solution to fix
  - revise commit log for the "How to fix" part
v2:
- patch 1/2
  - change patch subject
  - add test result in commit log
  - no change for code
- patch 2/2
  - add test result in commit log
  - add error handling of remove_disk in hot_remove_disk
  - add error handling of lock_comm in all callers
  - remove 5s timeout fix on the receive side (for process_metadata_update)
v1:
- create patch
-------

Zhao Heming (2):
  md/cluster: reshape should returns error when remote doing resyncing
    job
  md/cluster: fix deadlock when doing reshape job

 drivers/md/md-cluster.c | 69 +++++++++++++++++++++++------------------
 drivers/md/md.c         | 14 ++++++---
 2 files changed, 49 insertions(+), 34 deletions(-)

-- 
2.27.0

^ permalink raw reply	[flat|nested] 6+ messages in thread
* [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-15  4:30 [PATCH v3 0/2] md/cluster bugs fix Zhao Heming
@ 2020-11-15  4:30 ` Zhao Heming
  2020-11-16  9:05   ` Song Liu
  2020-11-15  4:30 ` [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
  1 sibling, 1 reply; 6+ messages in thread
From: Zhao Heming @ 2020-11-15  4:30 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni
  Cc: Zhao Heming, lidong.zhong, neilb, colyli

Test script (reproducible steps):
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
mdadm --zero-superblock /dev/sd{g,h,i}
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
 count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
#mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

Nodes A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB. The
larger the disks, the easier the issue is to trigger (a longer resync
window makes the race more likely to hit).

There is a workaround: adding "mdadm --wait" before the second --grow
makes the issue disappear.

Root cause:
In a cluster env, every node can start a resync job, even if the resync
cmd wasn't executed on it. E.g., there are two nodes A & B. The user
executes "mdadm --grow" on A, but sometimes B will start the resync job
instead of A. Current update_raid_disks() only checks the local
recovery status, which is incomplete.

How to fix:
The simple & clear solution is to block the reshape action on the
initiator side. When a node executes "--grow" and detects an ongoing
resync, it should immediately return & report an error to user space.
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
-- 
2.27.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* Re: [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-15  4:30 ` [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
@ 2020-11-16  9:05   ` Song Liu
  0 siblings, 0 replies; 6+ messages in thread
From: Song Liu @ 2020-11-16  9:05 UTC (permalink / raw)
  To: Zhao Heming
  Cc: linux-raid, Guoqing Jiang, Xiao Ni, lidong.zhong, NeilBrown, Coly Li

On Sat, Nov 14, 2020 at 8:30 PM Zhao Heming <heming.zhao@suse.com> wrote:
>
[...]
>
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>

The fix makes sense to me. But I really hope we can improve the commit
log. I have made some changes to it, with a couple of TODOs for you (see
below). Please read it, fill in the TODOs, and revise 2/2 accordingly.

Thanks,
Song

md/cluster: block reshape with remote resync job

A reshape request should be blocked while a resync job is ongoing. In a
cluster env, a node can start a resync job even if the resync cmd isn't
executed on it, e.g., user executes "mdadm --grow" on node A, but
sometimes node B will start the resync job. However, current
update_raid_disks() only checks the local recovery status, which is
incomplete. As a result, we see (TODO describe observed issue).

Fix this issue by blocking the reshape request. When a node executes
"--grow" and detects an ongoing resync, it should stop and report an
error to the user.

The following script reproduces the issue with (TODO: ???%) probability.

```
# on node1, node2 is the remote node.
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Cc: <stable@vger.kernel.org>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>

> ---
>  drivers/md/md.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..74280e353b8f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
[...]

^ permalink raw reply	[flat|nested] 6+ messages in thread
* [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-15  4:30 [PATCH v3 0/2] md/cluster bugs fix Zhao Heming
  2020-11-15  4:30 ` [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
@ 2020-11-15  4:30 ` Zhao Heming
  2020-11-15  4:39   ` heming.zhao
  2020-11-17  3:21   ` Xiao Ni
  1 sibling, 2 replies; 6+ messages in thread
From: Zhao Heming @ 2020-11-15  4:30 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni
  Cc: Zhao Heming, lidong.zhong, neilb, colyli

The deadlock this patch fixes is very similar to the one in commit
0ba959774e93 ("md-cluster: use sync way to handle METADATA_UPDATED
msg"), except for step <c>: in 0ba959774e93 step <c> is "update sb",
here it is "mdadm --remove".

```
nodeA                       nodeB
--------------------        --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                            b.
                            md_do_sync
                             resync_info_update
                              send RESYNCING
                               + set MD_CLUSTER_SEND_LOCK
                               + wait for holding token_lockres:EX

                            c.
                            mdadm /dev/md0 --remove /dev/sdg
                             + held reconfig_mutex
                             + send REMOVE
                                + wait_event(MD_CLUSTER_SEND_LOCK)

                            d.
                            recv_daemon //METADATA_UPDATED from A
                             process_metadata_update
                              + (mddev_trylock(mddev) ||
                                 MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                //this time, both return false forever.
```

Explanation:

a>
A sends METADATA_UPDATED.
This blocks all other nodes from sending msgs in the cluster.

b>
B is doing a sync job, so it sends RESYNCING at intervals.
This blocks while waiting to hold the token_lockres:EX lock.
```
md_do_sync
 raid1_sync_request
  resync_info_update
   sendmsg //with mddev_locked: false
    lock_comm
     + wait_event(cinfo->wait,
     |      !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
     + lock_token //for step<a>, blocks on holding EX lock
        + dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); // blocking
```

c>
B does the "--remove" action and wants to send REMOVE.
This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
```
md_ioctl
 + mddev_lock(mddev) //holding reconfig_mutex, it influences <d>
 + hot_remove_disk
    remove_disk
     sendmsg //with mddev_locked: true
      lock_comm
       wait_event(cinfo->wait,
            !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));//blocking
```

d>
B receives the METADATA_UPDATED msg sent from A in step <a>.
This is blocked by step <c>: mdadm is holding the mddev lock, so
wait_event can't take the mddev lock. (btw,
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this scenario.)
```
process_metadata_update
 wait_event(mddev->thread->wqueue,
         (got_lock = mddev_trylock(mddev)) ||
          test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
```

Repro steps:

Test env

Nodes A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.

Test script
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
 count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Test result

The test script hangs when executing "mdadm --remove".
```
node1 # ps axj | grep mdadm
1  5423  5227  2231 ?  -1 D  0  0:00 mdadm /dev/md0 --remove /dev/sdg

node1 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdh[1] sdg[0](F)
      1046528 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (1/1046528)
      finish=354.0min speed=47K/sec
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

node2 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdg[0](F) sdh[1]
      1046528 blocks super 1.2 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

node1 # echo t > /proc/sysrq-trigger
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

How to fix:

Use the same approach as metadata_update_start. lock_comm &
metadata_update_start are two equal users that want to take the token
lock, so lock_comm can follow the same steps as metadata_update_start.
The patch moves the setting of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from
lock_token to lock_comm. It slightly enlarges the window in which
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set, but it is safe and breaks
the deadlock.

At last, thanks to Xiao for the solution.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md-cluster.c | 69 +++++++++++++++++++++++------------------
 drivers/md/md.c         |  6 ++--
 2 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 4aaf4820b6f6..405bcc5d513e 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -664,9 +664,27 @@ static void recv_daemon(struct md_thread *thread)
  * Takes the lock on the TOKEN lock resource so no other
  * node can communicate while the operation is underway.
  */
-static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
+static int lock_token(struct md_cluster_info *cinfo)
 {
-	int error, set_bit = 0;
+	int error;
+
+	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
+	if (error) {
+		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
+		       __func__, __LINE__, error);
+	} else {
+		/* Lock the receive sequence */
+		mutex_lock(&cinfo->recv_mutex);
+	}
+	return error;
+}
+
+/* lock_comm()
+ * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
+ */
+static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
+{
+	int rv, set_bit = 0;
 	struct mddev *mddev = cinfo->mddev;
 
 	/*
@@ -677,34 +695,19 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
 	 */
 	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
 				      &cinfo->state)) {
-		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+		rv = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
 					      &cinfo->state);
-		WARN_ON_ONCE(error);
+		WARN_ON_ONCE(rv);
 		md_wakeup_thread(mddev->thread);
 		set_bit = 1;
 	}
-	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
-	if (set_bit)
-		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 
-	if (error)
-		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
-				__func__, __LINE__, error);
-
-	/* Lock the receive sequence */
-	mutex_lock(&cinfo->recv_mutex);
-	return error;
-}
-
-/* lock_comm()
- * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
- */
-static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
-{
 	wait_event(cinfo->wait,
-		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
-
-	return lock_token(cinfo, mddev_locked);
+		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
+	rv = lock_token(cinfo);
+	if (set_bit)
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+	return rv;
 }
 
 static void unlock_comm(struct md_cluster_info *cinfo)
@@ -784,9 +787,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
 {
 	int ret;
 
-	lock_comm(cinfo, mddev_locked);
-	ret = __sendmsg(cinfo, cmsg);
-	unlock_comm(cinfo);
+	ret = lock_comm(cinfo, mddev_locked);
+	if (!ret) {
+		ret = __sendmsg(cinfo, cmsg);
+		unlock_comm(cinfo);
+	}
 	return ret;
 }
 
@@ -1061,7 +1066,7 @@ static int metadata_update_start(struct mddev *mddev)
 		return 0;
 	}
 
-	ret = lock_token(cinfo, 1);
+	ret = lock_token(cinfo);
 	clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 	return ret;
 }
@@ -1255,7 +1260,10 @@ static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
 	int raid_slot = -1;
 
 	md_update_sb(mddev, 1);
-	lock_comm(cinfo, 1);
+	if (lock_comm(cinfo, 1)) {
+		pr_err("%s: lock_comm failed\n", __func__);
+		return;
+	}
 
 	memset(&cmsg, 0, sizeof(cmsg));
 	cmsg.type = cpu_to_le32(METADATA_UPDATED);
@@ -1407,7 +1415,8 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
 	cmsg.type = cpu_to_le32(NEWDISK);
 	memcpy(cmsg.uuid, uuid, 16);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	lock_comm(cinfo, 1);
+	if (lock_comm(cinfo, 1))
+		return -EAGAIN;
 	ret = __sendmsg(cinfo, &cmsg);
 	if (ret) {
 		unlock_comm(cinfo);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 74280e353b8f..46da165afde2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6948,8 +6948,10 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
 		goto busy;
 
 kick_rdev:
-	if (mddev_is_clustered(mddev))
-		md_cluster_ops->remove_disk(mddev, rdev);
+	if (mddev_is_clustered(mddev)) {
+		if (md_cluster_ops->remove_disk(mddev, rdev))
+			goto busy;
+	}
 
 	md_kick_rdev_from_array(rdev);
 	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
-- 
2.27.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* Re: [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-15  4:30 ` [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
@ 2020-11-15  4:39   ` heming.zhao
  2020-11-17  3:21   ` Xiao Ni
  1 sibling, 0 replies; 6+ messages in thread
From: heming.zhao @ 2020-11-15  4:39 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni; +Cc: lidong.zhong, neilb, colyli

On 11/15/20 12:30 PM, Zhao Heming wrote:
> ... ...
> How to fix:
>
> Use the same way of metadata_update_start. lock_comm &
> metadata_update_start are two equal users that want to get token lock.
> lock_comm could do the same steps like metadata_update_start.
> The patch moves setting MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from
> lock_token to lock_comm. It enlarge a little bit window of
> MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, but it is safe & can break deadlock
> perfectly.
>
> At last, thanks for Xiao's solution.
>

For easy review, I paste the key fix code from the v3 patch below.

Legacy code (original upstream code):
```
static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
{
	int error, set_bit = 0;
	struct mddev *mddev = cinfo->mddev;

	/*
	 * If resync thread run after raid1d thread, then process_metadata_update
	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
	 * since another node already got EX on Token and waitting the EX of Ack),
	 * so let resync wake up thread in case flag is set.
	 */
	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
				      &cinfo->state)) {
		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
					      &cinfo->state);
		WARN_ON_ONCE(error);
		md_wakeup_thread(mddev->thread);
		set_bit = 1;
	}
	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
	if (set_bit)
		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);

	if (error)
		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
				__func__, __LINE__, error);

	/* Lock the receive sequence */
	mutex_lock(&cinfo->recv_mutex);
	return error;
}

/* lock_comm()
 * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
 */
static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
{
	wait_event(cinfo->wait,
		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));

	return lock_token(cinfo, mddev_locked);
}
```

v3 patch:
```
static int lock_token(struct md_cluster_info *cinfo)
{
	int error;

	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
	if (error) {
		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
		       __func__, __LINE__, error);
	} else {
		/* Lock the receive sequence */
		mutex_lock(&cinfo->recv_mutex);
	}
	return error;
}

/* lock_comm()
 * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
 */
static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
{
	int rv, set_bit = 0;
	struct mddev *mddev = cinfo->mddev;

	/*
	 * If resync thread run after raid1d thread, then process_metadata_update
	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
	 * since another node already got EX on Token and waitting the EX of Ack),
	 * so let resync wake up thread in case flag is set.
	 */
	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
				      &cinfo->state)) {
		rv = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
					   &cinfo->state);
		WARN_ON_ONCE(rv);
		md_wakeup_thread(mddev->thread);
		set_bit = 1;
	}
	wait_event(cinfo->wait,
		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
	rv = lock_token(cinfo);
	if (set_bit)
		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
	return rv;
}
```

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-15  4:30 ` [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
  2020-11-15  4:39   ` heming.zhao
@ 2020-11-17  3:21   ` Xiao Ni
  1 sibling, 0 replies; 6+ messages in thread
From: Xiao Ni @ 2020-11-17  3:21 UTC (permalink / raw)
  To: Zhao Heming, linux-raid, song, guoqing.jiang; +Cc: lidong.zhong, neilb, colyli

Hi Heming

The code part looks good to me. For the comment part, please modify it
as Song requested.

Reviewed-by: Xiao Ni <xni@redhat.com>

On 11/15/2020 12:30 PM, Zhao Heming wrote:
> [...]

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads:[~2020-11-17  3:21 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-15  4:30 [PATCH v3 0/2] md/cluster bugs fix Zhao Heming
2020-11-15  4:30 ` [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
2020-11-16  9:05   ` Song Liu
2020-11-15  4:30 ` [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
2020-11-15  4:39   ` heming.zhao
2020-11-17  3:21   ` Xiao Ni