stable.vger.kernel.org archive mirror
* [PATCH v4 0/2] md/cluster bugs fix
@ 2020-11-18 16:45 Zhao Heming
  2020-11-18 16:45 ` [PATCH v4 1/2] md/cluster: block reshape with remote resync job Zhao Heming
  2020-11-18 16:45 ` [PATCH v4 2/2] md/cluster: fix deadlock when node is doing " Zhao Heming
  0 siblings, 2 replies; 7+ messages in thread
From: Zhao Heming @ 2020-11-18 16:45 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni
  Cc: Zhao Heming, lidong.zhong, neilb, colyli, stable

Hello List,

This series contains two patches that fix md-cluster bugs.

Both bugs can be triggered with the same test script:

```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

For details, please check the commit log of each patch.

-------
v4:
- revise the subject & commit log of both patches
- no code changes
v3:
- patch 1/2
  - no change
- patch 2/2
  - adopt Xiao's solution for the fix
  - revise the "How to fix" part of the commit log
v2:
- patch 1/2
  - change patch subject
  - add test result to commit log
  - no code changes
- patch 2/2
  - add test result to commit log
  - add error handling of remove_disk in hot_remove_disk
  - add error handling of lock_comm in all callers
  - remove the 5s timeout fix on the receive side (for process_metadata_update)
v1:
- create patch
-------
Zhao Heming (2):
  md/cluster: block reshape with remote resync job
  md/cluster: fix deadlock when node is doing resync job

 drivers/md/md-cluster.c | 69 +++++++++++++++++++++++------------------
 drivers/md/md.c         | 14 ++++++---
 2 files changed, 49 insertions(+), 34 deletions(-)

-- 
2.27.0



* [PATCH v4 1/2] md/cluster: block reshape with remote resync job
  2020-11-18 16:45 [PATCH v4 0/2] md/cluster bugs fix Zhao Heming
@ 2020-11-18 16:45 ` Zhao Heming
  2020-11-18 17:14   ` Greg KH
  2020-11-18 16:45 ` [PATCH v4 2/2] md/cluster: fix deadlock when node is doing " Zhao Heming
  1 sibling, 1 reply; 7+ messages in thread
From: Zhao Heming @ 2020-11-18 16:45 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni
  Cc: Zhao Heming, lidong.zhong, neilb, colyli, stable

A reshape request should be blocked while a resync job is ongoing. In a
cluster environment, a node can start a resync job even if the resync
command wasn't executed on it; e.g., the user runs "mdadm --grow" on
node A, and sometimes node B starts the resync job. However, the current
update_raid_disks() only checks the local recovery status, which is
incomplete. As a result, "mdadm --grow" succeeds on the local node while
the remote node refuses to reshape because it is running a resync job.
This inconsistent handling puts the array into an unexpected state. If
the user doesn't notice the issue and keeps issuing mdadm commands, the
array eventually stops working.

Fix this issue by blocking the reshape request: when a node executes
"--grow" and detects an ongoing resync, it stops and reports an error
to the user.

The following script reproduces the issue with ~100% probability.
(two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
```
 # on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
-- 
2.27.0



* [PATCH v4 2/2] md/cluster: fix deadlock when node is doing resync job
  2020-11-18 16:45 [PATCH v4 0/2] md/cluster bugs fix Zhao Heming
  2020-11-18 16:45 ` [PATCH v4 1/2] md/cluster: block reshape with remote resync job Zhao Heming
@ 2020-11-18 16:45 ` Zhao Heming
  2020-11-18 16:49   ` heming.zhao
  2020-11-18 17:14   ` Greg KH
  1 sibling, 2 replies; 7+ messages in thread
From: Zhao Heming @ 2020-11-18 16:45 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni
  Cc: Zhao Heming, lidong.zhong, neilb, colyli, stable

md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can exclusively send
a message. While sending a message, a node can concurrently receive
messages from another node. When a node is doing a resync job, grabbing
token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) ||
                              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever
```
Explanation:
a. Node A sends METADATA_UPDATED.
   This blocks any other node from sending messages.

b. Node B is doing a resync job, which sends RESYNCING at intervals.
   This is blocked waiting to hold the token_lockres:EX lock.

c. Node B runs "mdadm --remove", which sends REMOVE.
   This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.

d. Node B receives the METADATA_UPDATED message sent by node A in
   step <a>. This is blocked by step <c>: the mddev lock is already
   held there, so mddev_trylock() in the wait_event can never succeed
   (and MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this
   scenario); see the sketch below.
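
The blocking condition in step <d> can be written out as a minimal
sketch (illustrative pseudocode paraphrasing the condition shown in the
diagram above, not a verbatim copy of the kernel source; recv_wqueue is
a placeholder name for the receive thread's wait queue):
```
/*
 * Step <d>: the receive daemon waits for either the mddev lock or the
 * MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD hint.  Here the mddev lock is held
 * by the mdadm ioctl stuck in step <c>, and the hint bit is never set,
 * so neither side of the OR can become true and the wait never ends.
 */
wait_event(recv_wqueue,
	   mddev_trylock(mddev) ||
	   test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
```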

A similar deadlock was handled in commit 0ba959774e93
("md-cluster: use sync way to handle METADATA_UPDATED msg").
In that commit, step c was "update sb"; in this patch, step c is
"mdadm --remove".

To fix this issue, refer to the approach of metadata_update_start(),
which performs the same lock_token grab: lock_comm() can follow the
same steps to avoid the deadlock, by moving the
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD handling from lock_token() into
lock_comm(). This slightly enlarges the window in which
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set, but it is safe and breaks
the deadlock.
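
Condensed from the diff below, the reordered flow in lock_comm() looks
roughly like this (a simplified sketch, not the full patch; the actual
change is in the diff that follows):
```
static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
{
	int rv, set_bit = 0;

	/* Set the HOLDING_MUTEX_FOR_RECVD hint *before* waiting, so the
	 * receive daemon can keep processing METADATA_UPDATED while this
	 * node waits for the send channel and token_lockres:EX. */
	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
				      &cinfo->state)) {
		rv = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
					   &cinfo->state);
		WARN_ON_ONCE(rv);
		md_wakeup_thread(cinfo->mddev->thread);
		set_bit = 1;
	}
	wait_event(cinfo->wait,
		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
	rv = lock_token(cinfo);	/* grabs token_lockres:EX */
	if (set_bit)
		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
				 &cinfo->state);
	return rv;
}
```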

Repro steps (I only triggered it 3 times in hundreds of tests):

two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

The test script hangs when executing "mdadm --remove".

```
 # dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

Finally, thanks to Xiao for the solution.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/md-cluster.c | 69 +++++++++++++++++++++++------------------
 drivers/md/md.c         |  6 ++--
 2 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 4aaf4820b6f6..405bcc5d513e 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -664,9 +664,27 @@ static void recv_daemon(struct md_thread *thread)
  * Takes the lock on the TOKEN lock resource so no other
  * node can communicate while the operation is underway.
  */
-static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
+static int lock_token(struct md_cluster_info *cinfo)
 {
-	int error, set_bit = 0;
+	int error;
+
+	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
+	if (error) {
+		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
+				__func__, __LINE__, error);
+	} else {
+		/* Lock the receive sequence */
+		mutex_lock(&cinfo->recv_mutex);
+	}
+	return error;
+}
+
+/* lock_comm()
+ * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
+ */
+static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
+{
+	int rv, set_bit = 0;
 	struct mddev *mddev = cinfo->mddev;
 
 	/*
@@ -677,34 +695,19 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
 	 */
 	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
 				      &cinfo->state)) {
-		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+		rv = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
 					      &cinfo->state);
-		WARN_ON_ONCE(error);
+		WARN_ON_ONCE(rv);
 		md_wakeup_thread(mddev->thread);
 		set_bit = 1;
 	}
-	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
-	if (set_bit)
-		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 
-	if (error)
-		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
-				__func__, __LINE__, error);
-
-	/* Lock the receive sequence */
-	mutex_lock(&cinfo->recv_mutex);
-	return error;
-}
-
-/* lock_comm()
- * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
- */
-static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
-{
 	wait_event(cinfo->wait,
-		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
-
-	return lock_token(cinfo, mddev_locked);
+			   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
+	rv = lock_token(cinfo);
+	if (set_bit)
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+	return rv;
 }
 
 static void unlock_comm(struct md_cluster_info *cinfo)
@@ -784,9 +787,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
 {
 	int ret;
 
-	lock_comm(cinfo, mddev_locked);
-	ret = __sendmsg(cinfo, cmsg);
-	unlock_comm(cinfo);
+	ret = lock_comm(cinfo, mddev_locked);
+	if (!ret) {
+		ret = __sendmsg(cinfo, cmsg);
+		unlock_comm(cinfo);
+	}
 	return ret;
 }
 
@@ -1061,7 +1066,7 @@ static int metadata_update_start(struct mddev *mddev)
 		return 0;
 	}
 
-	ret = lock_token(cinfo, 1);
+	ret = lock_token(cinfo);
 	clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 	return ret;
 }
@@ -1255,7 +1260,10 @@ static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
 	int raid_slot = -1;
 
 	md_update_sb(mddev, 1);
-	lock_comm(cinfo, 1);
+	if (lock_comm(cinfo, 1)) {
+		pr_err("%s: lock_comm failed\n", __func__);
+		return;
+	}
 
 	memset(&cmsg, 0, sizeof(cmsg));
 	cmsg.type = cpu_to_le32(METADATA_UPDATED);
@@ -1407,7 +1415,8 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
 	cmsg.type = cpu_to_le32(NEWDISK);
 	memcpy(cmsg.uuid, uuid, 16);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	lock_comm(cinfo, 1);
+	if (lock_comm(cinfo, 1))
+		return -EAGAIN;
 	ret = __sendmsg(cinfo, &cmsg);
 	if (ret) {
 		unlock_comm(cinfo);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 74280e353b8f..46da165afde2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6948,8 +6948,10 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
 		goto busy;
 
 kick_rdev:
-	if (mddev_is_clustered(mddev))
-		md_cluster_ops->remove_disk(mddev, rdev);
+	if (mddev_is_clustered(mddev)) {
+		if (md_cluster_ops->remove_disk(mddev, rdev))
+			goto busy;
+	}
 
 	md_kick_rdev_from_array(rdev);
 	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
-- 
2.27.0



* Re: [PATCH v4 2/2] md/cluster: fix deadlock when node is doing resync job
  2020-11-18 16:45 ` [PATCH v4 2/2] md/cluster: fix deadlock when node is doing " Zhao Heming
@ 2020-11-18 16:49   ` heming.zhao
  2020-11-18 17:14   ` Greg KH
  1 sibling, 0 replies; 7+ messages in thread
From: heming.zhao @ 2020-11-18 16:49 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni; +Cc: lidong.zhong, neilb, colyli, stable

On 11/19/20 12:45 AM, Zhao Heming wrote:
> md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can exclusively send
> a message. While sending a message, a node can concurrently receive
> messages from another node.
> [... ...]
> 
> Repro steps (I only triggered it 3 times in hundreds of tests):

Sorry for sending the v4 patch so late.
I spent more than 2 days running the test script trying to trigger the
deadlock again, but failed. That's why I wrote "I only triggered it 3
times in hundreds of tests" in the commit log.

> 
> two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
> ```
> ssh root@node2 "mdadm -S --scan"
> [... ...]
> 
> Finally, thanks to Xiao for the solution.
> 
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>
> Suggested-by: Xiao Ni <xni@redhat.com>
> Reviewed-by: Xiao Ni <xni@redhat.com>
> ---
>   drivers/md/md-cluster.c | 69 +++++++++++++++++++++++------------------
>   drivers/md/md.c         |  6 ++--
>   2 files changed, 43 insertions(+), 32 deletions(-)
> 



* Re: [PATCH v4 1/2] md/cluster: block reshape with remote resync job
  2020-11-18 16:45 ` [PATCH v4 1/2] md/cluster: block reshape with remote resync job Zhao Heming
@ 2020-11-18 17:14   ` Greg KH
  0 siblings, 0 replies; 7+ messages in thread
From: Greg KH @ 2020-11-18 17:14 UTC (permalink / raw)
  To: Zhao Heming
  Cc: linux-raid, song, guoqing.jiang, xni, lidong.zhong, neilb,
	colyli, stable

On Thu, Nov 19, 2020 at 12:45:53AM +0800, Zhao Heming wrote:
> A reshape request should be blocked while a resync job is ongoing. In a
> cluster environment, a node can start a resync job even if the resync
> command wasn't executed on it; e.g., the user runs "mdadm --grow" on
> node A, and sometimes node B starts the resync job. However, the current
> update_raid_disks() only checks the local recovery status, which is
> incomplete. As a result, "mdadm --grow" succeeds on the local node while
> the remote node refuses to reshape because it is running a resync job.
> This inconsistent handling puts the array into an unexpected state. If
> the user doesn't notice the issue and keeps issuing mdadm commands, the
> array eventually stops working.
> 
> Fix this issue by blocking the reshape request: when a node executes
> "--grow" and detects an ongoing resync, it stops and reports an error
> to the user.
> 
> The following script reproduces the issue with ~100% probability.
> (two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
> ```
>  # on node1, node2 is the remote node.
> ssh root@node2 "mdadm -S --scan"
> mdadm -S --scan
> for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
> count=20; done
> 
> mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
> ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
> 
> sleep 5
> 
> mdadm --manage --add /dev/md0 /dev/sdi
> mdadm --wait /dev/md0
> mdadm --grow --raid-devices=3 /dev/md0
> 
> mdadm /dev/md0 --fail /dev/sdg
> mdadm /dev/md0 --remove /dev/sdg
> mdadm --grow --raid-devices=2 /dev/md0
> ```
> 
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>
> ---
>  drivers/md/md.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>


* Re: [PATCH v4 2/2] md/cluster: fix deadlock when node is doing resync job
  2020-11-18 16:45 ` [PATCH v4 2/2] md/cluster: fix deadlock when node is doing " Zhao Heming
  2020-11-18 16:49   ` heming.zhao
@ 2020-11-18 17:14   ` Greg KH
  1 sibling, 0 replies; 7+ messages in thread
From: Greg KH @ 2020-11-18 17:14 UTC (permalink / raw)
  To: Zhao Heming
  Cc: linux-raid, song, guoqing.jiang, xni, lidong.zhong, neilb,
	colyli, stable

On Thu, Nov 19, 2020 at 12:45:54AM +0800, Zhao Heming wrote:
> md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can exclusively send
> a message. While sending a message, a node can concurrently receive
> messages from another node. When a node is doing a resync job, grabbing
> token_lockres:EX may trigger a deadlock:
> ```
> nodeA                       nodeB
> --------------------     --------------------
> a.
> send METADATA_UPDATED
> held token_lockres:EX
>                          b.
>                          md_do_sync
>                           resync_info_update
>                             send RESYNCING
>                              + set MD_CLUSTER_SEND_LOCK
>                              + wait for holding token_lockres:EX
> 
>                          c.
>                          mdadm /dev/md0 --remove /dev/sdg
>                           + held reconfig_mutex
>                           + send REMOVE
>                              + wait_event(MD_CLUSTER_SEND_LOCK)
> 
>                          d.
>                          recv_daemon //METADATA_UPDATED from A
>                           process_metadata_update
>                            + (mddev_trylock(mddev) ||
>                               MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
>                              //this time, both return false forever
> ```
> Explanation:
> a. Node A sends METADATA_UPDATED.
>    This blocks any other node from sending messages.
> 
> b. Node B is doing a resync job, which sends RESYNCING at intervals.
>    This is blocked waiting to hold the token_lockres:EX lock.
> 
> c. Node B runs "mdadm --remove", which sends REMOVE.
>    This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
> 
> d. Node B receives the METADATA_UPDATED message sent by node A in
>    step <a>. This is blocked by step <c>: the mddev lock is already
>    held there, so mddev_trylock() in the wait_event can never succeed
>    (and MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this
>    scenario).
> 
> A similar deadlock was handled in commit 0ba959774e93
> ("md-cluster: use sync way to handle METADATA_UPDATED msg").
> In that commit, step c was "update sb"; in this patch, step c is
> "mdadm --remove".
> 
> To fix this issue, refer to the approach of metadata_update_start(),
> which performs the same lock_token grab: lock_comm() can follow the
> same steps to avoid the deadlock, by moving the
> MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD handling from lock_token() into
> lock_comm(). This slightly enlarges the window in which
> MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set, but it is safe and breaks
> the deadlock.
> 
> Repro steps (I only triggered it 3 times in hundreds of tests):
> 
> two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
> ```
> ssh root@node2 "mdadm -S --scan"
> mdadm -S --scan
> for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
> count=20; done
> 
> mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
>  --bitmap-chunk=1M
> ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
> 
> sleep 5
> 
> mkfs.xfs /dev/md0
> mdadm --manage --add /dev/md0 /dev/sdi
> mdadm --wait /dev/md0
> mdadm --grow --raid-devices=3 /dev/md0
> 
> mdadm /dev/md0 --fail /dev/sdg
> mdadm /dev/md0 --remove /dev/sdg
> mdadm --grow --raid-devices=2 /dev/md0
> ```
> 
> The test script hangs when executing "mdadm --remove".
> 
> ```
>  # dump stacks by "echo t > /proc/sysrq-trigger"
> md0_cluster_rec D    0  5329      2 0x80004000
> Call Trace:
>  __schedule+0x1f6/0x560
>  ? _cond_resched+0x2d/0x40
>  ? schedule+0x4a/0xb0
>  ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
>  ? wait_woken+0x80/0x80
>  ? process_recvd_msg+0x113/0x1d0 [md_cluster]
>  ? recv_daemon+0x9e/0x120 [md_cluster]
>  ? md_thread+0x94/0x160 [md_mod]
>  ? wait_woken+0x80/0x80
>  ? md_congested+0x30/0x30 [md_mod]
>  ? kthread+0x115/0x140
>  ? __kthread_bind_mask+0x60/0x60
>  ? ret_from_fork+0x1f/0x40
> 
> mdadm           D    0  5423      1 0x00004004
> Call Trace:
>  __schedule+0x1f6/0x560
>  ? __schedule+0x1fe/0x560
>  ? schedule+0x4a/0xb0
>  ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
>  ? wait_woken+0x80/0x80
>  ? remove_disk+0x4f/0x90 [md_cluster]
>  ? hot_remove_disk+0xb1/0x1b0 [md_mod]
>  ? md_ioctl+0x50c/0xba0 [md_mod]
>  ? wait_woken+0x80/0x80
>  ? blkdev_ioctl+0xa2/0x2a0
>  ? block_ioctl+0x39/0x40
>  ? ksys_ioctl+0x82/0xc0
>  ? __x64_sys_ioctl+0x16/0x20
>  ? do_syscall_64+0x5f/0x150
>  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> md0_resync      D    0  5425      2 0x80004000
> Call Trace:
>  __schedule+0x1f6/0x560
>  ? schedule+0x4a/0xb0
>  ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
>  ? wait_woken+0x80/0x80
>  ? lock_token+0x2d/0x90 [md_cluster]
>  ? resync_info_update+0x95/0x100 [md_cluster]
>  ? raid1_sync_request+0x7d3/0xa40 [raid1]
>  ? md_do_sync.cold+0x737/0xc8f [md_mod]
>  ? md_thread+0x94/0x160 [md_mod]
>  ? md_congested+0x30/0x30 [md_mod]
>  ? kthread+0x115/0x140
>  ? __kthread_bind_mask+0x60/0x60
>  ? ret_from_fork+0x1f/0x40
> ```
> 
> Finally, thanks to Xiao for the solution.
> 
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>
> Suggested-by: Xiao Ni <xni@redhat.com>
> Reviewed-by: Xiao Ni <xni@redhat.com>
> ---
>  drivers/md/md-cluster.c | 69 +++++++++++++++++++++++------------------
>  drivers/md/md.c         |  6 ++--
>  2 files changed, 43 insertions(+), 32 deletions(-)

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>


* [PATCH v4 1/2] md/cluster: block reshape with remote resync job
       [not found] <1605786094-5582-1-git-send-email-heming.zhao@suse.com>
@ 2020-11-19 11:41 ` Zhao Heming
  0 siblings, 0 replies; 7+ messages in thread
From: Zhao Heming @ 2020-11-19 11:41 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni
  Cc: Zhao Heming, lidong.zhong, neilb, colyli, stable

A reshape request should be blocked while a resync job is ongoing. In a
cluster environment, a node can start a resync job even if the resync
command wasn't executed on it; e.g., the user runs "mdadm --grow" on
node A, and sometimes node B starts the resync job. However, the current
update_raid_disks() only checks the local recovery status, which is
incomplete. As a result, "mdadm --grow" succeeds on the local node while
the remote node refuses to reshape because it is running a resync job.
This inconsistent handling puts the array into an unexpected state. If
the user doesn't notice the issue and keeps issuing mdadm commands, the
array eventually stops working.

Fix this issue by blocking the reshape request: when a node executes
"--grow" and detects an ongoing resync, it stops and reports an error
to the user.

The following script reproduces the issue with ~100% probability.
(two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
```
 # on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
-- 
2.27.0


