* [PATCH v3 0/2] md/cluster bugs fix
@ 2020-11-15  4:30 Zhao Heming
  2020-11-15  4:30 ` [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
  2020-11-15  4:30 ` [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
  0 siblings, 2 replies; 6+ messages in thread
From: Zhao Heming @ 2020-11-15  4:30 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni
  Cc: Zhao Heming, lidong.zhong, neilb, colyli

Hello List,

Here are two patches to fix md-cluster bugs. Both bugs can be triggered
with the same test script:
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
 count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
#mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

For details, please check each patch's commit log.
-------
v3:
- patch 1/2
  - no change
- patch 2/2
  - use Xiao's solution to fix
  - revise commit log for the "How to fix" part
v2:
- patch 1/2
  - change patch subject
  - add test result in commit log
  - no change for code
- patch 2/2
  - add test result in commit log
  - add error handling of remove_disk in hot_remove_disk
  - add error handling of lock_comm in all callers
  - remove 5s timeout fix on the receive side (for process_metadata_update)
v1:
- create patch
-------

Zhao Heming (2):
  md/cluster: reshape should returns error when remote doing resyncing
    job
  md/cluster: fix deadlock when doing reshape job

 drivers/md/md-cluster.c | 69 +++++++++++++++++++++++------------------
 drivers/md/md.c         | 14 ++++++---
 2 files changed, 49 insertions(+), 34 deletions(-)

-- 
2.27.0

^ permalink raw reply	[flat|nested] 6+ messages in thread
* [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-15  4:30 [PATCH v3 0/2] md/cluster bugs fix Zhao Heming
@ 2020-11-15  4:30 ` Zhao Heming
  2020-11-16  9:05   ` Song Liu
  2020-11-15  4:30 ` [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
  1 sibling, 1 reply; 6+ messages in thread
From: Zhao Heming @ 2020-11-15  4:30 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni
  Cc: Zhao Heming, lidong.zhong, neilb, colyli

Test script (reproducible steps):
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
mdadm --zero-superblock /dev/sd{g,h,i}
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
 count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
#mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

Nodes A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB. The
larger the disks, the easier the issue is to trigger (a longer resync
window makes the race more likely to hit).

There is a workaround: adding "mdadm --wait" before the second --grow
makes the issue disappear.

Root cause:
In a cluster env, every node can start a resync job, even if the resync
cmd wasn't executed on it. E.g., there are two nodes A & B. The user
executes "mdadm --grow" on A, but sometimes B will start the resync job
instead of A. Current update_raid_disks() only checks the local
recovery status, which is incomplete.

How to fix:
The simple & clear solution is to block the reshape action on the
initiator side. When a node executes "--grow" and detects an ongoing
resync, it should immediately return & report an error to user space.
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
-- 
2.27.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* Re: [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job
  2020-11-15  4:30 ` [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
@ 2020-11-16  9:05   ` Song Liu
  0 siblings, 0 replies; 6+ messages in thread
From: Song Liu @ 2020-11-16  9:05 UTC (permalink / raw)
  To: Zhao Heming
  Cc: linux-raid, Guoqing Jiang, Xiao Ni, lidong.zhong, NeilBrown, Coly Li

On Sat, Nov 14, 2020 at 8:30 PM Zhao Heming <heming.zhao@suse.com> wrote:
>
[...]
>
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>

The fix makes sense to me. But I really hope we can improve the commit
log. I have made some changes to it, with a couple of TODOs for you (see
below). Please read it, fill in the TODOs, and revise 2/2 accordingly.

Thanks,
Song

md/cluster: block reshape with remote resync job

A reshape request should be blocked while a resync job is ongoing. In a
cluster env, a node can start a resync job even if the resync cmd isn't
executed on it, e.g., user executes "mdadm --grow" on node A, but
sometimes node B will start the resync job. However, current
update_raid_disks() only checks the local recovery status, which is
incomplete. As a result, we see (TODO describe observed issue).

Fix this issue by blocking the reshape request. When a node executes
"--grow" and detects an ongoing resync, it should stop and report an
error to the user.

The following script reproduces the issue with (TODO: ???%) probability.

```
# on node1, node2 is the remote node.
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Cc: <stable@vger.kernel.org>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>

> ---
>  drivers/md/md.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..74280e353b8f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
[...]

^ permalink raw reply	[flat|nested] 6+ messages in thread
* [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-15  4:30 [PATCH v3 0/2] md/cluster bugs fix Zhao Heming
  2020-11-15  4:30 ` [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
@ 2020-11-15  4:30 ` Zhao Heming
  2020-11-15  4:39   ` heming.zhao
  2020-11-17  3:21   ` Xiao Ni
  1 sibling, 2 replies; 6+ messages in thread
From: Zhao Heming @ 2020-11-15  4:30 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni
  Cc: Zhao Heming, lidong.zhong, neilb, colyli

The deadlock this patch fixes is very similar to the one in commit
0ba959774e93 ("md-cluster: use sync way to handle METADATA_UPDATED
msg"), except for step <c>: in 0ba959774e93 step <c> is "update sb",
here it is "mdadm --remove".

```
nodeA                       nodeB
--------------------        --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                            b.
                            md_do_sync
                             resync_info_update
                              send RESYNCING
                               + set MD_CLUSTER_SEND_LOCK
                               + wait for holding token_lockres:EX

                            c.
                            mdadm /dev/md0 --remove /dev/sdg
                             + held reconfig_mutex
                             + send REMOVE
                                + wait_event(MD_CLUSTER_SEND_LOCK)

                            d.
                            recv_daemon //METADATA_UPDATED from A
                             process_metadata_update
                              + (mddev_trylock(mddev) ||
                                 MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                //this time, both return false forever.
```

Explanation:

a>
A sends METADATA_UPDATED.
This blocks all other nodes from sending msgs in the cluster.

b>
B is doing a sync job, so it sends RESYNCING at intervals.
This blocks while waiting to hold the token_lockres:EX lock.
```
md_do_sync
 raid1_sync_request
  resync_info_update
   sendmsg //with mddev_locked: false
    lock_comm
     + wait_event(cinfo->wait,
     |      !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
     + lock_token //for step<a>, blocks on holding EX lock
        + dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); // blocking
```

c>
B does the "--remove" action and wants to send REMOVE.
This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
```
md_ioctl
 + mddev_lock(mddev) //holding reconfig_mutex, it influences <d>
 + hot_remove_disk
    remove_disk
     sendmsg //with mddev_locked: true
      lock_comm
       wait_event(cinfo->wait,
            !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));//blocking
```

d>
B receives the METADATA_UPDATED msg sent from A in step <a>.
This is blocked by step <c>: mdadm is holding the mddev lock, so
wait_event can't take the mddev lock. (btw,
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this scenario.)
```
process_metadata_update
 wait_event(mddev->thread->wqueue,
         (got_lock = mddev_trylock(mddev)) ||
          test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
```

Repro steps:

Test env

Nodes A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.

Test script
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
 count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Test result

The test script hangs when executing "mdadm --remove".
```
node1 # ps axj | grep mdadm
1  5423  5227  2231 ?  -1 D  0  0:00 mdadm /dev/md0 --remove /dev/sdg

node1 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdh[1] sdg[0](F)
      1046528 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (1/1046528)
      finish=354.0min speed=47K/sec
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

node2 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdg[0](F) sdh[1]
      1046528 blocks super 1.2 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

node1 # echo t > /proc/sysrq-trigger
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

How to fix:

Use the same approach as metadata_update_start. lock_comm &
metadata_update_start are two equal users that want to take the token
lock, so lock_comm can follow the same steps as metadata_update_start.
The patch moves the setting of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from
lock_token to lock_comm. It slightly enlarges the window in which
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set, but it is safe and breaks
the deadlock.

At last, thanks to Xiao for the solution.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md-cluster.c | 69 +++++++++++++++++++++++------------------
 drivers/md/md.c         |  6 ++--
 2 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 4aaf4820b6f6..405bcc5d513e 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -664,9 +664,27 @@ static void recv_daemon(struct md_thread *thread)
  * Takes the lock on the TOKEN lock resource so no other
  * node can communicate while the operation is underway.
  */
-static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
+static int lock_token(struct md_cluster_info *cinfo)
 {
-	int error, set_bit = 0;
+	int error;
+
+	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
+	if (error) {
+		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
+		       __func__, __LINE__, error);
+	} else {
+		/* Lock the receive sequence */
+		mutex_lock(&cinfo->recv_mutex);
+	}
+	return error;
+}
+
+/* lock_comm()
+ * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
+ */
+static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
+{
+	int rv, set_bit = 0;
 	struct mddev *mddev = cinfo->mddev;
 
 	/*
@@ -677,34 +695,19 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
 	 */
 	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
 				      &cinfo->state)) {
-		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+		rv = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
 					      &cinfo->state);
-		WARN_ON_ONCE(error);
+		WARN_ON_ONCE(rv);
 		md_wakeup_thread(mddev->thread);
 		set_bit = 1;
 	}
-	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
-	if (set_bit)
-		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 
-	if (error)
-		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
-				__func__, __LINE__, error);
-
-	/* Lock the receive sequence */
-	mutex_lock(&cinfo->recv_mutex);
-	return error;
-}
-
-/* lock_comm()
- * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
- */
-static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
-{
 	wait_event(cinfo->wait,
-		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
-
-	return lock_token(cinfo, mddev_locked);
+		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
+	rv = lock_token(cinfo);
+	if (set_bit)
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+	return rv;
 }
 
 static void unlock_comm(struct md_cluster_info *cinfo)
@@ -784,9 +787,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
 {
 	int ret;
 
-	lock_comm(cinfo, mddev_locked);
-	ret = __sendmsg(cinfo, cmsg);
-	unlock_comm(cinfo);
+	ret = lock_comm(cinfo, mddev_locked);
+	if (!ret) {
+		ret = __sendmsg(cinfo, cmsg);
+		unlock_comm(cinfo);
+	}
 	return ret;
 }
 
@@ -1061,7 +1066,7 @@ static int metadata_update_start(struct mddev *mddev)
 		return 0;
 	}
 
-	ret = lock_token(cinfo, 1);
+	ret = lock_token(cinfo);
 	clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 	return ret;
 }
@@ -1255,7 +1260,10 @@ static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
 	int raid_slot = -1;
 
 	md_update_sb(mddev, 1);
-	lock_comm(cinfo, 1);
+	if (lock_comm(cinfo, 1)) {
+		pr_err("%s: lock_comm failed\n", __func__);
+		return;
+	}
 
 	memset(&cmsg, 0, sizeof(cmsg));
 	cmsg.type = cpu_to_le32(METADATA_UPDATED);
@@ -1407,7 +1415,8 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
 	cmsg.type = cpu_to_le32(NEWDISK);
 	memcpy(cmsg.uuid, uuid, 16);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	lock_comm(cinfo, 1);
+	if (lock_comm(cinfo, 1))
+		return -EAGAIN;
 	ret = __sendmsg(cinfo, &cmsg);
 	if (ret) {
 		unlock_comm(cinfo);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 74280e353b8f..46da165afde2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6948,8 +6948,10 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
 		goto busy;
 
 kick_rdev:
-	if (mddev_is_clustered(mddev))
-		md_cluster_ops->remove_disk(mddev, rdev);
+	if (mddev_is_clustered(mddev)) {
+		if (md_cluster_ops->remove_disk(mddev, rdev))
+			goto busy;
+	}
 
 	md_kick_rdev_from_array(rdev);
 	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
-- 
2.27.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* Re: [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-15  4:30 ` [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
@ 2020-11-15  4:39   ` heming.zhao
  2020-11-17  3:21   ` Xiao Ni
  1 sibling, 0 replies; 6+ messages in thread
From: heming.zhao @ 2020-11-15  4:39 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang, xni; +Cc: lidong.zhong, neilb, colyli

On 11/15/20 12:30 PM, Zhao Heming wrote:
> ... ...
> How to fix:
>
> Use the same way of metadata_update_start. lock_comm &
> metadata_update_start are two equal users that want to get token lock.
> lock_comm could do the same steps like metadata_update_start.
> The patch moves setting MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from
> lock_token to lock_comm. It enlarge a little bit window of
> MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, but it is safe & can break deadlock
> perfectly.
>
> At last, thanks for Xiao's solution.
>

For easy review, I paste the key fix code from the v3 patch below.

Legacy code (original upstream code):
```
static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
{
	int error, set_bit = 0;
	struct mddev *mddev = cinfo->mddev;

	/*
	 * If resync thread run after raid1d thread, then process_metadata_update
	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
	 * since another node already got EX on Token and waitting the EX of Ack),
	 * so let resync wake up thread in case flag is set.
	 */
	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
				      &cinfo->state)) {
		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
					      &cinfo->state);
		WARN_ON_ONCE(error);
		md_wakeup_thread(mddev->thread);
		set_bit = 1;
	}
	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
	if (set_bit)
		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);

	if (error)
		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
				__func__, __LINE__, error);

	/* Lock the receive sequence */
	mutex_lock(&cinfo->recv_mutex);
	return error;
}

/* lock_comm()
 * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
 */
static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
{
	wait_event(cinfo->wait,
		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));

	return lock_token(cinfo, mddev_locked);
}
```

v3 patch:
```
static int lock_token(struct md_cluster_info *cinfo)
{
	int error;

	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
	if (error) {
		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
		       __func__, __LINE__, error);
	} else {
		/* Lock the receive sequence */
		mutex_lock(&cinfo->recv_mutex);
	}
	return error;
}

/* lock_comm()
 * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
 */
static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
{
	int rv, set_bit = 0;
	struct mddev *mddev = cinfo->mddev;

	/*
	 * If resync thread run after raid1d thread, then process_metadata_update
	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
	 * since another node already got EX on Token and waitting the EX of Ack),
	 * so let resync wake up thread in case flag is set.
	 */
	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
				      &cinfo->state)) {
		rv = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
					   &cinfo->state);
		WARN_ON_ONCE(rv);
		md_wakeup_thread(mddev->thread);
		set_bit = 1;
	}
	wait_event(cinfo->wait,
		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
	rv = lock_token(cinfo);
	if (set_bit)
		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
	return rv;
}
```

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-15  4:30 ` [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
  2020-11-15  4:39   ` heming.zhao
@ 2020-11-17  3:21   ` Xiao Ni
  1 sibling, 0 replies; 6+ messages in thread
From: Xiao Ni @ 2020-11-17  3:21 UTC (permalink / raw)
  To: Zhao Heming, linux-raid, song, guoqing.jiang; +Cc: lidong.zhong, neilb, colyli

Hi Heming

The code part looks good to me. For the comment part, please modify it
as Song requested.

Reviewed-by: Xiao Ni <xni@redhat.com>

On 11/15/2020 12:30 PM, Zhao Heming wrote:
> [...]

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads:[~2020-11-17  3:21 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-15  4:30 [PATCH v3 0/2] md/cluster bugs fix Zhao Heming
2020-11-15  4:30 ` [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
2020-11-16  9:05   ` Song Liu
2020-11-15  4:30 ` [PATCH v3 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
2020-11-15  4:39   ` heming.zhao
2020-11-17  3:21   ` Xiao Ni