* [PATCH 0/2] md-cluster bugs fix @ 2020-11-08 14:52 Zhao Heming 2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Zhao Heming @ 2020-11-08 14:52 UTC (permalink / raw) To: linux-raid, song, guoqing.jiang Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli Hello List, I filed two patches to fix 2 different md-cluster bugs. For easy understanding, Let us call issue 1 (releted with patch 1), and issue 2 (related with patch 2). *** Test env *** node A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB. The disk size is more large the issues are more likely to trigger. (more resync time, more easily trigger issues) *** Test script *** Issue 1 & 2 can use same test script to trigger: ``` ssh root@node2 "mdadm -S --scan" mdadm -S --scan for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ count=20; done echo "mdadm create array" mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \ --bitmap-chunk=1M echo "set up array on node2" ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" sleep 5 mkfs.xfs /dev/md0 mdadm --manage --add /dev/md0 /dev/sdi mdadm --wait /dev/md0 mdadm --grow --raid-devices=3 /dev/md0 mdadm /dev/md0 --fail /dev/sdg mdadm /dev/md0 --remove /dev/sdg #mdadm --wait /dev/md0 mdadm --grow --raid-devices=2 /dev/md0 ``` There is a workaround: when adding the --wait before second --grow, the issue 1 will disappear. *** error behavior *** issue 1: test script can finish every cmds in script, but array status is wrong. issue 2: test script will hung when executing "mdadm --remove". array status of issue 1: (part of output by: mdadm -D /dev/md0) <case 1> : normal test result. ``` Number Major Minor RaidDevice State 1 8 112 0 active sync /dev/sdh 2 8 128 1 active sync /dev/sdi ``` <case 2> : "--faulty" data still exist on disk metadata area. ``` Number Major Minor RaidDevice State - 0 0 0 removed 1 8 112 1 active sync /dev/sdh 2 8 128 2 active sync /dev/sdi 0 8 96 - faulty /dev/sdg ``` <case 3> : "--remove" data still exist on disk metadata area. ``` Number Major Minor RaidDevice State - 0 0 0 removed 1 8 112 1 active sync /dev/sdh 2 8 128 2 active sync /dev/sdi ``` array status of issue 2: Hunging info: ``` node1 # ps axj | grep mdadm 1 5423 5227 2231 ? -1 D 0 0:00 mdadm /dev/md0 --remove /dev/sdg node1 # cat /proc/mdstat Personalities : [raid1] md0 : active raid1 sdi[2] sdh[1] sdg[0](F) 1046528 blocks super 1.2 [2/1] [_U] [>....................] recovery = 0.0% (1/1046528) finish=354.0min speed=47K/sec bitmap: 1/1 pages [4KB], 1024KB chunk unused devices: <none> node2 # cat /proc/mdstat Personalities : [raid1] md0 : active raid1 sdi[2] sdg[0](F) sdh[1] 1046528 blocks super 1.2 [2/1] [_U] bitmap: 1/1 pages [4KB], 1024KB chunk unused devices: <none> echo t > /proc/sysrq-trigger md0_cluster_rec D 0 5329 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? _cond_resched+0x2d/0x40 ? schedule+0x4a/0xb0 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster] ? wait_woken+0x80/0x80 ? process_recvd_msg+0x113/0x1d0 [md_cluster] ? recv_daemon+0x9e/0x120 [md_cluster] ? md_thread+0x94/0x160 [md_mod] ? wait_woken+0x80/0x80 ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 mdadm D 0 5423 1 0x00004004 Call Trace: __schedule+0x1f6/0x560 ? __schedule+0x1fe/0x560 ? schedule+0x4a/0xb0 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster] ? wait_woken+0x80/0x80 ? 
remove_disk+0x4f/0x90 [md_cluster] ? hot_remove_disk+0xb1/0x1b0 [md_mod] ? md_ioctl+0x50c/0xba0 [md_mod] ? wait_woken+0x80/0x80 ? blkdev_ioctl+0xa2/0x2a0 ? block_ioctl+0x39/0x40 ? ksys_ioctl+0x82/0xc0 ? __x64_sys_ioctl+0x16/0x20 ? do_syscall_64+0x5f/0x150 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 md0_resync D 0 5425 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? schedule+0x4a/0xb0 ? dlm_lock_sync+0xa1/0xd0 [md_cluster] ? wait_woken+0x80/0x80 ? lock_token+0x2d/0x90 [md_cluster] ? resync_info_update+0x95/0x100 [md_cluster] ? raid1_sync_request+0x7d3/0xa40 [raid1] ? md_do_sync.cold+0x737/0xc8f [md_mod] ? md_thread+0x94/0x160 [md_mod] ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 ``` *** analysis *** <issue 1> In cluster env, every node can start resync job even if the resync cmd doesn't execute on it. e.g. There are two node A & B. User executes "mdadm --grow" on A, sometime B will start resync job not A. problem: Current update_raid_disks() only check local recovery status, it's incomplete. issue scenario ``` node A node B ------------------------------ --------------- mdadm --grow -n 3 md0 + raid1_reshape mddev->raid_disks: 2=>3 start resync job, it will block A resync job mddev->raid_disks: 2=>3 mdadm md0 --fail sdg + update disk: array sb & bitmap sb + send METADATA_UPDATE (resync job blocking) (B continue doing "--grow -n 3" resync job) recv METADATA_UPDATE + read disk metadata + raid1_error + set MD_RECOVERY_INTR to break resync ... ... md_check_recovery + remove_and_add_spares return 1 + set MD_RECOVERY_RECOVER, later restart resync mdadm md0 --remove sdg + md_cluster_ops->remove_disk | + send REMOVE + md_kick_rdev_from_array + update disk: array sb & bitmap sb (resync job blocking) (B continue doing "--grow -n 3" resync job) recv REMOVE + process_remove_disk doesn't set mddev->sb_flags, so it doesn't update disk sb & bitmap sb. ...... md_check_recovery + md_kick_rdev_from_array mdadm --grow -n 2 md0 + raid1_reshape | mddev->raid_disks: 3=>2 + send METADATA_UPDATED (B continue doing "--grow -n 3" resync job) recv METADATA_UPDATE + check_sb_changes update_raid_disks return -EBUSY update failed for mddev->raid_disks: 3=>2 (B never successfully update mddev->raid_disks: 3=>2) when B finish "--grow -n 3" resync job + use mddev->raid_disks:3 to update array sb & bitmap sb + send METADATA_UPDATED recv METADATA_UPDATED + read wrong raid_disks to update kernel data. ``` <issue 2> First, There is a similar deadlock in commit 0ba959774e93911caff596de6391f085fb640ac4 Let me explain commit 0ba959774e first. ``` <origin scenario> nodeA nodeB -------------------- -------------------- a. send METADATA_UPDATED held token_lockres:EX b. md_do_sync resync_info_update send RESYNCING + set MD_CLUSTER_SEND_LOCK + wait for holding token_lockres:EX c. update sb + held reconfig_mutex + metadata_update_start + wait_event(MD_CLUSTER_SEND_LOCK) //blocking from <b> d. recv_daemon //METADATA_UPDATED from A process_metadata_update + mddev_trylock(mddev) return false //blocking from <c> <after introduction "MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD"> nodeA nodeB -------------------- -------------------- a. send METADATA_UPDATED held token_lockres:EX b. md_do_sync resync_info_update send RESYNCING + set MD_CLUSTER_SEND_LOCK + wait for holding token_lockres:EX c. update sb + held reconfig_mutex + metadata_update_start wait for + set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD //for breaking <d> + wait_event(MD_CLUSTER_SEND_LOCK) d. 
recv_daemon //METADATA_UPDATED from A process_metadata_update + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD) //this will non-block & break deadlock. ``` the issue 2 is very like 0ba959774e, except <c>. ``` nodeA nodeB -------------------- -------------------- a. send METADATA_UPDATED held token_lockres:EX b. md_do_sync resync_info_update send RESYNCING + set MD_CLUSTER_SEND_LOCK + wait for holding token_lockres:EX c. mdadm /dev/md0 --remove /dev/sdg + held reconfig_mutex + send REMOVE + wait_event(MD_CLUSTER_SEND_LOCK) d. recv_daemon //METADATA_UPDATED from A process_metadata_update + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD) //this time, both return false forever. ``` commit 0ba959774e9391 uses MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD to break deadlock, but in issue 2, it won't help. md_cluster_ops->remove_disk called from: - state_store() this doesn't hold reconfig_mutex - hot_remove_disk() this must hold reconfig_mutex There are two method to fix. 1. like commit 0ba959774e, held reconfig_mutex in state_store, and set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in remove_disk. 2. like patch 2, change wait_event to wait_event_timeout in lock_comm & process_metadata_update. there are some reason I prefer method 2: - I am not very familiar with all scenario in state_store(). I am not sure if holding reconfig_mutex can cause new bug/issue. - It looks all sendmsg cases could trigger issue 2. Current we found two cases: (maybe there have other cases) - update sb (see commit 0ba959774) - mdadm --remove (issue 2) we should break the deadlock in key code (wait_event => wait_event_timeout). ------- v1: - create patch ------- Zhao Heming (2): md/cluster: reshape should returns error when remote doing resyncing job md/cluster: fix deadlock when doing reshape job drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++--------------- drivers/md/md.c | 8 ++++++-- 2 files changed, 33 insertions(+), 17 deletions(-) -- 2.27.0 ^ permalink raw reply [flat|nested] 12+ messages in thread
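
The three-way wait described in the cover letter can be reduced to a small userspace model, which may help when reading the stack traces above. The sketch below is not md-cluster code: pthread objects stand in for the DLM token, MD_CLUSTER_SEND_LOCK and reconfig_mutex, and every identifier is invented for the illustration. With a plain condition wait in the ioctl thread, the three threads block on each other exactly as in the hang; the bounded wait shown here (the direction patch 2 takes) lets the command fail and the receive path make progress.

```c
/* Userspace pthread model of the md-cluster hang: "node A" is reduced to a
 * token that is only released once the local receive thread has processed
 * METADATA_UPDATED. Build with: cc -pthread model.c (hypothetical file name). */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;       /* protects flags */
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static bool send_lock;        /* stand-in for MD_CLUSTER_SEND_LOCK             */
static bool token_free;       /* stand-in for the DLM token held by "node A"   */
static pthread_mutex_t reconfig = PTHREAD_MUTEX_INITIALIZER; /* reconfig_mutex */

/* b: resync takes the send lock, then needs the token to send RESYNCING. */
static void *resync_thread(void *arg)
{
	pthread_mutex_lock(&lk);
	send_lock = true;
	while (!token_free)                     /* blocked until "node A" lets go */
		pthread_cond_wait(&cv, &lk);
	send_lock = false;
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&lk);
	puts("resync: RESYNCING finally sent");
	return NULL;
}

/* c: "mdadm --remove" holds the reconfig mutex, then needs the send lock.
 * The bounded wait below is the one difference that breaks the cycle;
 * a plain pthread_cond_wait() here hangs all three threads forever. */
static void *ioctl_thread(void *arg)
{
	struct timespec to;
	int rc = 0;

	pthread_mutex_lock(&reconfig);
	clock_gettime(CLOCK_REALTIME, &to);
	to.tv_sec += 2;
	pthread_mutex_lock(&lk);
	while (send_lock && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&cv, &lk, &to);
	pthread_mutex_unlock(&lk);
	pthread_mutex_unlock(&reconfig);
	puts(rc == ETIMEDOUT ? "ioctl: timed out, returning an error to mdadm"
			     : "ioctl: send lock became free, REMOVE would go out now");
	return NULL;
}

/* d: the receive daemon needs the reconfig mutex to apply METADATA_UPDATED;
 * only after that does "node A" release the token the resync thread wants. */
static void *recv_thread(void *arg)
{
	usleep(200 * 1000);                     /* message arrives a bit later   */
	while (pthread_mutex_trylock(&reconfig))
		usleep(100 * 1000);
	puts("recv: METADATA_UPDATED processed");
	pthread_mutex_unlock(&reconfig);

	pthread_mutex_lock(&lk);
	token_free = true;                      /* node A can now drop the token */
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&lk);
	return NULL;
}

int main(void)
{
	pthread_t a, b, c;

	pthread_create(&a, NULL, resync_thread, NULL);
	sleep(1);                               /* let resync take the send lock */
	pthread_create(&b, NULL, ioctl_thread, NULL);
	pthread_create(&c, NULL, recv_thread, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_join(c, NULL);
	return 0;
}
```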
* [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job 2020-11-08 14:52 [PATCH 0/2] md-cluster bugs fix Zhao Heming @ 2020-11-08 14:53 ` Zhao Heming 2020-11-09 18:01 ` Song Liu 2020-11-10 6:38 ` Guoqing Jiang 2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming 2020-11-09 17:43 ` [PATCH 0/2] md-cluster bugs fix Song Liu 2 siblings, 2 replies; 12+ messages in thread From: Zhao Heming @ 2020-11-08 14:53 UTC (permalink / raw) To: linux-raid, song, guoqing.jiang Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli Test script (reproducible steps): ``` ssh root@node2 "mdadm -S --scan" mdadm -S --scan mdadm --zero-superblock /dev/sd{g,h,i} for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ count=20; done echo "mdadm create array" mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh echo "set up array on node2" ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" sleep 5 mdadm --manage --add /dev/md0 /dev/sdi mdadm --wait /dev/md0 mdadm --grow --raid-devices=3 /dev/md0 mdadm /dev/md0 --fail /dev/sdg mdadm /dev/md0 --remove /dev/sdg #mdadm --wait /dev/md0 mdadm --grow --raid-devices=2 /dev/md0 ``` node A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB, and the disk size is more large the issue is more likely to trigger. (more resync time, more easily trigger issues) There is a workaround: when adding the --wait before second --grow, the issue 1 will disappear. Rootcause: In cluster env, every node can start resync job even if the resync cmd doesn't execute on it. e.g. There are two node A & B. User executes "mdadm --grow" on A, sometime B will start resync job not A. Current update_raid_disks() only check local recovery status, it's incomplete. How to fix: The simple & clear solution is block the reshape action in initiator side. When node is executing "--grow" and detecting there is ongoing resyncing, it should immediately return & report error to user space. Signed-off-by: Zhao Heming <heming.zhao@suse.com> --- drivers/md/md.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 98bac4f304ae..74280e353b8f 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks) return -EINVAL; if (mddev->sync_thread || test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || + test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) || mddev->reshape_position != MaxSector) return -EBUSY; @@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev) } } - if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) - update_raid_disks(mddev, le32_to_cpu(sb->raid_disks)); + if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) { + ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks)); + if (ret) + pr_warn("md: updating array disks failed. %d\n", ret); + } /* * Since mddev->delta_disks has already updated in update_raid_disks, -- 2.27.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
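
For readers less familiar with the md recovery flags, the gating logic this patch adds can be modelled in a few lines of userspace C. The flag and function names below intentionally mirror the kernel ones, but the types and helpers are invented for the example; it is a sketch of the idea, not the kernel implementation.

```c
/* Illustrative model of the check added to update_raid_disks(): refuse a
 * reshape while either a local sync thread runs or a remote node resyncs,
 * and have the caller log the failure instead of silently dropping it. */
#include <errno.h>
#include <stdio.h>

enum {
	RECOVERY_RUNNING = 1 << 0,      /* local resync/recovery in flight */
	RESYNCING_REMOTE = 1 << 1,      /* another node sent RESYNCING     */
};

struct array_state {
	unsigned int recovery;          /* bitmask of the flags above      */
	int raid_disks;
};

static int update_raid_disks(struct array_state *a, int raid_disks)
{
	if (a->recovery & (RECOVERY_RUNNING | RESYNCING_REMOTE))
		return -EBUSY;          /* the new remote check fires here */
	a->raid_disks = raid_disks;
	return 0;
}

/* Models the check_sb_changes() hunk: the return value is now inspected. */
static void check_sb_changes(struct array_state *a, int sb_raid_disks)
{
	if (a->raid_disks != sb_raid_disks) {
		int ret = update_raid_disks(a, sb_raid_disks);
		if (ret)
			fprintf(stderr, "md: updating array disks failed. %d\n", ret);
	}
}

int main(void)
{
	struct array_state a = { .recovery = RESYNCING_REMOTE, .raid_disks = 3 };

	check_sb_changes(&a, 2);        /* refused while the remote resync runs */
	a.recovery = 0;
	check_sb_changes(&a, 2);        /* accepted once the array is idle      */
	printf("raid_disks = %d\n", a.raid_disks);
	return 0;
}
```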
* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job 2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming @ 2020-11-09 18:01 ` Song Liu 2020-11-10 6:38 ` Guoqing Jiang 1 sibling, 0 replies; 12+ messages in thread From: Song Liu @ 2020-11-09 18:01 UTC (permalink / raw) To: Zhao Heming Cc: linux-raid, Guoqing Jiang, lidong.zhong, Xiao Ni, NeilBrown, Coly Li On Sun, Nov 8, 2020 at 6:53 AM Zhao Heming <heming.zhao@suse.com> wrote: > [...] > How to fix: > > The simple & clear solution is block the reshape action in initiator > side. When node is executing "--grow" and detecting there is ongoing > resyncing, it should immediately return & report error to user space. > > Signed-off-by: Zhao Heming <heming.zhao@suse.com> The code looks good to me. But please revise the commit log as something similar to the following. ========================== 8< ========================== md/cluster: block reshape requests with resync job initiated from remote node In cluster env, a node can start resync job when the resync cmd was executed on a different node. Reshape requests should be blocked for resync job initiated by any node. Current code only <condition to block reshape requests>. This results in a dead lock in <condition> (see repro steps below). Fix this by <adding the extra check>. Repro steps: ... ========================== 8< ========================== In this way, whoever reading the commit log, which could be yourself in 2021, will understand the primary goal of this change quickly. Does this make sense? Thanks, Song > --- > drivers/md/md.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 98bac4f304ae..74280e353b8f 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks) > return -EINVAL; > if (mddev->sync_thread || > test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || > + test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) || > mddev->reshape_position != MaxSector) > return -EBUSY; > > @@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev) > } > } > > - if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) > - update_raid_disks(mddev, le32_to_cpu(sb->raid_disks)); > + if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) { > + ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks)); > + if (ret) > + pr_warn("md: updating array disks failed. %d\n", ret); > + } > > /* > * Since mddev->delta_disks has already updated in update_raid_disks, > -- > 2.27.0 > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job 2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming 2020-11-09 18:01 ` Song Liu @ 2020-11-10 6:38 ` Guoqing Jiang 2020-11-10 6:59 ` heming.zhao 1 sibling, 1 reply; 12+ messages in thread From: Guoqing Jiang @ 2020-11-10 6:38 UTC (permalink / raw) To: Zhao Heming, linux-raid, song; +Cc: lidong.zhong, xni, neilb, colyli On 11/8/20 15:53, Zhao Heming wrote: > Test script (reproducible steps): > ``` > ssh root@node2 "mdadm -S --scan" > mdadm -S --scan > mdadm --zero-superblock /dev/sd{g,h,i} > for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ > count=20; done > > echo "mdadm create array" > mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh > echo "set up array on node2" > ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" > > sleep 5 > > mdadm --manage --add /dev/md0 /dev/sdi > mdadm --wait /dev/md0 > mdadm --grow --raid-devices=3 /dev/md0 > > mdadm /dev/md0 --fail /dev/sdg > mdadm /dev/md0 --remove /dev/sdg > #mdadm --wait /dev/md0 > mdadm --grow --raid-devices=2 /dev/md0 > ``` > What is the result after the above steps? Deadlock or something else. > node A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB, and > the disk size is more large the issue is more likely to trigger. (more > resync time, more easily trigger issues) > > There is a workaround: > when adding the --wait before second --grow, the issue 1 will disappear. > > Rootcause: > > In cluster env, every node can start resync job even if the resync cmd > doesn't execute on it. > e.g. > There are two node A & B. User executes "mdadm --grow" on A, sometime B > will start resync job not A. > > Current update_raid_disks() only check local recovery status, it's > incomplete. > > How to fix: > > The simple & clear solution is block the reshape action in initiator > side. When node is executing "--grow" and detecting there is ongoing > resyncing, it should immediately return & report error to user space. > > Signed-off-by: Zhao Heming <heming.zhao@suse.com> > --- > drivers/md/md.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 98bac4f304ae..74280e353b8f 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks) > return -EINVAL; > if (mddev->sync_thread || > test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || > + test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) || > mddev->reshape_position != MaxSector) > return -EBUSY; > > @@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev) > } > } > > - if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) > - update_raid_disks(mddev, le32_to_cpu(sb->raid_disks)); > + if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) { > + ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks)); > + if (ret) > + pr_warn("md: updating array disks failed. %d\n", ret); > + } > > /* > * Since mddev->delta_disks has already updated in update_raid_disks, > Generally, I think it is good. Thanks, Guoqing ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job 2020-11-10 6:38 ` Guoqing Jiang @ 2020-11-10 6:59 ` heming.zhao 0 siblings, 0 replies; 12+ messages in thread From: heming.zhao @ 2020-11-10 6:59 UTC (permalink / raw) To: Guoqing Jiang, linux-raid, song; +Cc: lidong.zhong, xni, neilb, colyli On 11/10/20 2:38 PM, Guoqing Jiang wrote: > > > On 11/8/20 15:53, Zhao Heming wrote: >> Test script (reproducible steps): >> ``` >> ssh root@node2 "mdadm -S --scan" >> mdadm -S --scan >> mdadm --zero-superblock /dev/sd{g,h,i} >> for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ >> count=20; done >> >> echo "mdadm create array" >> mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh >> echo "set up array on node2" >> ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" >> >> sleep 5 >> >> mdadm --manage --add /dev/md0 /dev/sdi >> mdadm --wait /dev/md0 >> mdadm --grow --raid-devices=3 /dev/md0 >> >> mdadm /dev/md0 --fail /dev/sdg >> mdadm /dev/md0 --remove /dev/sdg >> #mdadm --wait /dev/md0 >> mdadm --grow --raid-devices=2 /dev/md0 >> ``` >> > > What is the result after the above steps? Deadlock or something else. The result was writen in cover-letter, in the "*** error behavior ***". I will add the result as comments in V2 patch. > >> node A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB, and >> the disk size is more large the issue is more likely to trigger. (more >> resync time, more easily trigger issues) >> >> There is a workaround: >> when adding the --wait before second --grow, the issue 1 will disappear. >> >> ... ... >> + if (ret) >> + pr_warn("md: updating array disks failed. %d\n", ret); >> + } >> /* >> * Since mddev->delta_disks has already updated in update_raid_disks, >> > > Generally, I think it is good. > > Thanks, > Guoqing > ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 2/2] md/cluster: fix deadlock when doing reshape job 2020-11-08 14:52 [PATCH 0/2] md-cluster bugs fix Zhao Heming 2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming @ 2020-11-08 14:53 ` Zhao Heming 2020-11-09 2:02 ` heming.zhao 2020-11-10 6:36 ` Guoqing Jiang 2020-11-09 17:43 ` [PATCH 0/2] md-cluster bugs fix Song Liu 2 siblings, 2 replies; 12+ messages in thread From: Zhao Heming @ 2020-11-08 14:53 UTC (permalink / raw) To: linux-raid, song, guoqing.jiang Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli There is a similar deadlock in commit 0ba959774e93 ("md-cluster: use sync way to handle METADATA_UPDATED msg") This issue is very like 0ba959774e93, except <c>. 0ba959774e93 step c is "update sb", this issue is "mdadm --remove" ``` nodeA nodeB -------------------- -------------------- a. send METADATA_UPDATED held token_lockres:EX b. md_do_sync resync_info_update send RESYNCING + set MD_CLUSTER_SEND_LOCK + wait for holding token_lockres:EX c. mdadm /dev/md0 --remove /dev/sdg + held reconfig_mutex + send REMOVE + wait_event(MD_CLUSTER_SEND_LOCK) d. recv_daemon //METADATA_UPDATED from A process_metadata_update + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD) //this time, both return false forever. ``` Explaination: a> A send METADATA_UPDATED this will block all other nodes to send msg in cluster. b> B does sync jobs, so it will send RESYNCING at intervals. this will be block for holding token_lockres:EX lock. ``` md_do_sync raid1_sync_request resync_info_update sendmsg //with mddev_locked: false lock_comm + wait_event(cinfo->wait, | !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state)); + lock_token //for step<a>, block holding EX lock + dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); // blocking ``` c> B do "--remove" action and will send REMOVE. this will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1. ``` md_ioctl + mddev_lock(mddev) //holding reconfig_mutex, it influnces <d> + hot_remove_disk remove_disk sendmsg lock_comm wait_event(cinfo->wait, !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));//blocking ``` d> B recv METADATA_UPDATED msg which send from A in step <a>. this will be blocked by step <c>: holding mddev lock, it makes wait_event can't hold mddev lock. (btw, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.) ``` process_metadata_update wait_event(mddev->thread->wqueue, (got_lock = mddev_trylock(mddev)) || test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state)); ``` How to fix: There are two sides to fix (or break the dead loop): 1. on sending msg side, modify lock_comm, change it to return success/failed. This will make mdadm cmd return error when lock_comm is timeout. 2. on receiving msg side, process_metadata_update need to add error handling. currently, other msg types won't trigger error or error doesn't need to return sender. So only process_metadata_update need to modify. Ether of 1 & 2 can fix the hunging issue, but I prefer fix on both side. 
Signed-off-by: Zhao Heming <heming.zhao@suse.com> --- drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++--------------- 1 file changed, 27 insertions(+), 15 deletions(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 4aaf4820b6f6..d59a033e7589 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -523,19 +523,24 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg) } -static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg) +static int process_metadata_update(struct mddev *mddev, struct cluster_msg *msg) { - int got_lock = 0; + int got_lock = 0, rv; struct md_cluster_info *cinfo = mddev->cluster_info; mddev->good_device_nr = le32_to_cpu(msg->raid_slot); dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR); - wait_event(mddev->thread->wqueue, - (got_lock = mddev_trylock(mddev)) || - test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state)); - md_reload_sb(mddev, mddev->good_device_nr); - if (got_lock) - mddev_unlock(mddev); + rv = wait_event_timeout(mddev->thread->wqueue, + (got_lock = mddev_trylock(mddev)) || + test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state), + msecs_to_jiffies(5000)); + if (rv) { + md_reload_sb(mddev, mddev->good_device_nr); + if (got_lock) + mddev_unlock(mddev); + return 0; + } + return -1; } static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg) @@ -578,7 +583,7 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) return -1; switch (le32_to_cpu(msg->type)) { case METADATA_UPDATED: - process_metadata_update(mddev, msg); + ret = process_metadata_update(mddev, msg); break; case CHANGE_CAPACITY: set_capacity(mddev->gendisk, mddev->array_sectors); @@ -701,10 +706,15 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked) */ static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked) { - wait_event(cinfo->wait, - !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state)); + int rv; - return lock_token(cinfo, mddev_locked); + rv = wait_event_timeout(cinfo->wait, + !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state), + msecs_to_jiffies(5000)); + if (rv) + return lock_token(cinfo, mddev_locked); + else + return -1; } static void unlock_comm(struct md_cluster_info *cinfo) @@ -784,9 +794,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg, { int ret; - lock_comm(cinfo, mddev_locked); - ret = __sendmsg(cinfo, cmsg); - unlock_comm(cinfo); + ret = lock_comm(cinfo, mddev_locked); + if (!ret) { + ret = __sendmsg(cinfo, cmsg); + unlock_comm(cinfo); + } return ret; } -- 2.27.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
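
The send-side half of the fix follows a common pattern: a bounded lock attempt whose failure is propagated, and a send wrapper that only transmits (and only unlocks) when the lock was actually won. The userspace sketch below models wait_event_timeout() with pthread_cond_timedwait(); the names and the 5-second budget mirror the patch, but nothing here is the md-cluster implementation.

```c
/* Userspace analogue of the patched lock_comm()/sendmsg() pair: bounded wait
 * for the communication lock, with the error returned to the caller instead
 * of waiting forever. All identifiers are invented for the illustration. */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wait_cv = PTHREAD_COND_INITIALIZER;
static bool send_locked;                /* models MD_CLUSTER_SEND_LOCK */

/* 0 on success, -ETIMEDOUT when the lock cannot be taken within timeout_ms. */
static int lock_comm(int timeout_ms)
{
	struct timespec to;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &to);
	to.tv_sec  += timeout_ms / 1000;
	to.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (to.tv_nsec >= 1000000000L) {
		to.tv_sec++;
		to.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&lock);
	while (send_locked && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&wait_cv, &lock, &to);
	if (rc != ETIMEDOUT)
		send_locked = true;     /* we own the send path now */
	pthread_mutex_unlock(&lock);
	return rc == ETIMEDOUT ? -ETIMEDOUT : 0;
}

static void unlock_comm(void)
{
	pthread_mutex_lock(&lock);
	send_locked = false;
	pthread_cond_broadcast(&wait_cv);
	pthread_mutex_unlock(&lock);
}

static int do_sendmsg(const char *msg)
{
	printf("sent: %s\n", msg);      /* placeholder for the real __sendmsg() */
	return 0;
}

/* Mirrors the patched sendmsg(): send and unlock only if the lock was won. */
static int sendmsg(const char *msg)
{
	int ret = lock_comm(5000);

	if (!ret) {
		ret = do_sendmsg(msg);
		unlock_comm();
	}
	return ret;
}

int main(void)
{
	if (sendmsg("REMOVE") != 0)
		fprintf(stderr, "remove_disk: giving up instead of hanging\n");
	return 0;
}
```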
* Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job 2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming @ 2020-11-09 2:02 ` heming.zhao 2020-11-09 18:06 ` Song Liu 2020-11-10 6:36 ` Guoqing Jiang 1 sibling, 1 reply; 12+ messages in thread From: heming.zhao @ 2020-11-09 2:02 UTC (permalink / raw) To: linux-raid, song, guoqing.jiang; +Cc: lidong.zhong, xni, neilb, colyli Please note, I gave two solutions for this bug in cover-letter. This patch uses solution 2. For detail, please check cover-letter. Thank you. On 11/8/20 10:53 PM, Zhao Heming wrote: > There is a similar deadlock in commit 0ba959774e93 > ("md-cluster: use sync way to handle METADATA_UPDATED msg") > > This issue is very like 0ba959774e93, except <c>. > 0ba959774e93 step c is "update sb", this issue is "mdadm --remove" > > ``` > nodeA nodeB > -------------------- -------------------- > a. > send METADATA_UPDATED > held token_lockres:EX > b. > md_do_sync > resync_info_update > send RESYNCING > + set MD_CLUSTER_SEND_LOCK > + wait for holding token_lockres:EX > > c. > mdadm /dev/md0 --remove /dev/sdg > + held reconfig_mutex > + send REMOVE > + wait_event(MD_CLUSTER_SEND_LOCK) > > d. > recv_daemon //METADATA_UPDATED from A > process_metadata_update > + (mddev_trylock(mddev) || > MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD) > //this time, both return false forever. > ``` > > Explaination: > > a> > A send METADATA_UPDATED > this will block all other nodes to send msg in cluster. > > b> > B does sync jobs, so it will send RESYNCING at intervals. > this will be block for holding token_lockres:EX lock. > ``` > md_do_sync > raid1_sync_request > resync_info_update > sendmsg //with mddev_locked: false > lock_comm > + wait_event(cinfo->wait, > | !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state)); > + lock_token //for step<a>, block holding EX lock > + dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); // blocking > ``` > c> > B do "--remove" action and will send REMOVE. > this will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1. > ``` > md_ioctl > + mddev_lock(mddev) //holding reconfig_mutex, it influnces <d> > + hot_remove_disk > remove_disk > sendmsg > lock_comm > wait_event(cinfo->wait, > !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));//blocking > ``` > d> > B recv METADATA_UPDATED msg which send from A in step <a>. > this will be blocked by step <c>: holding mddev lock, it makes > wait_event can't hold mddev lock. (btw, > MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.) > ``` > process_metadata_update > wait_event(mddev->thread->wqueue, > (got_lock = mddev_trylock(mddev)) || > test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state)); > ``` > > How to fix: > > There are two sides to fix (or break the dead loop): > 1. on sending msg side, modify lock_comm, change it to return > success/failed. > This will make mdadm cmd return error when lock_comm is timeout. > 2. on receiving msg side, process_metadata_update need to add error > handling. > currently, other msg types won't trigger error or error doesn't need > to return sender. So only process_metadata_update need to modify. > > Ether of 1 & 2 can fix the hunging issue, but I prefer fix on both side. 
> > Signed-off-by: Zhao Heming <heming.zhao@suse.com> > --- > drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++--------------- > 1 file changed, 27 insertions(+), 15 deletions(-) > > diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c > index 4aaf4820b6f6..d59a033e7589 100644 > --- a/drivers/md/md-cluster.c > +++ b/drivers/md/md-cluster.c > @@ -523,19 +523,24 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg) > } > > > -static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg) > +static int process_metadata_update(struct mddev *mddev, struct cluster_msg *msg) > { > - int got_lock = 0; > + int got_lock = 0, rv; > struct md_cluster_info *cinfo = mddev->cluster_info; > mddev->good_device_nr = le32_to_cpu(msg->raid_slot); > > dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR); > - wait_event(mddev->thread->wqueue, > - (got_lock = mddev_trylock(mddev)) || > - test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state)); > - md_reload_sb(mddev, mddev->good_device_nr); > - if (got_lock) > - mddev_unlock(mddev); > + rv = wait_event_timeout(mddev->thread->wqueue, > + (got_lock = mddev_trylock(mddev)) || > + test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state), > + msecs_to_jiffies(5000)); > + if (rv) { > + md_reload_sb(mddev, mddev->good_device_nr); > + if (got_lock) > + mddev_unlock(mddev); > + return 0; > + } > + return -1; > } > > static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg) > @@ -578,7 +583,7 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) > return -1; > switch (le32_to_cpu(msg->type)) { > case METADATA_UPDATED: > - process_metadata_update(mddev, msg); > + ret = process_metadata_update(mddev, msg); > break; > case CHANGE_CAPACITY: > set_capacity(mddev->gendisk, mddev->array_sectors); > @@ -701,10 +706,15 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked) > */ > static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked) > { > - wait_event(cinfo->wait, > - !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state)); > + int rv; > > - return lock_token(cinfo, mddev_locked); > + rv = wait_event_timeout(cinfo->wait, > + !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state), > + msecs_to_jiffies(5000)); > + if (rv) > + return lock_token(cinfo, mddev_locked); > + else > + return -1; > } > > static void unlock_comm(struct md_cluster_info *cinfo) > @@ -784,9 +794,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg, > { > int ret; > > - lock_comm(cinfo, mddev_locked); > - ret = __sendmsg(cinfo, cmsg); > - unlock_comm(cinfo); > + ret = lock_comm(cinfo, mddev_locked); > + if (!ret) { > + ret = __sendmsg(cinfo, cmsg); > + unlock_comm(cinfo); > + } > return ret; > } > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job 2020-11-09 2:02 ` heming.zhao @ 2020-11-09 18:06 ` Song Liu 2020-11-10 2:24 ` heming.zhao 0 siblings, 1 reply; 12+ messages in thread From: Song Liu @ 2020-11-09 18:06 UTC (permalink / raw) To: heming.zhao Cc: linux-raid, Guoqing Jiang, lidong.zhong, Xiao Ni, NeilBrown, Coly Li On Sun, Nov 8, 2020 at 6:02 PM heming.zhao@suse.com <heming.zhao@suse.com> wrote: > > Please note, I gave two solutions for this bug in cover-letter. > This patch uses solution 2. For detail, please check cover-letter. > > Thank you. > [...] > > > > How to fix: > > > > There are two sides to fix (or break the dead loop): > > 1. on sending msg side, modify lock_comm, change it to return > > success/failed. > > This will make mdadm cmd return error when lock_comm is timeout. > > 2. on receiving msg side, process_metadata_update need to add error > > handling. > > currently, other msg types won't trigger error or error doesn't need > > to return sender. So only process_metadata_update need to modify. > > > > Ether of 1 & 2 can fix the hunging issue, but I prefer fix on both side. > > Similar comments on how to make the commit log easy to understand. Besides that, please split the change into two commits, for fix #1 and #2 respectively. Thanks, Song ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-09 18:06 ` Song Liu
@ 2020-11-10  2:24   ` heming.zhao
  0 siblings, 0 replies; 12+ messages in thread
From: heming.zhao @ 2020-11-10 2:24 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, Guoqing Jiang, lidong.zhong, Xiao Ni, NeilBrown, Coly Li

On 11/10/20 2:06 AM, Song Liu wrote:
> On Sun, Nov 8, 2020 at 6:02 PM heming.zhao@suse.com
> <heming.zhao@suse.com> wrote:
>>
>> Please note, I gave two solutions for this bug in cover-letter.
>> This patch uses solution 2. For detail, please check cover-letter.
>>
>> Thank you.
>>
>
> [...]
>
>>>
>>> How to fix:
>>>
>>> There are two sides to fix (or break the dead loop):
>>> 1. on sending msg side, modify lock_comm, change it to return
>>> success/failed.
>>> This will make mdadm cmd return error when lock_comm is timeout.
>>> 2. on receiving msg side, process_metadata_update need to add error
>>> handling.
>>> currently, other msg types won't trigger error or error doesn't need
>>> to return sender. So only process_metadata_update need to modify.
>>>
>>> Ether of 1 & 2 can fix the hunging issue, but I prefer fix on both side.
>>>
>
> Similar comments on how to make the commit log easy to understand.
> Besides that, please split the change into two commits, for fix #1 and #2
> respectively.
>

What I meant in my earlier note is that solution 2 itself has two
sub-solutions: fix the sending side or fix the receiving side (strictly
speaking there are three options: sending side, receiving side, or both).

Sending side: patch 2 touches sendmsg & lock_comm
(code flow: sendmsg => lock_comm).
Receiving side: patch 2 touches process_recvd_msg & process_metadata_update
(code flow: process_recvd_msg => process_metadata_update).

Breaking the wait on either side is enough to break the deadlock. In
patch 2, my fix covers both sides.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job 2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming 2020-11-09 2:02 ` heming.zhao @ 2020-11-10 6:36 ` Guoqing Jiang 2020-11-11 12:13 ` heming.zhao 1 sibling, 1 reply; 12+ messages in thread From: Guoqing Jiang @ 2020-11-10 6:36 UTC (permalink / raw) To: Zhao Heming, linux-raid, song; +Cc: lidong.zhong, xni, neilb, colyli On 11/8/20 15:53, Zhao Heming wrote: > There is a similar deadlock in commit 0ba959774e93 > ("md-cluster: use sync way to handle METADATA_UPDATED msg") > > This issue is very like 0ba959774e93, except <c>. > 0ba959774e93 step c is "update sb", this issue is "mdadm --remove" > > ``` > nodeA nodeB > -------------------- -------------------- > a. > send METADATA_UPDATED > held token_lockres:EX > b. > md_do_sync > resync_info_update > send RESYNCING > + set MD_CLUSTER_SEND_LOCK > + wait for holding token_lockres:EX > > c. > mdadm /dev/md0 --remove /dev/sdg > + held reconfig_mutex > + send REMOVE > + wait_event(MD_CLUSTER_SEND_LOCK) > > d. > recv_daemon //METADATA_UPDATED from A > process_metadata_update > + (mddev_trylock(mddev) || > MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD) > //this time, both return false forever. > ``` > > Explaination: > > a> > A send METADATA_UPDATED > this will block all other nodes to send msg in cluster. > > b> > B does sync jobs, so it will send RESYNCING at intervals. > this will be block for holding token_lockres:EX lock. > ``` > md_do_sync > raid1_sync_request > resync_info_update > sendmsg //with mddev_locked: false > lock_comm > + wait_event(cinfo->wait, > | !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state)); > + lock_token //for step<a>, block holding EX lock > + dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); // blocking > ``` > c> > B do "--remove" action and will send REMOVE. > this will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1. > ``` > md_ioctl > + mddev_lock(mddev) //holding reconfig_mutex, it influnces <d> > + hot_remove_disk > remove_disk > sendmsg > lock_comm > wait_event(cinfo->wait, > !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));//blocking > ``` > d> > B recv METADATA_UPDATED msg which send from A in step <a>. > this will be blocked by step <c>: holding mddev lock, it makes > wait_event can't hold mddev lock. (btw, > MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.) > ``` > process_metadata_update > wait_event(mddev->thread->wqueue, > (got_lock = mddev_trylock(mddev)) || > test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state)); > ``` > > How to fix: > > There are two sides to fix (or break the dead loop): > 1. on sending msg side, modify lock_comm, change it to return > success/failed. > This will make mdadm cmd return error when lock_comm is timeout. > 2. on receiving msg side, process_metadata_update need to add error > handling. > currently, other msg types won't trigger error or error doesn't need > to return sender. So only process_metadata_update need to modify. > > Ether of 1 & 2 can fix the hunging issue, but I prefer fix on both side. It's better if you put the deadlock information (such as the 'D' task stack) to the header. 
> > Signed-off-by: Zhao Heming <heming.zhao@suse.com> > --- > drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++--------------- > 1 file changed, 27 insertions(+), 15 deletions(-) > > diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c > index 4aaf4820b6f6..d59a033e7589 100644 > --- a/drivers/md/md-cluster.c > +++ b/drivers/md/md-cluster.c > @@ -523,19 +523,24 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg) > } > > > -static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg) > +static int process_metadata_update(struct mddev *mddev, struct cluster_msg *msg) > { > - int got_lock = 0; > + int got_lock = 0, rv; > struct md_cluster_info *cinfo = mddev->cluster_info; > mddev->good_device_nr = le32_to_cpu(msg->raid_slot); > > dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR); > - wait_event(mddev->thread->wqueue, > - (got_lock = mddev_trylock(mddev)) || > - test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state)); > - md_reload_sb(mddev, mddev->good_device_nr); > - if (got_lock) > - mddev_unlock(mddev); > + rv = wait_event_timeout(mddev->thread->wqueue, > + (got_lock = mddev_trylock(mddev)) || > + test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state), > + msecs_to_jiffies(5000)); > + if (rv) { > + md_reload_sb(mddev, mddev->good_device_nr); > + if (got_lock) > + mddev_unlock(mddev); > + return 0; > + } > + return -1; > } > > static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg) > @@ -578,7 +583,7 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) > return -1; > switch (le32_to_cpu(msg->type)) { > case METADATA_UPDATED: > - process_metadata_update(mddev, msg); > + ret = process_metadata_update(mddev, msg); > break; > case CHANGE_CAPACITY: > set_capacity(mddev->gendisk, mddev->array_sectors); > @@ -701,10 +706,15 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked) > */ > static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked) > { > - wait_event(cinfo->wait, > - !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state)); > + int rv; > > - return lock_token(cinfo, mddev_locked); > + rv = wait_event_timeout(cinfo->wait, > + !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state), > + msecs_to_jiffies(5000)); > + if (rv) > + return lock_token(cinfo, mddev_locked); > + else > + return -1; > } > > static void unlock_comm(struct md_cluster_info *cinfo) > @@ -784,9 +794,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg, > { > int ret; > > - lock_comm(cinfo, mddev_locked); > - ret = __sendmsg(cinfo, cmsg); > - unlock_comm(cinfo); > + ret = lock_comm(cinfo, mddev_locked); > + if (!ret) { > + ret = __sendmsg(cinfo, cmsg); > + unlock_comm(cinfo); > + } > return ret; > } > What happens if the cluster has latency issue? Let's say, node normally can't get lock during the 5s window. IOW, is the timeout value justified well? Thanks, Guoqing ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job
  2020-11-10  6:36 ` Guoqing Jiang
@ 2020-11-11 12:13   ` heming.zhao
  0 siblings, 0 replies; 12+ messages in thread
From: heming.zhao @ 2020-11-11 12:13 UTC (permalink / raw)
  To: Guoqing Jiang, linux-raid, song; +Cc: lidong.zhong, xni, neilb, colyli

On 11/10/20 2:36 PM, Guoqing Jiang wrote:
>
>
> On 11/8/20 15:53, Zhao Heming wrote:
>> There is a similar deadlock in commit 0ba959774e93
>> ("md-cluster: use sync way to handle METADATA_UPDATED msg")
>> ... ...
>> static void unlock_comm(struct md_cluster_info *cinfo)
>> @@ -784,9 +794,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
>> {
>> int ret;
>> - lock_comm(cinfo, mddev_locked);
>> - ret = __sendmsg(cinfo, cmsg);
>> - unlock_comm(cinfo);
>> + ret = lock_comm(cinfo, mddev_locked);
>> + if (!ret) {
>> + ret = __sendmsg(cinfo, cmsg);
>> + unlock_comm(cinfo);
>> + }
>> return ret;
>> }
>
> What happens if the cluster has latency issue? Let's say, node normally can't get lock during the 5s window. IOW, is the timeout value justified well?

The 5 seconds is an arbitrary value I picked while writing the code.
Below is more background on the patch.

The core of patch 2 is the switch from wait_event to wait_event_timeout.
After applying the patch, the logic of several functions also changes.
I will analyze the two sides: send & receive.

*** send side ***

Before the patch, sendmsg could only fail when __sendmsg returned an
error, i.e. when the dlm layer failed to convert the lock; the return
value of lock_comm was never checked. After the patch there is a new
failure case: the 5s timeout. The affected callers are:

resync_bitmap - returns void; it only prints an error message on failure.
update_bitmap_size - returns an error; the caller already handles it.
resync_info_update - returns an error; the caller ignores it. Since this
  is only an informational message, ignoring the failure is safe.
remove_disk - returns an error; only part of the callers handle it. I
  forgot to modify hot_remove_disk() and will add that in the v2 patch,
  so in v2 all callers will handle the error case.
gather_bitmaps - returns an error; the caller already handles it.

So the only function that ignores the return value is resync_bitmap,
which is used in the leave path. If the message is not sent out (5s
timeout), the other nodes won't learn about the failure event via
BITMAP_NEEDS_SYNC. But if my understanding is correct, even without
BITMAP_NEEDS_SYNC there is another API, recover_slot(), which is
triggered by dlm and does the same job. So all sending-side functions
are safe.

*** receive side ***

process_metadata_update: with the patch it skips the metadata update
when the 5s timeout hits. Here I change my mind: a 5s timeout on the
receive side is wrong. The receive side should do as much as possible
to handle incoming messages. With a timeout branch there, an
inconsistency can be triggered: e.g. A does --fail and sends
METADATA_UPDATED to B; B receives the msg but hits the 5s timeout and
never updates the in-kernel md info, leaving a gap between the kernel's
view and the on-disk metadata.

^ permalink raw reply	[flat|nested] 12+ messages in thread
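
On the latency question above: if clusters where a DLM round-trip can regularly exceed 5 seconds turn out to matter, one possible refinement (not part of this patch) would be to retry the bounded wait a few times before reporting failure, so a single slow lock conversion does not immediately fail the user's command. A minimal sketch of that idea, with the lock primitive stubbed out so the file compiles on its own:

```c
/* Sketch only: retry a bounded lock attempt before giving up. lock_comm_timed()
 * stands for any "wait up to N ms, return 0 or -ETIMEDOUT" primitive such as
 * the wait_event_timeout()-based lock_comm() in this patch; the stub below
 * always times out, as if the cluster were slow. */
#include <errno.h>
#include <stdio.h>

static int lock_comm_timed(int timeout_ms)
{
	(void)timeout_ms;
	return -ETIMEDOUT;              /* stub: pretend the DLM is congested */
}

static int lock_comm_retry(int timeout_ms, int attempts)
{
	int ret = -ETIMEDOUT;

	for (int i = 0; i < attempts && ret == -ETIMEDOUT; i++) {
		ret = lock_comm_timed(timeout_ms);
		if (ret == -ETIMEDOUT)
			fprintf(stderr, "still waiting for comm lock (%d/%d)\n",
				i + 1, attempts);
	}
	return ret;
}

int main(void)
{
	/* e.g. three 5-second attempts before the command finally errors out */
	return lock_comm_retry(5000, 3) ? 1 : 0;
}
```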
* Re: [PATCH 0/2] md-cluster bugs fix 2020-11-08 14:52 [PATCH 0/2] md-cluster bugs fix Zhao Heming 2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming 2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming @ 2020-11-09 17:43 ` Song Liu 2 siblings, 0 replies; 12+ messages in thread From: Song Liu @ 2020-11-09 17:43 UTC (permalink / raw) To: Zhao Heming, Guoqing Jiang Cc: linux-raid, lidong.zhong, Xiao Ni, NeilBrown, Coly Li Hi Guoqing, On Sun, Nov 8, 2020 at 6:53 AM Zhao Heming <heming.zhao@suse.com> wrote: > [...] Could you please help review the set? It looks good to me. But it will be great to have your review. Thanks, Song > - update sb (see commit 0ba959774) > - mdadm --remove (issue 2) > we should break the deadlock in key code (wait_event => wait_event_timeout). > > ------- > v1: > - create patch > > ------- > Zhao Heming (2): > md/cluster: reshape should returns error when remote doing resyncing > job > md/cluster: fix deadlock when doing reshape job > > drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++--------------- > drivers/md/md.c | 8 ++++++-- > 2 files changed, 33 insertions(+), 17 deletions(-) > > -- > 2.27.0 > ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2020-11-11 12:15 UTC | newest]

Thread overview: 12+ messages
2020-11-08 14:52 [PATCH 0/2] md-cluster bugs fix Zhao Heming
2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
2020-11-09 18:01   ` Song Liu
2020-11-10  6:38   ` Guoqing Jiang
2020-11-10  6:59     ` heming.zhao
2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
2020-11-09  2:02   ` heming.zhao
2020-11-09 18:06     ` Song Liu
2020-11-10  2:24       ` heming.zhao
2020-11-10  6:36   ` Guoqing Jiang
2020-11-11 12:13     ` heming.zhao
2020-11-09 17:43 ` [PATCH 0/2] md-cluster bugs fix Song Liu