From: Greg KH <gregkh@linuxfoundation.org>
To: Zhao Heming <heming.zhao@suse.com>
Cc: linux-raid@vger.kernel.org, song@kernel.org,
guoqing.jiang@cloud.ionos.com, xni@redhat.com,
lidong.zhong@suse.com, neilb@suse.de, colyli@suse.de,
stable@vger.kernel.org
Subject: Re: [PATCH v4 2/2] md/cluster: fix deadlock when node is doing resync job
Date: Wed, 18 Nov 2020 18:14:32 +0100
Message-ID: <X7VWeJfr3Jh7N2KP@kroah.com>
In-Reply-To: <1605717954-20173-3-git-send-email-heming.zhao@suse.com>
On Thu, Nov 19, 2020 at 12:45:54AM +0800, Zhao Heming wrote:
> md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can send messages exclusively.
> While sending a message, a node can concurrently receive a message from another node.
> When a node is doing a resync job, grabbing token_lockres:EX may trigger a deadlock:
> ```
> nodeA                    nodeB
> --------------------     --------------------
> a.
> send METADATA_UPDATED
> held token_lockres:EX
>                          b.
>                          md_do_sync
>                           resync_info_update
>                             send RESYNCING
>                             + set MD_CLUSTER_SEND_LOCK
>                             + wait for holding token_lockres:EX
>
>                          c.
>                          mdadm /dev/md0 --remove /dev/sdg
>                           + held reconfig_mutex
>                           + send REMOVE
>                              + wait_event(MD_CLUSTER_SEND_LOCK)
>
>                          d.
>                          recv_daemon //METADATA_UPDATED from A
>                           process_metadata_update
>                            + (mddev_trylock(mddev) ||
>                               MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
>                              //this time, both return false forever
> ```
> Explanation:
> a. A sends METADATA_UPDATED.
> This blocks other nodes from sending messages.
>
> b. B does a sync job, which sends RESYNCING at intervals.
> This blocks while waiting for the token_lockres:EX lock.
>
> c. B runs "mdadm --remove", which sends REMOVE.
> This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
>
> d. B receives the METADATA_UPDATED msg, which was sent from A in step <a>.
> This is blocked by step <c>: the mddev lock is already held, so the
> wait_event can't take the mddev lock. (btw,
> MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this scenario.)
>
> There is a similar deadlock in commit 0ba959774e93
> ("md-cluster: use sync way to handle METADATA_UPDATED msg").
> In that commit, step c is "update sb". In this patch, step c is
> "mdadm --remove".
>
> To fix this issue, we can follow the approach of the function
> metadata_update_start, which performs the same lock_token action.
> lock_comm can use the same steps to avoid the deadlock, by moving
> MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
> This slightly enlarges the window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
> but it is safe and breaks the deadlock.
>
> Repro steps (I triggered it only 3 times in hundreds of test runs):
>
> Two nodes share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB.
> ```
> ssh root@node2 "mdadm -S --scan"
> mdadm -S --scan
> for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
> count=20; done
>
> mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
> --bitmap-chunk=1M
> ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
>
> sleep 5
>
> mkfs.xfs /dev/md0
> mdadm --manage --add /dev/md0 /dev/sdi
> mdadm --wait /dev/md0
> mdadm --grow --raid-devices=3 /dev/md0
>
> mdadm /dev/md0 --fail /dev/sdg
> mdadm /dev/md0 --remove /dev/sdg
> mdadm --grow --raid-devices=2 /dev/md0
> ```
>
> The test script hangs when executing "mdadm --remove".
>
> ```
> # dump stacks by "echo t > /proc/sysrq-trigger"
> md0_cluster_rec D 0 5329 2 0x80004000
> Call Trace:
> __schedule+0x1f6/0x560
> ? _cond_resched+0x2d/0x40
> ? schedule+0x4a/0xb0
> ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
> ? wait_woken+0x80/0x80
> ? process_recvd_msg+0x113/0x1d0 [md_cluster]
> ? recv_daemon+0x9e/0x120 [md_cluster]
> ? md_thread+0x94/0x160 [md_mod]
> ? wait_woken+0x80/0x80
> ? md_congested+0x30/0x30 [md_mod]
> ? kthread+0x115/0x140
> ? __kthread_bind_mask+0x60/0x60
> ? ret_from_fork+0x1f/0x40
>
> mdadm D 0 5423 1 0x00004004
> Call Trace:
> __schedule+0x1f6/0x560
> ? __schedule+0x1fe/0x560
> ? schedule+0x4a/0xb0
> ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
> ? wait_woken+0x80/0x80
> ? remove_disk+0x4f/0x90 [md_cluster]
> ? hot_remove_disk+0xb1/0x1b0 [md_mod]
> ? md_ioctl+0x50c/0xba0 [md_mod]
> ? wait_woken+0x80/0x80
> ? blkdev_ioctl+0xa2/0x2a0
> ? block_ioctl+0x39/0x40
> ? ksys_ioctl+0x82/0xc0
> ? __x64_sys_ioctl+0x16/0x20
> ? do_syscall_64+0x5f/0x150
> ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> md0_resync D 0 5425 2 0x80004000
> Call Trace:
> __schedule+0x1f6/0x560
> ? schedule+0x4a/0xb0
> ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
> ? wait_woken+0x80/0x80
> ? lock_token+0x2d/0x90 [md_cluster]
> ? resync_info_update+0x95/0x100 [md_cluster]
> ? raid1_sync_request+0x7d3/0xa40 [raid1]
> ? md_do_sync.cold+0x737/0xc8f [md_mod]
> ? md_thread+0x94/0x160 [md_mod]
> ? md_congested+0x30/0x30 [md_mod]
> ? kthread+0x115/0x140
> ? __kthread_bind_mask+0x60/0x60
> ? ret_from_fork+0x1f/0x40
> ```
>
> Finally, thanks to Xiao for the solution.
>
> Signed-off-by: Zhao Heming <heming.zhao@suse.com>
> Suggested-by: Xiao Ni <xni@redhat.com>
> Reviewed-by: Xiao Ni <xni@redhat.com>
> ---
> drivers/md/md-cluster.c | 69 +++++++++++++++++++++++------------------
> drivers/md/md.c | 6 ++--
> 2 files changed, 43 insertions(+), 32 deletions(-)
<formletter>
This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.
</formletter>
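
For readers following the quoted deadlock description, steps b through d on node B can be replayed in a small userspace model. This is only a sketch, not the kernel code: `struct node_model`, `recv_can_proceed`, `model_lock_comm_enter` and `node_b_deadlocks` are hypothetical names, the booleans merely stand in for reconfig_mutex, MD_CLUSTER_SEND_LOCK and MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, and it assumes the patched lock_comm sets the flag before it waits on the send lock.

```c
#include <stdbool.h>

/* Userspace model only; fields mirror the locks and flags named in the
 * quoted commit message, not the real kernel data structures. */
struct node_model {
    bool reconfig_mutex_held;     /* mddev lock taken by "mdadm --remove" */
    bool send_lock;               /* stands in for MD_CLUSTER_SEND_LOCK */
    bool holding_mutex_for_recvd; /* MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD */
};

/* Step d: the receive path may proceed if it can take the mddev lock
 * itself, or if the lock holder has flagged that it holds the lock on
 * the receiver's behalf. */
static bool recv_can_proceed(const struct node_model *n)
{
    bool trylock_ok = !n->reconfig_mutex_held;

    return trylock_ok || n->holding_mutex_for_recvd;
}

/* Hypothetical stand-in for entering lock_comm(). With set_flag_early
 * (the assumed behaviour after the patch), a caller that already holds
 * the mddev lock raises the flag before waiting on the send lock. */
static void model_lock_comm_enter(struct node_model *n, bool set_flag_early)
{
    if (set_flag_early && n->reconfig_mutex_held)
        n->holding_mutex_for_recvd = true;
    /* The real code would now sleep until send_lock is cleared. */
}

/* Replays steps b to d on node B and reports whether the receive
 * thread is stuck, which is what keeps the whole cycle blocked. */
static bool node_b_deadlocks(bool set_flag_early)
{
    struct node_model b = { false, false, false };

    b.send_lock = true;           /* step b: resync thread owns the send lock */
    b.reconfig_mutex_held = true; /* step c: mdadm --remove holds the mddev lock */
    model_lock_comm_enter(&b, set_flag_early);

    return !recv_can_proceed(&b); /* step d: can METADATA_UPDATED be handled? */
}
```

In this model `node_b_deadlocks(false)` reports a stuck receiver and `node_b_deadlocks(true)` does not. Once the receiver can handle METADATA_UPDATED, node A can finish its send and release token_lockres:EX, which in turn lets the resync thread on B make progress.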