From: Zhao Heming <heming.zhao@suse.com>
To: linux-raid@vger.kernel.org, song@kernel.org,
guoqing.jiang@cloud.ionos.com
Cc: Zhao Heming <heming.zhao@suse.com>,
lidong.zhong@suse.com, xni@redhat.com, neilb@suse.de,
colyli@suse.de
Subject: [PATCH 0/2] md-cluster bugs fix
Date: Sun, 8 Nov 2020 22:52:59 +0800
Message-ID: <1604847181-22086-1-git-send-email-heming.zhao@suse.com>
Hello List,
I filed two patches to fix two different md-cluster bugs.
For easy reference, let's call them issue 1 (related to patch 1)
and issue 2 (related to patch 2).
*** Test env ***
Nodes A & B share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB. The larger
the disks, the more likely the issues are to trigger, because a longer
resync gives the races more time to hit.
*** Test script ***
Issues 1 & 2 can both be triggered with the same test script:
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
#mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```
There is a workaround:
uncommenting the "mdadm --wait" before the second --grow makes issue 1
disappear.
*** error behavior ***
issue 1: every command in the script finishes, but the array status is wrong.
issue 2: the script hangs when executing "mdadm --remove".
Array status for issue 1 (partial output of "mdadm -D /dev/md0"):
<case 1> : normal test result.
```
Number   Major   Minor   RaidDevice   State
   1       8      112        0        active sync   /dev/sdh
   2       8      128        1        active sync   /dev/sdi
```
<case 2> : the failed ("faulty") device record still exists in the on-disk metadata.
```
Number   Major   Minor   RaidDevice   State
   -       0        0        0        removed
   1       8      112        1        active sync   /dev/sdh
   2       8      128        2        active sync   /dev/sdi
   0       8       96        -        faulty        /dev/sdg
```
<case 3> : the "--remove" record still exists in the on-disk metadata.
```
Number   Major   Minor   RaidDevice   State
   -       0        0        0        removed
   1       8      112        1        active sync   /dev/sdh
   2       8      128        2        active sync   /dev/sdi
```
Status for issue 2, showing the hang:
```
node1 # ps axj | grep mdadm
1 5423 5227 2231 ? -1 D 0 0:00 mdadm /dev/md0 --remove /dev/sdg
node1 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdh[1] sdg[0](F)
1046528 blocks super 1.2 [2/1] [_U]
[>....................] recovery = 0.0% (1/1046528)
finish=354.0min speed=47K/sec
bitmap: 1/1 pages [4KB], 1024KB chunk
unused devices: <none>
node2 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdg[0](F) sdh[1]
1046528 blocks super 1.2 [2/1] [_U]
bitmap: 1/1 pages [4KB], 1024KB chunk
unused devices: <none>
echo t > /proc/sysrq-trigger
md0_cluster_rec D 0 5329 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? _cond_resched+0x2d/0x40
? schedule+0x4a/0xb0
? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
? wait_woken+0x80/0x80
? process_recvd_msg+0x113/0x1d0 [md_cluster]
? recv_daemon+0x9e/0x120 [md_cluster]
? md_thread+0x94/0x160 [md_mod]
? wait_woken+0x80/0x80
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
mdadm D 0 5423 1 0x00004004
Call Trace:
__schedule+0x1f6/0x560
? __schedule+0x1fe/0x560
? schedule+0x4a/0xb0
? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
? wait_woken+0x80/0x80
? remove_disk+0x4f/0x90 [md_cluster]
? hot_remove_disk+0xb1/0x1b0 [md_mod]
? md_ioctl+0x50c/0xba0 [md_mod]
? wait_woken+0x80/0x80
? blkdev_ioctl+0xa2/0x2a0
? block_ioctl+0x39/0x40
? ksys_ioctl+0x82/0xc0
? __x64_sys_ioctl+0x16/0x20
? do_syscall_64+0x5f/0x150
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
md0_resync D 0 5425 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? schedule+0x4a/0xb0
? dlm_lock_sync+0xa1/0xd0 [md_cluster]
? wait_woken+0x80/0x80
? lock_token+0x2d/0x90 [md_cluster]
? resync_info_update+0x95/0x100 [md_cluster]
? raid1_sync_request+0x7d3/0xa40 [raid1]
? md_do_sync.cold+0x737/0xc8f [md_mod]
? md_thread+0x94/0x160 [md_mod]
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
```
*** analysis ***
<issue 1>
In a cluster environment, any node may run the resync job, even if the
command that triggered it was not executed on that node.
e.g.
With two nodes A & B: the user runs "mdadm --grow" on A, but sometimes B,
not A, starts the resync job.
problem:
update_raid_disks() currently only checks the local recovery status, which
is incomplete.
issue scenario (a sketch of the fix idea follows the diagram):
```
node A                                 node B
------------------------------        ---------------
mdadm --grow -n 3 md0
+ raid1_reshape
  mddev->raid_disks: 2=>3
                                       start resync job; it blocks A's resync job
                                       mddev->raid_disks: 2=>3
mdadm md0 --fail sdg
+ update disk: array sb & bitmap sb
+ send METADATA_UPDATED
(resync job blocked)
                                       (B continues the "--grow -n 3" resync job)
                                       recv METADATA_UPDATED
                                       + read disk metadata
                                       + raid1_error
                                       + set MD_RECOVERY_INTR to break resync
                                       ... ...
                                       md_check_recovery
                                       + remove_and_add_spares returns 1
                                       + set MD_RECOVERY_RECOVER, later restart resync
mdadm md0 --remove sdg
+ md_cluster_ops->remove_disk
| + send REMOVE
+ md_kick_rdev_from_array
+ update disk: array sb & bitmap sb
(resync job blocked)
                                       (B continues the "--grow -n 3" resync job)
                                       recv REMOVE
                                       + process_remove_disk doesn't set mddev->sb_flags,
                                         so it doesn't update the disk sb & bitmap sb.
                                       ......
                                       md_check_recovery
                                       + md_kick_rdev_from_array
mdadm --grow -n 2 md0
+ raid1_reshape
| mddev->raid_disks: 3=>2
+ send METADATA_UPDATED
                                       (B continues the "--grow -n 3" resync job)
                                       recv METADATA_UPDATED
                                       + check_sb_changes
                                         update_raid_disks returns -EBUSY
                                         update fails for mddev->raid_disks: 3=>2
                                       (B never successfully updates mddev->raid_disks: 3=>2)
                                       when B finishes the "--grow -n 3" resync job:
                                       + uses mddev->raid_disks:3 to update array sb & bitmap sb
                                       + send METADATA_UPDATED
recv METADATA_UPDATED
+ read wrong raid_disks to update kernel data
```
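Patch 1's idea, roughly, is to make update_raid_disks() refuse the change
while any node is resyncing. Below is a minimal sketch of that check (not
the actual diff; reshape_blocked() is a hypothetical helper for
illustration, while MD_RESYNCING_REMOTE is the existing recovery flag that
md-cluster sets when it receives a RESYNCING message from another node):
```
/*
 * Minimal sketch of the patch 1 idea, not the exact diff: refuse to
 * change raid_disks while *any* node, local or remote, has a
 * resync/recovery in flight. reshape_blocked() is a hypothetical
 * helper used only for this illustration.
 */
static bool reshape_blocked(struct mddev *mddev)
{
	return mddev->sync_thread ||
	       test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
	       /* set by md-cluster's recv path on a RESYNCING msg */
	       test_bit(MD_RESYNCING_REMOTE, &mddev->recovery);
}

/* in update_raid_disks(): */
	if (reshape_blocked(mddev))
		return -EBUSY;
```
With this, the second "--grow -n 2" on node A fails immediately while B is
still resyncing, instead of silently diverging from B's view of raid_disks.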
<issue 2>
First, a similar deadlock was fixed by commit 0ba959774e93911caff596de6391f085fb640ac4.
Let me explain commit 0ba959774e first.
```
<origin scenario>
nodeA                                nodeB
--------------------                 --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                                     b.
                                     md_do_sync
                                      resync_info_update
                                       send RESYNCING
                                       + set MD_CLUSTER_SEND_LOCK
                                       + wait for holding token_lockres:EX
                                     c.
                                     update sb
                                     + held reconfig_mutex
                                     + metadata_update_start
                                       + wait_event(MD_CLUSTER_SEND_LOCK) //blocked by <b>
                                     d.
                                     recv_daemon //METADATA_UPDATED from A
                                     process_metadata_update
                                     + mddev_trylock(mddev) returns false //blocked by <c>

<after introducing MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD>
nodeA                                nodeB
--------------------                 --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                                     b.
                                     md_do_sync
                                      resync_info_update
                                       send RESYNCING
                                       + set MD_CLUSTER_SEND_LOCK
                                       + wait for holding token_lockres:EX
                                     c.
                                     update sb
                                     + held reconfig_mutex
                                     + metadata_update_start
                                       + set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD //to unblock <d>
                                       + wait_event(MD_CLUSTER_SEND_LOCK)
                                     d.
                                     recv_daemon //METADATA_UPDATED from A
                                     process_metadata_update
                                     + (mddev_trylock(mddev) ||
                                        MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                       //no longer blocks, breaking the deadlock
```
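For reference, the wait in step <d> looks roughly like this in
process_metadata_update() (paraphrased from md-cluster.c; exact code
varies across kernel versions):
```
/*
 * Paraphrased from process_metadata_update() in md-cluster.c (exact
 * code varies across kernel versions). The recv daemon either takes
 * reconfig_mutex itself, or proceeds because a sender on this node
 * flagged that it already holds the mutex on the recv daemon's behalf.
 */
int got_lock = 0;

wait_event(mddev->thread->wqueue,
	   (got_lock = mddev_trylock(mddev)) ||
	   test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
md_reload_sb(mddev, mddev->good_device_nr);
if (got_lock)
	mddev_unlock(mddev);
```
Note this only helps when the sender that blocks the recv daemon also sets
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD before waiting, which is exactly what
goes wrong below.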
Issue 2 is very similar to 0ba959774e, except for step <c>.
```
nodeA                                nodeB
--------------------                 --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                                     b.
                                     md_do_sync
                                      resync_info_update
                                       send RESYNCING
                                       + set MD_CLUSTER_SEND_LOCK
                                       + wait for holding token_lockres:EX
                                     c.
                                     mdadm /dev/md0 --remove /dev/sdg
                                     + held reconfig_mutex
                                     + send REMOVE
                                       + wait_event(MD_CLUSTER_SEND_LOCK) //blocked by <b>
                                     d.
                                     recv_daemon //METADATA_UPDATED from A
                                     process_metadata_update
                                     + (mddev_trylock(mddev) ||
                                        MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                       //this time, both stay false forever
```
Commit 0ba959774e93 uses MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD to break the
deadlock, but it doesn't help with issue 2, because the remove path never
sets that bit before waiting.
md_cluster_ops->remove_disk is called from:
- state_store(), which doesn't hold reconfig_mutex
- hot_remove_disk(), which must hold reconfig_mutex
There are two methods to fix this:
1. Like commit 0ba959774e: hold reconfig_mutex in state_store(), and set
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in remove_disk().
2. Like patch 2: change wait_event to wait_event_timeout in lock_comm() &
process_metadata_update().
There are a few reasons I prefer method 2:
- I am not very familiar with all the scenarios in state_store(), so I am
  not sure whether holding reconfig_mutex there could introduce a new bug.
- It looks like any sendmsg path could trigger issue 2. So far we have
  found two cases (there may be others):
  - updating the sb (see commit 0ba959774e)
  - mdadm --remove (issue 2)
So we should break the deadlock at the key code (wait_event =>
wait_event_timeout). A sketch of this idea follows.
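To illustrate method 2, here is a minimal sketch of what the lock_comm()
side could look like (the timeout value, return convention, and exact
signature here are illustrative assumptions, not the real diff; see
patch 2 for the actual change):
```
/*
 * Sketch of method 2 for lock_comm() (illustrative only; the 5-second
 * timeout and the error handling are assumptions for this sketch).
 * Instead of sleeping forever on MD_CLUSTER_SEND_LOCK, give up after a
 * while so the caller can fail gracefully instead of deadlocking
 * against the recv daemon.
 */
static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
{
	long rv;

	rv = wait_event_timeout(cinfo->wait,
			!test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state),
			msecs_to_jiffies(5000));
	if (!rv)
		return -EBUSY;	/* caller unwinds instead of hanging */
	return lock_token(cinfo, mddev_locked);
}
```
The same wait_event => wait_event_timeout treatment applies to the wait in
process_metadata_update(), so neither side can block the other forever.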
-------
v1:
- create patch
-------
Zhao Heming (2):
md/cluster: reshape should returns error when remote doing resyncing
job
md/cluster: fix deadlock when doing reshape job
drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
drivers/md/md.c | 8 ++++++--
2 files changed, 33 insertions(+), 17 deletions(-)
--
2.27.0