From: Zhao Heming <heming.zhao@suse.com>
To: linux-raid@vger.kernel.org, song@kernel.org,
guoqing.jiang@cloud.ionos.com
Cc: Zhao Heming <heming.zhao@suse.com>,
lidong.zhong@suse.com, xni@redhat.com, neilb@suse.de,
colyli@suse.de
Subject: [PATCH 1/2] md/cluster: reshape should return error when remote is doing resync job
Date: Thu, 5 Nov 2020 21:11:27 +0800 [thread overview]
Message-ID: <1604581888-27659-1-git-send-email-heming.zhao@suse.com> (raw)
Test script (steps to reproduce):
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
mdadm --zero-superblock /dev/sd{g,h,i}
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
#mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```
sdg/sdh/sdi are 1GB iSCSI LUNs. The larger the shared disks are, the more
likely the issue is to trigger.
There is a workaround: adding a --wait before the second --grow makes this
bug disappear.
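Concretely, the workaround means uncommenting the --wait line in the script
above, so the tail of the script becomes:
```
mdadm /dev/md0 --remove /dev/sdg
mdadm --wait /dev/md0       # let the (possibly remote) resync finish first
mdadm --grow --raid-devices=2 /dev/md0
```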
Running the script can produce several different results
(output of: mdadm -D /dev/md0):
<case 1> : normal test result.
```
Number Major Minor RaidDevice State
1 8 112 0 active sync /dev/sdh
2 8 128 1 active sync /dev/sdi
```
<case 2> : the failed ("faulty") device still exists in the on-disk metadata.
```
Number Major Minor RaidDevice State
- 0 0 0 removed
1 8 112 1 active sync /dev/sdh
2 8 128 2 active sync /dev/sdi
0 8 96 - faulty /dev/sdg
```
<case 3> : the removed device's slot still exists in the on-disk metadata.
```
Number Major Minor RaidDevice State
- 0 0 0 removed
1 8 112 1 active sync /dev/sdh
2 8 128 2 active sync /dev/sdi
```
Root cause:
In an md-cluster environment, there is no guarantee that the reshape action
(triggered by --grow) takes place on the current node. Any node in the
cluster can start the resync action, and that action may be triggered by
another node's --grow command. md-cluster only uses resync_lockres to make
sure a single node does the resync job.
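As a rough, conceptual sketch of that model (not the exact md-cluster code;
the helper name below is made up, while resync_lockres, dlm_lock_sync() and
DLM_LOCK_EX come from drivers/md/md-cluster.c):
```
/*
 * Conceptual sketch only: md-cluster funnels resync through one exclusive
 * DLM lock resource, so at any moment a single node "owns" the resync, no
 * matter which node issued the --grow that triggered it.
 */
static int become_resync_owner(struct mddev *mddev)	/* hypothetical name */
{
	struct md_cluster_info *cinfo = mddev->cluster_info;

	/* blocks until no other node holds the cluster-wide resync lock */
	return dlm_lock_sync(cinfo->resync_lockres, DLM_LOCK_EX);
}
```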
The key related code (with my patch) is:
```
if (mddev->sync_thread ||
    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+   test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
    mddev->reshape_position != MaxSector)
	return -EBUSY;
```
Without the MD_RESYNCING_REMOTE test_bit, the 'if' block above only handles
local recovery/resync events.
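For context, MD_RESYNCING_REMOTE is set on a node when it learns, via a
cluster RESYNCING message, that another node is running the resync. A
simplified, paraphrased sketch of that receive path (not verbatim md-cluster
code; the function name here is made up):
```
/* Paraphrased from md-cluster's RESYNCING message handling (simplified). */
static void note_remote_resync(struct mddev *mddev, sector_t lo, sector_t hi)
{
	if (!hi) {
		/* the remote node finished its resync */
		clear_bit(MD_RESYNCING_REMOTE, &mddev->recovery);
		return;
	}

	/*
	 * Another node is resyncing [lo, hi); remember it locally so that
	 * paths such as update_raid_disks() can back off with -EBUSY.
	 */
	set_bit(MD_RESYNCING_REMOTE, &mddev->recovery);
}
```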
In this bug, the resync was running on another node (let us call it node2).
The initiator side (let us call it node1) started the "--grow" cmd; it calls
raid1_reshape and returns successfully (please note node1 does not do the
resync job). But on node2 (which does the resync job), the code flow for
handling METADATA_UPDATED (sent by node1) is:
```
process_metadata_update
  md_reload_sb
    check_sb_changes
      update_raid_disks
```
update_raid_disks returns -EBUSY, but check_sb_changes does not check the
return value. So the reshape job is never done by node2, and in the end node2
uses stale data (e.g. rdev->raid_disks) to update the on-disk metadata.
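For reference, the pre-patch call site in check_sb_changes() (the two lines
removed by the diff below) drops that -EBUSY on the floor:
```
if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
	update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
```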
How to fix:
The simple and clear solution is to block the reshape action on the initiator
side. When a node executing "--grow" detects an ongoing resync, it should
immediately return and report an error to user space.
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
drivers/md/md.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
--
2.27.0