* [PATCH 0/2] md-cluster bugs fix
@ 2020-11-08 14:52 Zhao Heming
From: Zhao Heming @ 2020-11-08 14:52 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang
  Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli

Hello List,

I filed two patches to fix two different md-cluster bugs.
For clarity, let us call them issue 1 (related to patch 1)
and issue 2 (related to patch 2).

*** Test env ***

Nodes A & B share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB. The larger
the disks, the more likely the issues are to trigger
(longer resync time makes the races easier to hit).

*** Test script ***

Issues 1 & 2 can be triggered with the same test script:

```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
 #mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

There is a workaround:
adding --wait before the second --grow makes issue 1 disappear.


*** Error behavior ***

issue 1: every cmd in the test script finishes, but the final array status is wrong.
issue 2: the test script hangs when executing "mdadm --remove".

Array status of issue 1
(partial output of: mdadm -D /dev/md0):
<case 1> : normal test result.
```
    Number   Major   Minor   RaidDevice State
       1       8      112        0      active sync   /dev/sdh
       2       8      128        1      active sync   /dev/sdi
```

<case 2> : the "--fail"ed device still exists in the disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi

       0       8       96        -      faulty   /dev/sdg
```

<case 3> : the "--remove"d device's slot still exists in the disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi
```

Hang info of issue 2:
```
node1 # ps axj | grep mdadm
1  5423  5227  2231 ?    -1 D   0   0:00 mdadm /dev/md0 --remove /dev/sdg

node1 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdh[1] sdg[0](F)
      1046528 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (1/1046528) finish=354.0min speed=47K/sec
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

node2 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdg[0](F) sdh[1]
      1046528 blocks super 1.2 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

echo t > /proc/sysrq-trigger
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

*** Analysis ***

<issue 1>

In a cluster env, any node can start the resync job, even if the resync cmd was not executed on it.
e.g.
There are two nodes A & B. The user executes "mdadm --grow" on A, but sometimes B starts the resync job instead of A.

problem:
The current update_raid_disks() only checks the local recovery status, which is incomplete
(a condensed view of the check follows the scenario below).

issue scenario
```
  node A                           node B
------------------------------   ---------------
mdadm --grow -n 3 md0
 + raid1_reshape
    mddev->raid_disks: 2=>3
                                 start resync job, it will block A resync job
                                 mddev->raid_disks: 2=>3

mdadm md0 --fail sdg
 + update disk: array sb & bitmap sb
 + send METADATA_UPDATED
 (resync job blocking)
                                 (B continue doing "--grow -n 3" resync job)
                                 recv METADATA_UPDATED
                                  + read disk metadata
                                  + raid1_error
                                  + set MD_RECOVERY_INTR to break resync
                                 ... ...
                                 md_check_recovery
                                  + remove_and_add_spares return 1
                                  + set MD_RECOVERY_RECOVER, later restart resync

mdadm md0 --remove sdg
 + md_cluster_ops->remove_disk
 |  + send REMOVE
 + md_kick_rdev_from_array
 + update disk: array sb & bitmap sb
 (resync job blocking)
                                 (B continue doing "--grow -n 3" resync job)
                                 recv REMOVE
                                  + process_remove_disk doesn't set mddev->sb_flags, 
                                     so it doesn't update disk sb & bitmap sb.
                                 ......
                                 md_check_recovery
                                  + md_kick_rdev_from_array 

mdadm --grow -n 2 md0
 + raid1_reshape
 |  mddev->raid_disks: 3=>2
 + send METADATA_UPDATED

                                 (B continue doing "--grow -n 3" resync job)
                                 recv METADATA_UPDATED
                                  + check_sb_changes
                                     update_raid_disks return -EBUSY
                                     update failed for mddev->raid_disks: 3=>2


                                 (B never successfully updates mddev->raid_disks: 3=>2)
                                when B finish "--grow -n 3" resync job
                                 + use mddev->raid_disks:3 to update array sb & bitmap sb
                                 + send METADATA_UPDATED

recv METADATA_UPDATED
 + read wrong raid_disks to update
   kernel data.
```
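
Condensed, the update_raid_disks() check that patch 1 extends looks roughly
like this (simplified from patch 1's diff; the MD_RESYNCING_REMOTE line is
the addition):
```
/* drivers/md/md.c, update_raid_disks() -- condensed sketch, see patch 1 */
if (mddev->sync_thread ||
    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) || /* new: a remote node is resyncing */
    mddev->reshape_position != MaxSector)
        return -EBUSY;
```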

<issue 2>

First, note that a similar deadlock was fixed by commit 0ba959774e93911caff596de6391f085fb640ac4.

Let me explain commit 0ba959774e first (a condensed code view of the resulting check follows the diagram below).
```
<origin scenario>
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX
                         
                         c.
                         update sb
                          + held reconfig_mutex
                          + metadata_update_start
                             + wait_event(MD_CLUSTER_SEND_LOCK) //blocking from <b>

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + mddev_trylock(mddev) return false //blocking from <c>


<after introduction "MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD">
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX
                         
                         c.
                         update sb
                          + held reconfig_mutex
                           + metadata_update_start
                             + set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD //for breaking <d>
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                              //this does not block, breaking the deadlock.
```
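
For reference, the check that 0ba959774e added in process_metadata_update()
is roughly the following (a condensed sketch from memory, not an exact quote
of the kernel source):
```
/* md-cluster.c, process_metadata_update() -- rough sketch of the pattern
 * introduced by 0ba959774e: don't block unconditionally on reconfig_mutex,
 * also proceed when the local node is known to hold it for sending. */
int got_lock = 0;

wait_event(mddev->thread->wqueue,
           (got_lock = mddev_trylock(mddev)) ||
           test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
md_reload_sb(mddev, mddev->good_device_nr);
if (got_lock)
        mddev_unlock(mddev);
```
In issue 2 (below) neither condition ever becomes true, which is why
md0_cluster_rec is stuck in process_metadata_update() in the sysrq trace
above.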

Issue 2 is very similar to the one fixed by 0ba959774e, except for step <c>.
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both conditions stay false forever.
```

Commit 0ba959774e9391 uses MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD to break the deadlock, but in issue 2 it doesn't help.
md_cluster_ops->remove_disk is called from:
- state_store()      which doesn't hold reconfig_mutex
- hot_remove_disk()  which must hold reconfig_mutex

There are two methods to fix this:
1. like commit 0ba959774e, hold reconfig_mutex in state_store(), and set
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in remove_disk().
2. like patch 2, change wait_event to wait_event_timeout in lock_comm() &
   process_metadata_update() (a rough sketch of this idea follows below).

There are some reasons why I prefer method 2:
- I am not very familiar with all the scenarios in state_store().
  I am not sure whether holding reconfig_mutex there could introduce new bugs.
- It looks like every sendmsg path could trigger issue 2.
  So far we have found two cases (there may be others):
  - updating the sb (see commit 0ba959774)
  - mdadm --remove (issue 2)
  So we should break the deadlock in the key code path (wait_event => wait_event_timeout).
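
A rough sketch of what method 2 amounts to (illustrative only, not the exact
patch 2 diff; the 5-second timeout and the error handling are placeholders):
```
/* md-cluster.c -- illustrative sketch of method 2: bound the waits that can
 * deadlock, so the peer's recv_daemon can eventually make progress.
 * Timeout value and return handling are placeholders; see patch 2 for the
 * real change. */

/* in lock_comm(): was an unconditional wait_event(cinfo->wait, ...) */
if (!wait_event_timeout(cinfo->wait,
                !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state),
                msecs_to_jiffies(5000)))
        return -1;      /* give up instead of blocking forever */

/* in process_metadata_update(): was an unconditional wait_event() */
if (!wait_event_timeout(mddev->thread->wqueue,
                (got_lock = mddev_trylock(mddev)) ||
                test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state),
                msecs_to_jiffies(5000)))
        pr_warn("md-cluster: timed out waiting for reconfig_mutex\n"); /* placeholder */
```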

-------
v1:
- create patch

-------
Zhao Heming (2):
  md/cluster: reshape should returns error when remote doing resyncing
    job
  md/cluster: fix deadlock when doing reshape job

 drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
 drivers/md/md.c         |  8 ++++++--
 2 files changed, 33 insertions(+), 17 deletions(-)

-- 
2.27.0


* [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job
@ 2020-11-05 13:11 Zhao Heming
From: Zhao Heming @ 2020-11-05 13:11 UTC (permalink / raw)
  To: linux-raid, song, guoqing.jiang
  Cc: Zhao Heming, lidong.zhong, xni, neilb, colyli

Test script (reproducible steps):
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
mdadm --zero-superblock /dev/sd{g,h,i}
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
 #mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

sdg/sdh/sdi are 1GB iSCSI LUNs. The larger the shared disks are, the more
likely the issue is to trigger.

There is a workaround: adding --wait before the second --grow makes
this bug disappear.

Running the script can produce several different results
(output of: mdadm -D /dev/md0):
<case 1> : normal test result.
```
    Number   Major   Minor   RaidDevice State
       1       8      112        0      active sync   /dev/sdh
       2       8      128        1      active sync   /dev/sdi
```

<case 2> : the "--fail"ed device still exists in the disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi

       0       8       96        -      faulty   /dev/sdg
```

<case 3> : the "--remove"d device's slot still exists in the disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi
```

Root cause:
In an md-cluster env, there is no guarantee that the reshape action
(triggered by --grow) takes place on the current node. Any node in the
cluster can start the resync action, which may be triggered by another
node's --grow cmd. md-cluster only uses resync_lockres to make sure that
one node at a time does the resync job.

The key related code (with my patch) is:
```
     if (mddev->sync_thread ||
         test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+        test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
         mddev->reshape_position != MaxSector)
         return -EBUSY;
```
Without the test_bit of MD_RESYNCING_REMOTE, the 'if' block only handles
local recovery/resync events.
In this bug, the resync was running on another node (let us call it
node2). The initiator side (let us call it node1) starts the "--grow"
cmd, which calls raid1_reshape and returns successfully (note that node1
does not do the resync job itself). But on node2 (which does the resync
job), handling METADATA_UPDATED (sent by node1) goes through the
following code flow:
```
process_metadata_update
 md_reload_sb
  check_sb_changes
   update_raid_disks
```
update_raid_disks returns -EBUSY, but check_sb_changes doesn't handle the
return value. So the reshape is never applied on node2. In the end node2
uses stale data (e.g. rdev->raid_disks) to update the disk metadata.

How to fix:
The simple & clear solution is to block the reshape action on the
initiator side. When a node executing "--grow" detects an ongoing
resync, it should immediately return and report an error to user space.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+		test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
-- 
2.27.0

