linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Zhao Heming <heming.zhao@suse.com>
To: linux-raid@vger.kernel.org, song@kernel.org,
	guoqing.jiang@cloud.ionos.com
Cc: Zhao Heming <heming.zhao@suse.com>,
	lidong.zhong@suse.com, xni@redhat.com, neilb@suse.de,
	colyli@suse.de
Subject: [PATCH 0/2] md-cluster bugs fix
Date: Sun,  8 Nov 2020 22:52:59 +0800	[thread overview]
Message-ID: <1604847181-22086-1-git-send-email-heming.zhao@suse.com> (raw)

Hello List,

I filed two patches to fix 2 different md-cluster bugs.
For easy understanding, Let us call issue 1 (releted with patch 1),
and issue 2 (related with patch 2).

*** Test env ***

node A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB. The disk
size is more large the issues are more likely to trigger. 
(more resync time, more easily trigger issues)

*** Test script ***

Issue 1 & 2 can use same test script to trigger:

```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
 #mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```

There is a workaround: 
when adding the --wait before second --grow, the issue 1 will disappear.


*** error behavior ***

issue 1: test script can finish every cmds in script, but array status is wrong.
issue 2: test script will hung when executing "mdadm --remove".

array status of issue 1:
(part of output by: mdadm -D /dev/md0)
<case 1> : normal test result.
```
    Number   Major   Minor   RaidDevice State
       1       8      112        0      active sync   /dev/sdh
       2       8      128        1      active sync   /dev/sdi
```

<case 2> : "--faulty" data still exist on disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi

       0       8       96        -      faulty   /dev/sdg
```

<case 3> : "--remove" data still exist on disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi
```

array status of issue 2:
Hunging info:
```
node1 # ps axj | grep mdadm
1  5423  5227  2231 ?    -1 D   0   0:00 mdadm /dev/md0 --remove /dev/sdg

node1 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdh[1] sdg[0](F)
      1046528 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (1/1046528)
finish=354.0min speed=47K/sec
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

node2 # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdi[2] sdg[0](F) sdh[1]
      1046528 blocks super 1.2 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>

echo t > /proc/sysrq-trigger
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

*** analysis ***

<issue 1>

In cluster env, every node can start resync job even if the resync cmd doesn't execute on it.
e.g.
There are two node A & B. User executes "mdadm --grow" on A, sometime B will start resync job not A.

problem:
Current update_raid_disks() only check local recovery status, it's incomplete.

issue scenario
```
  node A                           node B
------------------------------   ---------------
mdadm --grow -n 3 md0
 + raid1_reshape
    mddev->raid_disks: 2=>3
                                 start resync job, it will block A resync job
                                 mddev->raid_disks: 2=>3

mdadm md0 --fail sdg
 + update disk: array sb & bitmap sb
 + send METADATA_UPDATE
 (resync job blocking)
                                 (B continue doing "--grow -n 3" resync job)
                                 recv METADATA_UPDATE
                                  + read disk metadata
                                  + raid1_error
                                  + set MD_RECOVERY_INTR to break resync
                                 ... ...
                                 md_check_recovery
                                  + remove_and_add_spares return 1
                                  + set MD_RECOVERY_RECOVER, later restart resync

mdadm md0 --remove sdg
 + md_cluster_ops->remove_disk
 |  + send REMOVE
 + md_kick_rdev_from_array
 + update disk: array sb & bitmap sb
 (resync job blocking)
                                 (B continue doing "--grow -n 3" resync job)
                                 recv REMOVE
                                  + process_remove_disk doesn't set mddev->sb_flags, 
                                     so it doesn't update disk sb & bitmap sb.
                                 ......
                                 md_check_recovery
                                  + md_kick_rdev_from_array 

mdadm --grow -n 2 md0
 + raid1_reshape
 |  mddev->raid_disks: 3=>2
 + send METADATA_UPDATED

                                 (B continue doing "--grow -n 3" resync job)
                                 recv METADATA_UPDATE
                                  + check_sb_changes
                                     update_raid_disks return -EBUSY
                                     update failed for mddev->raid_disks: 3=>2


                                (B never successfully update mddev->raid_disks: 3=>2)
                                when B finish "--grow -n 3" resync job
                                 + use mddev->raid_disks:3 to update array sb & bitmap sb
                                 + send METADATA_UPDATED

recv METADATA_UPDATED
 + read wrong raid_disks to update
   kernel data.
```

<issue 2>

First, There is a similar deadlock in commit 0ba959774e93911caff596de6391f085fb640ac4

Let me explain commit 0ba959774e first.
```
<origin scenario>
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX
                         
                         c.
                         update sb
                          + held reconfig_mutex
                          + metadata_update_start
                             + wait_event(MD_CLUSTER_SEND_LOCK) //blocking from <b>

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + mddev_trylock(mddev) return false //blocking from <c>


<after introduction "MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD">
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX
                         
                         c.
                         update sb
                          + held reconfig_mutex
                          + metadata_update_start wait for 
                             + set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD //for breaking <d>
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                              //this will non-block & break deadlock.
```

the issue 2 is very like 0ba959774e, except <c>.
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever. 
```

commit 0ba959774e9391 uses MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD to break deadlock, but in issue 2, it won't help.
md_cluster_ops->remove_disk called from:
- state_store()      this doesn't hold reconfig_mutex
- hot_remove_disk()  this must hold reconfig_mutex

There are two method to fix.
1. like commit 0ba959774e, held reconfig_mutex in state_store, and set 
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in remove_disk.
2. like patch 2, change wait_event to wait_event_timeout in lock_comm &
   process_metadata_update.

there are some reason I prefer method 2:
- I am not very familiar with all scenario in state_store(). 
  I am not sure if holding reconfig_mutex can cause new bug/issue.
- It looks all sendmsg cases could trigger issue 2. 
  Current we found two cases: (maybe there have other cases)
  - update sb (see commit 0ba959774)
  - mdadm --remove (issue 2)
  we should break the deadlock in key code (wait_event => wait_event_timeout).

-------
v1:
- create patch

-------
Zhao Heming (2):
  md/cluster: reshape should returns error when remote doing resyncing
    job
  md/cluster: fix deadlock when doing reshape job

 drivers/md/md-cluster.c | 42 ++++++++++++++++++++++++++---------------
 drivers/md/md.c         |  8 ++++++--
 2 files changed, 33 insertions(+), 17 deletions(-)

-- 
2.27.0


             reply	other threads:[~2020-11-08 14:53 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-08 14:52 Zhao Heming [this message]
2020-11-08 14:53 ` [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job Zhao Heming
2020-11-09 18:01   ` Song Liu
2020-11-10  6:38   ` Guoqing Jiang
2020-11-10  6:59     ` heming.zhao
2020-11-08 14:53 ` [PATCH 2/2] md/cluster: fix deadlock when doing reshape job Zhao Heming
2020-11-09  2:02   ` heming.zhao
2020-11-09 18:06     ` Song Liu
2020-11-10  2:24       ` heming.zhao
2020-11-10  6:36   ` Guoqing Jiang
2020-11-11 12:13     ` heming.zhao
2020-11-09 17:43 ` [PATCH 0/2] md-cluster bugs fix Song Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1604847181-22086-1-git-send-email-heming.zhao@suse.com \
    --to=heming.zhao@suse.com \
    --cc=colyli@suse.de \
    --cc=guoqing.jiang@cloud.ionos.com \
    --cc=lidong.zhong@suse.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=song@kernel.org \
    --cc=xni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).