* 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
@ 2017-05-22  9:13 Nix
  2017-05-22 11:35 ` NeilBrown
  0 siblings, 1 reply; 29+ messages in thread
From: Nix @ 2017-05-22  9:13 UTC (permalink / raw)
  To: linux-raid, linux-bcache

[I think bcache is blameless here, but there may be some interaction, so
 I've Cc:ed both lists. I'm fairly sure the problem is mddev_suspend()
 being called from level_store() and scheduling after suspension has
 happened, but I'm not sure what the fix might be.]

So I have a box with three RAID arrays on the same physical
rotating-rust devices:

 - a RAID-0 /dev/md/transient atop /dev/sd*2
 - a RAID-5 /dev/md/fast, a bcache backing device (with disabled cache
   for now) atop /dev/sd*3
 - a RAID-5 /dev/md/slow atop /dev/sd*4.

The rootfs and everything the system needs to run, including mdadm, is
on /dev/md/fast. fs-wise, everything is using xfs except for
/dev/md/transient, which houses an ext4 (unjournalled, for speed over
robustness).

I just got a new disk and grew /dev/md/transient to use it without
incident, then grew /dev/md/slow into a RAID-6:

mdadm --add /dev/md/slow /dev/sdf4
mdadm --grow /dev/md/slow --level=6 --backup-file=/.transient/backup-slow

This worked fine.

But the *exact same command* applied to /dev/md/fast hangs:

mdadm --add /dev/md/fast /dev/sdf3
mdadm --grow /dev/md/fast --level=6 --backup-file=/.transient/backup-fast
[hang here]

At this point, all disk I/O to /dev/md/fast appears suspended: because /
is on it, this leaves me rather unable to do anything but reboot.
Thankfully the deadlock seemingly happens before anything crucial is
done, and the machine reboots and brings up /dev/md/fast as an
unreshaped raid5 array with a spare.

Is this workflow impossible? Do you have to reshape the device the
rootfs is on from early userspace or something like that? If so... this
leaves a crucial mdadm running across the early userspace transition:
what do we do about the fs the backup file is on (since the transition
usually deletes everything)? Do we even *need* a backup file for a
reshape that's growing onto a spare, as the manpage suggests we do? Do
we simply have to leave the machine unavailable for days while the
reshape completes? (I hope not.)

The hangcheck output I was able to capture by eye suggests a deadlock
in block I/O during a brief instant of suspended I/O to the device. The
first hang we see comes from bcache, but since it's in "none" mode this
is pretty clearly just a writeback (note the md_make_request at the top
there):

[  248.877539] Workqueue: bcache cached_dev_nodata
[  248.888590] Call Trace:
[  248.899337]  __schedule+0x290/0x810
[  248.909745]  schedule+0x36/0x80
[  248.920113]  md_make_request+0x9c/0x220
[  248.930229]  ? wake_atomic_t_function+0x60/0x60
[  248.940063]  generic_make_request+0xfc/0x260
[  248.949803]  cached_dev_nodata+0x28/0x80
[  248.959393]  ? cached_dev_nodata+0x28/0x80
[  248.968998]  process_one_work+0x1e3/0x440
[  248.978414]  worker_thread+0x48/0x4d0
[  248.987596]  kthread+0x108/0x140
[  248.996747]  ? process_one_work+0x440/0x440
[  249.005965]  ? __kthread_create_on_node+0x150/0x150
[  249.015264]  ret_from_fork+0x29/0x40
[  249.024544] INFO: task kworker/u40:5:245 blocked for more than 120 seconds.

This is fairly obviously

        /* If it's a flush, we send the flush to the backing device too */
        closure_bio_submit(bio, cl);

in drivers/md/bcache/request.c, which is fine.

After that we start to see xfs writeback hangs, all hanging on
md_make_request() or I/O scheduling:

[  249.062946] Workqueue: writeback wb_workfn (flush-252:5)
[  249.072733] Call Trace:
[  249.082390]  __schedule+0x290/0x810
[  249.092053]  schedule+0x36/0x80
[  249.101568]  md_make_request+0x9c/0x220
[  249.110956]  ? wake_atomic_t_function+0x60/0x60
[  249.120202]  generic_make_request+0xfc/0x260
[  249.129219]  submit_bio+0x64/0x120
[  249.137968]  ? submit_bio+0x64/0x120
[  249.146840]  ? xfs_setfilesize_trans_alloc.isra.4+0x2f/0x70
[  249.155822]  xfs_submit_ioend+0x70/0x190
[  249.164805]  xfs_vm_writepages+0x65/0x70
[  249.173620]  do_writepages+0x1e/0x30
[  249.182239]  __writeback_single_inode+0x45/0x320
[  249.190838]  writeback_sb_inodes+0x191/0x490
[  249.199459]  __writeback_inodes_wb+0x92/0xc0
[  249.207988]  wb_writeback+0x243/0x2d0
[  249.216395]  wb_workfn+0x299/0x360
[  249.224851]  ? wb_workfn+0x299/0x360
[  249.233393]  process_one_work+0x1e3/0x440
[  249.242049]  worker_thread+0x48/0x4d0
[  249.250404]  kthread+0x108/0x140
[  249.258408]  ? process_one_work+0x440/0x440
[  249.266148]  ? __kthread_create_on_node+0x150/0x150
[  249.273652]  ret_from_fork+0x29/0x40

[  249.310928] Workqueue: xfs-cil/dm-5 xlog_cil_push_work
[  249.318453] Call Trace:
[  249.325929]  __schedule+0x290/0x810
[  249.333456]  schedule+0x36/0x80
[  249.340940]  schedule_timeout+0x207/0x360
[  249.348461]  ? cached_dev_make_request+0x2bc/0xd00
[  249.356018]  io_schedule_timeout+0x1e/0x50
[  249.363533]  ? io_schedule_timeout+0x1e/0x50
[  249.371078]  wait_for_common_io.constprop.0+0x92/0x110
[  249.378711]  ? try_to_wake_up+0x400/0x400
[  249.386339]  wait_for_completion_io+0x18/0x20
[  249.394026]  submit_bio_wait+0x59/0x70
[  249.401685]  blkdev_issue_flush+0x5c/0x90
[  249.409321]  xfs_blkdev_issue_flush+0x19/0x20
[  249.416951]  xlog_sync+0x2a1/0x3b0
[  249.424439]  xlog_state_release_iclog+0x6d/0xd0
[  249.431857]  xlog_write+0x64a/0x6e0
[  249.439092]  xlog_cil_push+0x230/0x450
[  249.446169]  xlog_cil_push_work+0x15/0x20
[  249.453128]  process_one_work+0x1e3/0x440
[  249.460030]  worker_thread+0x221/0x4d0
[  249.466942]  kthread+0x108/0x140
[  249.473843]  ? process_one_work+0x440/0x440
[  249.480762]  ? __kthread_create_on_node+0x150/0x150
[  249.487743]  ret_from_fork+0x29/0x40

[  249.524000] Workqueue: xfs-log/dm-5 xfs_log_worker
[  249.531490] Call Trace:
[  249.538923]  __schedule+0x290/0x810
[  249.546366]  schedule+0x36/0x80
[  249.553756]  schedule_timeout+0x207/0x360
[  249.561132]  ? check_preempt_curr+0x79/0x90
[  249.568514]  ? ttwu_do_wakeup.isra.15+0x1e/0x160
[  249.575957]  wait_for_common+0xaa/0x160
[  249.583432]  ? wait_for_common+0xaa/0x160
[  249.590943]  ? try_to_wake_up+0x400/0x400
[  249.598447]  wait_for_completion+0x1d/0x20
[  249.605984]  flush_work+0xfb/0x1b0
[  249.613509]  ? flush_workqueue_prep_pwqs+0x1a0/0x1a0
[  249.621093]  xlog_cil_force_lsn+0x75/0x1d0
[  249.628640]  ? dequeue_task_fair+0x757/0x1920
[  249.636169]  _xfs_log_force+0x76/0x280
[  249.643524]  ? xfs_log_worker+0x32/0x100
[  249.650715]  xfs_log_force+0x2c/0x80
[  249.657719]  xfs_log_worker+0x32/0x100
[  249.664653]  process_one_work+0x1e3/0x440
[  249.671498]  worker_thread+0x48/0x4d0
[  249.678354]  kthread+0x108/0x140
[  249.685218]  ? process_one_work+0x440/0x440
[  249.692123]  ? __kthread_create_on_node+0x150/0x150
[  249.699057]  ret_from_fork+0x29/0x40

[  249.727654] syslog-ng       D    0  1658   1657 0x00000000
[  249.735066] Call Trace:
[  249.742333]  __schedule+0x290/0x810
[  249.749624]  schedule+0x36/0x80
[  249.756915]  md_write_start+0x8f/0x160
[  249.764185]  ? wake_atomic_t_function+0x60/0x60
[  249.771469]  raid5_make_request+0x7a/0xd40
[  249.778762]  ? bch_data_insert_start+0x4da/0x5e0
[  249.786143]  ? wake_atomic_t_function+0x60/0x60
[  249.793571]  md_make_request+0xe3/0x220
[  249.800976]  generic_make_request+0xfc/0x260
[  249.808383]  submit_bio+0x64/0x120
[  249.815838]  ? submit_bio+0x64/0x120
[  249.823253]  ? xfs_setfilesize_trans_alloc.isra.4+0x2f/0x70
[  249.830757]  xfs_submit_ioend+0x70/0x190
[  249.838269]  xfs_vm_writepages+0x65/0x70
[  249.845668]  do_writepages+0x1e/0x30
[  249.852899]  __filemap_fdatawrite_range+0x71/0x90
[  249.860023]  filemap_write_and_wait_range+0x2a/0x70
[  249.867061]  xfs_file_fsync+0x54/0x190
[  249.874007]  vfs_fsync_range+0x49/0xa0
[  249.880934]  ? __fdget_pos+0x17/0x50
[  249.887793]  ? vfs_writev+0x3c/0x50
[  249.894700]  do_fsync+0x3d/0x70
[  249.901565]  SyS_fsync+0x10/0x20

I think we are stuck inside the

        if (mddev->suspended) {

block in md_make_request() in all these cases.
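
For reference, that block reads roughly like this in 4.11 -- this is
paraphrased from drivers/md/md.c from memory, so check your own tree:

	rcu_read_lock();
	if (mddev->suspended) {
		DEFINE_WAIT(__wait);
		for (;;) {
			prepare_to_wait(&mddev->sb_wait, &__wait,
					TASK_UNINTERRUPTIBLE);
			if (!mddev->suspended)
				break;
			rcu_read_unlock();
			schedule();
			rcu_read_lock();
		}
		finish_wait(&mddev->sb_wait, &__wait);
	}
	atomic_inc(&mddev->active_io);
	rcu_read_unlock();

Every bio submitted while ->suspended is nonzero just sleeps in that
schedule(), which matches all of the md_make_request() frames above
sitting in schedule().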

... and finally it becomes clear that mdadm is blocked and we are in
real trouble:

[  249.965701] INFO: task mdadm:2712 blocked for more than 120 seconds.
[  249.973003]       Not tainted 4.11.2-00005-g87b2117c3309-dirty #2
[  249.980381] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  249.987999] mdadm           D    0  2712   2020 0x00000000
[  249.995681] Call Trace:
[  250.003273]  __schedule+0x290/0x810
[  250.010885]  schedule+0x36/0x80
[  250.018503]  mddev_suspend+0xb3/0xe0
[  250.026158]  ? wake_atomic_t_function+0x60/0x60
[  250.033856]  level_store+0x1a7/0x6c0
[  250.041478]  ? md_ioctl+0xb7/0x1c10
[  250.049075]  ? putname+0x53/0x60
[  250.056554]  md_attr_store+0x83/0xc0
[  250.063922]  sysfs_kf_write+0x37/0x40
[  250.071082]  kernfs_fop_write+0x110/0x1a0
[  250.078127]  __vfs_write+0x28/0x120
[  250.085166]  ? kernfs_iop_get_link+0x172/0x1e0
[  250.092199]  ? __alloc_fd+0x3f/0x170
[  250.099271]  vfs_write+0xb6/0x1d0
[  250.106307]  SyS_write+0x46/0xb0
[  250.113227]  entry_SYSCALL_64_fastpath+0x13/0x94

I'm not sure what this is waiting on. It could be *any* of

        synchronize_rcu();
        wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
        mddev->pers->quiesce(mddev, 1);

since they all can probably schedule and if quiescing involves talking
to the device, we have already suspended it, so we're doomed.

I'm not sure how we can quiesce before we suspend and prevent things
from trying to issue more crucial requests (paging in mdadm?) before we
resume again, but I'm fairly sure that *something* different needs to be
done.


* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-22  9:13 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request Nix
@ 2017-05-22 11:35 ` NeilBrown
  2017-05-22 15:30   ` Nix
  2017-05-22 21:38   ` Nix
  0 siblings, 2 replies; 29+ messages in thread
From: NeilBrown @ 2017-05-22 11:35 UTC (permalink / raw)
  To: Nix, linux-raid, linux-bcache


On Mon, May 22 2017, Nix wrote:

> [I think bcache is blameless here, but there may be some interaction, so
>  I've Cc:ed both lists. I'm fairly sure the problem is mddev_suspend()
>  being called from level_store() and scheduling after suspension has
>  happened, but I'm not sure what the fix might be.]
>
> So I have a box with three RAID arrays on the same physical
> rotating-rust devices:
>
>  - a RAID-0 /dev/md/transient atop /dev/sd*2
>  - a RAID-5 /dev/md/fast, a bcache backing device (with disabled cache
>    for now) atop /dev/sd*3
>  - a RAID-5 /dev/md/slow atop /dev/sd*4.
>
> The rootfs and everything the system needs to run, including mdadm, is
> on /dev/md/fast. fs-wise, everything is using xfs except for
> /dev/md/transient, which houses an ext4 (unjournalled, for speed over
> robustness).
>
> I just got a new disk and grew /dev/md/transient to use it without
> incident, then grew /dev/md/slow into a RAID-6:
>
> mdadm --add /dev/md/slow /dev/sdf4
> mdadm --grow /dev/md/slow --level=6 --backup-file=/.transient/backup-slow
>
> This worked fine.
>
> But the *exact same command* applied to /dev/md/fast hangs:
>
> mdadm --add /dev/md/fast /dev/sdf3
> mdadm --grow /dev/md/fast --level=6 --backup-file=/.transient/backup-fast
> [hang here]
>
> At this point, all disk I/O to /dev/md/fast appears suspended: because /
> is on it, this leaves me rather unable to do anything but reboot.
> Thankfully the deadlock seemingly happens before anything crucial is
> done, and the machine reboots and brings up /dev/md/fast as an
> unreshaped raid5 array with a spare.
>
> Is this workflow impossible? Do you have to reshape the device the
> rootfs is on from early userspace or something like that? If so... this
> leaves a crucial mdadm running across the early userspace transition:
> what do we do about the fs the backup file is on (since the transition
> usually deletes everything)? Do we even *need* a backup file for a
> reshape that's growing onto a spare, as the manpage suggests we do? Do
> we simply have to leave the machine unavailable for days while the
> reshape completes? (I hope not.)
>
> The hangcheck output I was able to capture by eye suggests a deadlock
> in block I/O during a brief instant of suspended I/O to the device. The
> first hang we see comes from bcache, but since it's in "none" mode this
> is pretty clearly just a writeback (note the md_make_request at the top
> there):
>
> [  248.877539] Workqueue: bcache cached_dev_nodata
> [  248.888590] Call Trace:
> [  248.899337]  __schedule+0x290/0x810
> [  248.909745]  schedule+0x36/0x80
> [  248.920113]  md_make_request+0x9c/0x220
> [  248.930229]  ? wake_atomic_t_function+0x60/0x60
> [  248.940063]  generic_make_request+0xfc/0x260
> [  248.949803]  cached_dev_nodata+0x28/0x80
> [  248.959393]  ? cached_dev_nodata+0x28/0x80
> [  248.968998]  process_one_work+0x1e3/0x440
> [  248.978414]  worker_thread+0x48/0x4d0
> [  248.987596]  kthread+0x108/0x140
> [  248.996747]  ? process_one_work+0x440/0x440
> [  249.005965]  ? __kthread_create_on_node+0x150/0x150
> [  249.015264]  ret_from_fork+0x29/0x40
> [  249.024544] INFO: task kworker/u40:5:245 blocked for more than 120 seconds.
>
> This is fairly obviously
>
>         /* If it's a flush, we send the flush to the backing device too */
>         closure_bio_submit(bio, cl);
>
> in drivers/md/bcache/request.c, which is fine.
>
> After that we start to see xfs writeback hangs, all hanging on
> md_make_request() or I/O scheduling:
>
> [  249.062946] Workqueue: writeback wb_workfn (flush-252:5)
> [  249.072733] Call Trace:
> [  249.082390]  __schedule+0x290/0x810
> [  249.092053]  schedule+0x36/0x80
> [  249.101568]  md_make_request+0x9c/0x220
> [  249.110956]  ? wake_atomic_t_function+0x60/0x60
> [  249.120202]  generic_make_request+0xfc/0x260
> [  249.129219]  submit_bio+0x64/0x120
> [  249.137968]  ? submit_bio+0x64/0x120
> [  249.146840]  ? xfs_setfilesize_trans_alloc.isra.4+0x2f/0x70
> [  249.155822]  xfs_submit_ioend+0x70/0x190
> [  249.164805]  xfs_vm_writepages+0x65/0x70
> [  249.173620]  do_writepages+0x1e/0x30
> [  249.182239]  __writeback_single_inode+0x45/0x320
> [  249.190838]  writeback_sb_inodes+0x191/0x490
> [  249.199459]  __writeback_inodes_wb+0x92/0xc0
> [  249.207988]  wb_writeback+0x243/0x2d0
> [  249.216395]  wb_workfn+0x299/0x360
> [  249.224851]  ? wb_workfn+0x299/0x360
> [  249.233393]  process_one_work+0x1e3/0x440
> [  249.242049]  worker_thread+0x48/0x4d0
> [  249.250404]  kthread+0x108/0x140
> [  249.258408]  ? process_one_work+0x440/0x440
> [  249.266148]  ? __kthread_create_on_node+0x150/0x150
> [  249.273652]  ret_from_fork+0x29/0x40
>
> [  249.310928] Workqueue: xfs-cil/dm-5 xlog_cil_push_work
> [  249.318453] Call Trace:
> [  249.325929]  __schedule+0x290/0x810
> [  249.333456]  schedule+0x36/0x80
> [  249.340940]  schedule_timeout+0x207/0x360
> [  249.348461]  ? cached_dev_make_request+0x2bc/0xd00
> [  249.356018]  io_schedule_timeout+0x1e/0x50
> [  249.363533]  ? io_schedule_timeout+0x1e/0x50
> [  249.371078]  wait_for_common_io.constprop.0+0x92/0x110
> [  249.378711]  ? try_to_wake_up+0x400/0x400
> [  249.386339]  wait_for_completion_io+0x18/0x20
> [  249.394026]  submit_bio_wait+0x59/0x70
> [  249.401685]  blkdev_issue_flush+0x5c/0x90
> [  249.409321]  xfs_blkdev_issue_flush+0x19/0x20
> [  249.416951]  xlog_sync+0x2a1/0x3b0
> [  249.424439]  xlog_state_release_iclog+0x6d/0xd0
> [  249.431857]  xlog_write+0x64a/0x6e0
> [  249.439092]  xlog_cil_push+0x230/0x450
> [  249.446169]  xlog_cil_push_work+0x15/0x20
> [  249.453128]  process_one_work+0x1e3/0x440
> [  249.460030]  worker_thread+0x221/0x4d0
> [  249.466942]  kthread+0x108/0x140
> [  249.473843]  ? process_one_work+0x440/0x440
> [  249.480762]  ? __kthread_create_on_node+0x150/0x150
> [  249.487743]  ret_from_fork+0x29/0x40
>
> [  249.524000] Workqueue: xfs-log/dm-5 xfs_log_worker
> [  249.531490] Call Trace:
> [  249.538923]  __schedule+0x290/0x810
> [  249.546366]  schedule+0x36/0x80
> [  249.553756]  schedule_timeout+0x207/0x360
> [  249.561132]  ? check_preempt_curr+0x79/0x90
> [  249.568514]  ? ttwu_do_wakeup.isra.15+0x1e/0x160
> [  249.575957]  wait_for_common+0xaa/0x160
> [  249.583432]  ? wait_for_common+0xaa/0x160
> [  249.590943]  ? try_to_wake_up+0x400/0x400
> [  249.598447]  wait_for_completion+0x1d/0x20
> [  249.605984]  flush_work+0xfb/0x1b0
> [  249.613509]  ? flush_workqueue_prep_pwqs+0x1a0/0x1a0
> [  249.621093]  xlog_cil_force_lsn+0x75/0x1d0
> [  249.628640]  ? dequeue_task_fair+0x757/0x1920
> [  249.636169]  _xfs_log_force+0x76/0x280
> [  249.643524]  ? xfs_log_worker+0x32/0x100
> [  249.650715]  xfs_log_force+0x2c/0x80
> [  249.657719]  xfs_log_worker+0x32/0x100
> [  249.664653]  process_one_work+0x1e3/0x440
> [  249.671498]  worker_thread+0x48/0x4d0
> [  249.678354]  kthread+0x108/0x140
> [  249.685218]  ? process_one_work+0x440/0x440
> [  249.692123]  ? __kthread_create_on_node+0x150/0x150
> [  249.699057]  ret_from_fork+0x29/0x40
>
> [  249.727654] syslog-ng       D    0  1658   1657 0x00000000
> [  249.735066] Call Trace:
> [  249.742333]  __schedule+0x290/0x810
> [  249.749624]  schedule+0x36/0x80
> [  249.756915]  md_write_start+0x8f/0x160
> [  249.764185]  ? wake_atomic_t_function+0x60/0x60
> [  249.771469]  raid5_make_request+0x7a/0xd40
> [  249.778762]  ? bch_data_insert_start+0x4da/0x5e0
> [  249.786143]  ? wake_atomic_t_function+0x60/0x60
> [  249.793571]  md_make_request+0xe3/0x220
> [  249.800976]  generic_make_request+0xfc/0x260
> [  249.808383]  submit_bio+0x64/0x120
> [  249.815838]  ? submit_bio+0x64/0x120
> [  249.823253]  ? xfs_setfilesize_trans_alloc.isra.4+0x2f/0x70
> [  249.830757]  xfs_submit_ioend+0x70/0x190
> [  249.838269]  xfs_vm_writepages+0x65/0x70
> [  249.845668]  do_writepages+0x1e/0x30
> [  249.852899]  __filemap_fdatawrite_range+0x71/0x90
> [  249.860023]  filemap_write_and_wait_range+0x2a/0x70
> [  249.867061]  xfs_file_fsync+0x54/0x190
> [  249.874007]  vfs_fsync_range+0x49/0xa0
> [  249.880934]  ? __fdget_pos+0x17/0x50
> [  249.887793]  ? vfs_writev+0x3c/0x50
> [  249.894700]  do_fsync+0x3d/0x70
> [  249.901565]  SyS_fsync+0x10/0x20
>
> I think we are stuck inside the
>
>         if (mddev->suspended) {
>
> block in md_make_request() in all these cases.
>
> ... and finally it becomes clear that mdadm is blocked and we are in
> real trouble:
>
> [  249.965701] INFO: task mdadm:2712 blocked for more than 120 seconds.
> [  249.973003]       Not tainted 4.11.2-00005-g87b2117c3309-dirty #2
> [  249.980381] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  249.987999] mdadm           D    0  2712   2020 0x00000000
> [  249.995681] Call Trace:
> [  250.003273]  __schedule+0x290/0x810
> [  250.010885]  schedule+0x36/0x80
> [  250.018503]  mddev_suspend+0xb3/0xe0
> [  250.026158]  ? wake_atomic_t_function+0x60/0x60
> [  250.033856]  level_store+0x1a7/0x6c0
> [  250.041478]  ? md_ioctl+0xb7/0x1c10
> [  250.049075]  ? putname+0x53/0x60
> [  250.056554]  md_attr_store+0x83/0xc0
> [  250.063922]  sysfs_kf_write+0x37/0x40
> [  250.071082]  kernfs_fop_write+0x110/0x1a0
> [  250.078127]  __vfs_write+0x28/0x120
> [  250.085166]  ? kernfs_iop_get_link+0x172/0x1e0
> [  250.092199]  ? __alloc_fd+0x3f/0x170
> [  250.099271]  vfs_write+0xb6/0x1d0
> [  250.106307]  SyS_write+0x46/0xb0
> [  250.113227]  entry_SYSCALL_64_fastpath+0x13/0x94
>
> I'm not sure what this is waiting on. It could be *any* of
>
>         synchronize_rcu();
>         wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
>         mddev->pers->quiesce(mddev, 1);

It says 'schedule()' so it is in wait_event().

>
> since they all can probably schedule and if quiescing involves talking
> to the device, we have already suspended it, so we're doomed.
>
> I'm not sure how we can quiesce before we suspend and prevent things
> from trying to issue more crucial requests (paging in mdadm?) before we
> resume again, but I'm fairly sure that *something* different needs to be
> done.

Congratulations.  You have found a bug that dates from 2011.

Commit: 68866e425be2 ("MD: no sync IO while suspended")

(I think).

A write request gets to raid5_make_request() before mddev_suspend() sets
mddev->suspended.
raid5_make_request() needs md_write_start() to mark the array "dirty"
which it asks the md thread to do, but before the thread gets to the
task, mddev->suspend has been set, so md_check_recovery() doesn't update
the superblock, so md_write_start() doesn't proceed, so the request
never completes, so mddev_suspend() blocks indefinitely.

I think that md_check_recovery() needs to test for mddev->suspended
somewhere else.
It needs to allow superblock updates, and probably needs to reap the
recovery thread, but must not start a new recovery thread.
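
To spell out the last link in that chain, md_write_start() sets the
change-pending flag, pokes the thread, and then sleeps; roughly this
(modulo the exact flag and field names in 4.11):

	set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
	md_wakeup_thread(mddev->thread);
	...
	wait_event(mddev->sb_wait,
		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));

That bit is only cleared once md_update_sb() runs, which the md thread
normally reaches via md_check_recovery(), so if md_check_recovery()
bails out early the writer sleeps forever.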

Probably something like this:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f6ae1d67bcd0..dbca31be22a1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws)
  */
 void md_check_recovery(struct mddev *mddev)
 {
-	if (mddev->suspended)
-		return;
 
 	if (mddev->bitmap)
 		bitmap_daemon_work(mddev);
@@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev)
 		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
 
 		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
+		    mddev->suspended ||
 		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
 			goto not_running;
 		/* no recovery is running.

though it's late so don't trust anything I write.

If you try again it will almost certainly succeed.  I suspect this is a
hard race to hit - well done!!!

NeilBrown



* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-22 11:35 ` NeilBrown
@ 2017-05-22 15:30   ` Nix
  2017-05-22 19:07     ` Wols Lists
  2017-05-23  1:07     ` NeilBrown
  2017-05-22 21:38   ` Nix
  1 sibling, 2 replies; 29+ messages in thread
From: Nix @ 2017-05-22 15:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

[linux-bcache removed, not a bcache bug]

On 22 May 2017, NeilBrown said this:

> Congratulations.  You have found a bug that dates from 2011.

Oh so not that old then! (My personal record, in a previous job, was a
bug in allegedly critical functionality that entirely stopped it working
and had been broken for around fourteen years. Like hell it was
critical.)

> Commit: 68866e425be2 ("MD: no sync IO while suspended")
>
> (I think).
>
> A write request gets to raid5_make_request() before mddev_suspend() sets
> mddev->suspended.
> raid5_make_request() needs md_write_start() to mark the array "dirty"
> which it asks the md thread to do, but before the thread gets to the
> task, mddev->suspended has been set, so md_check_recovery() doesn't update
> the superblock, so md_write_start() doesn't proceed, so the request
> never completes, so mddev_suspend() blocks indefinitely.

Ooof. Looks like it's fsync()ing something -- not sure what, though.
The previous time it was a metadata update...

Hm. I note that ->events is meant to be for things like device additions
and array starts. RAID5 -> RAID6 reshaping appears to push this up a
lot:

         Events : 1475948

(it was at around 4000 before this -- itself rather mystifying, given
that I've only rebooted this machine 27 times, and the array's probably
been assembled less than 20 times, and was created with the same number
of devices it has now.)

The other array has only 92 events (it was at 30-something before I
tried this reshape).

>> I think that md_check_recovery() needs to test for mddev->suspended
> somewhere else.
> It needs to allow superblock updates, and probably needs to reap the
> recovery thread, but must not start a new recovery thread.

My question would be, is all the intricate stuff md_check_recovery() is
doing valid on an mddev that's not only suspended but detached? Because
we detach right after the suspension in level_store()...
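
For context, the sequence in level_store() is roughly (4.11, heavily
abbreviated):

	mddev_suspend(mddev);
	mddev_detach(mddev);
	...
	mddev->pers = pers;	/* swap in the new personality */
	...
	mddev_resume(mddev);

so md_check_recovery() could get called on an array in a fairly
twilight state.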

(btw, you probably want to remove the comment above enum array_state
that says that "suspended" is "not supported yet", too.)

> Probably something like this:
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6ae1d67bcd0..dbca31be22a1 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws)
>   */
>  void md_check_recovery(struct mddev *mddev)
>  {
> -	if (mddev->suspended)
> -		return;
>  
>  	if (mddev->bitmap)
>  		bitmap_daemon_work(mddev);
> @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev)
>  		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
>  
>  		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> +		    mddev->suspended ||
>  		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
>  			goto not_running;
>  		/* no recovery is running.
>
> though it's late so don't trust anything I write.

This may end up clearing MD_RECOVERY_INTR and MD_RECOVERY_DONE, but I
guess in this context it's safe to do so... looks OK otherwise to the
totally ignorant fool reading this patch!

Hm, being picky now, md_check_recovery() might be slightly easier to
read if that governing mutex_trylock() conditional was inverted:

	if (!mddev_trylock(mddev))
		return;

Then you could de-indent most of the function by one indentation
level...

> If you try again it will almost certainly succeed.  I suspect this is a
> hard race to hit - well done!!!

I'll give it a try -- I hit it twice in succession, once with a
--backup-file, once without. Since mdadm does not warn about the lack of
a --backup-file, I guess the statement in the manual that it is
essential to provide one when changing RAID levels is untrue: I suspect
that it's necessary *if* you're not increasing the number of disks at
the same time, but since I'm growing into a spare, adding a
--backup-file only slows it down?

I might run a backup first though. :)


* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-22 15:30   ` Nix
@ 2017-05-22 19:07     ` Wols Lists
  2017-05-22 20:43       ` Nix
  2017-05-23  1:39       ` NeilBrown
  2017-05-23  1:07     ` NeilBrown
  1 sibling, 2 replies; 29+ messages in thread
From: Wols Lists @ 2017-05-22 19:07 UTC (permalink / raw)
  To: Nix, NeilBrown; +Cc: linux-raid

On 22/05/17 16:30, Nix wrote:
> I'll give it a try -- I hit it twice in succession, once with a
> --backup-file, once without. Since mdadm does not warn about the lack of
> a --backup-file, I guess the statement in the manual that it is
> essential to provide one when changing RAID levels is untrue: I suspect
> that it's necessary *if* you're not increasing the number of disks at
> the same time, but since I'm growing into a spare, adding a
> --backup-file only slows it down?

I did discuss this with Neil while I wrote it, so I hope I got it right :-)

https://raid.wiki.kernel.org/index.php/A_guide_to_mdadm#Array_internals_and_how_it_affects_mdadm

aiui, provided you're using a v1 superblock, the data offset means there
is spare space on the drives precisely for the purpose (one of them at
least) of keeping a backup. So the reshape will start reshaping into the
spare space and eliminate the need for the backup - the new version of
the stripe will be safely written before the space occupied by the old
stripe is required.

Cheers,
Wol


* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-22 19:07     ` Wols Lists
@ 2017-05-22 20:43       ` Nix
  2017-05-23  1:20         ` NeilBrown
  2017-05-23  1:39       ` NeilBrown
  1 sibling, 1 reply; 29+ messages in thread
From: Nix @ 2017-05-22 20:43 UTC (permalink / raw)
  To: Wols Lists; +Cc: NeilBrown, linux-raid

On 22 May 2017, Wols Lists verbalised:

> aiui, provided you're using a v1 superblock, the data offset means there
> is spare space on the drives precisely for the purpose (one of then at
> least) of keeping a backup. So the reshape will start reshaping into the
> spare space and eliminate the need for the backup - the new version of
> the stripe will be safely written before the space occupied by the old
> stripe is required.

But it's only a few KiB by default! The amount of seeking needed to
reshape with such a small intermediate would be fairly horrific. (It was
bad enough as it was: the reshape of 7TiB took more than two days,
running at under 15MiB/s, though the component drives can all handle
220MiB/s easily. The extra time was spent seeking to and from the
backup, it seems.)

> Typically, the new layout will move the data offset, pushing the data
> block up or down.

Will it? Why? You'd need to move the data offset quite a long way to
make enough space for the 500MiB backup file I saw being created during
my last reshape.

(Though, for me, this is somewhat redundant:

    Data Offset : 262144 sectors

because I set a data offset specifically to allow reshaping without
having to move anything. Is the idea that reshaping that adds data
spindles will move the data offset such that it is (still) on a chunk or
stripe multiple? That's neat, if so, and means I wasted 128MiB on this,
uh, 12TiB array. OK I'm not terribly blown away by this, particularly
given that I'm wasting the same again inside the bcache partition for
the same reason: I'm sure mdadm won't move *that* data offset.)

-- 
NULL && (void)


* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-22 11:35 ` NeilBrown
  2017-05-22 15:30   ` Nix
@ 2017-05-22 21:38   ` Nix
  2017-05-23 14:16     ` Wols Lists
  2017-05-24  1:24     ` NeilBrown
  1 sibling, 2 replies; 29+ messages in thread
From: Nix @ 2017-05-22 21:38 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 22 May 2017, NeilBrown told this:

> Probably something like this:
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6ae1d67bcd0..dbca31be22a1 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws)
>   */
>  void md_check_recovery(struct mddev *mddev)
>  {
> -	if (mddev->suspended)
> -		return;
>  
>  	if (mddev->bitmap)
>  		bitmap_daemon_work(mddev);
> @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev)
>  		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
>  
>  		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> +		    mddev->suspended ||
>  		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
>  			goto not_running;
>  		/* no recovery is running.
>
> though it's late so don't trust anything I write.
>
> If you try again it will almost certainly succeed.  I suspect this is a
> hard race to hit - well done!!!

Definitely not a hard race to hit :( I just hit it again with this
patch.

Absolutely identical hang:

[  495.833520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  495.840618] mdadm           D    0  2700   2537 0x00000000
[  495.847762] Call Trace:
[  495.854825]  __schedule+0x290/0x810
[  495.861905]  schedule+0x36/0x80
[  495.868934]  mddev_suspend+0xb3/0xe0
[  495.875926]  ? wake_atomic_t_function+0x60/0x60
[  495.882976]  level_store+0x1a7/0x6c0
[  495.889953]  ? md_ioctl+0xb7/0x1c10
[  495.896901]  ? putname+0x53/0x60
[  495.903807]  md_attr_store+0x83/0xc0
[  495.910684]  sysfs_kf_write+0x37/0x40
[  495.917547]  kernfs_fop_write+0x110/0x1a0
[  495.924429]  __vfs_write+0x28/0x120
[  495.931270]  ? kernfs_iop_get_link+0x172/0x1e0
[  495.938126]  ? __alloc_fd+0x3f/0x170
[  495.944906]  vfs_write+0xb6/0x1d0
[  495.951646]  SyS_write+0x46/0xb0
[  495.958338]  entry_SYSCALL_64_fastpath+0x13/0x94

Everything else hangs the same way, too. This was surprising enough that
I double-checked to be sure the patch was applied: it was. I suspect the
deadlock is somewhat different than you supposed... (and quite possibly
not a race at all, or I wouldn't be hitting it so consistently, every
time. I mean, I only need to miss it *once* and I'll have reshaped... :) )

It seems I can reproduce this on demand, so if you want to throw a patch
with piles of extra printks my way, feel free.


* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-22 15:30   ` Nix
  2017-05-22 19:07     ` Wols Lists
@ 2017-05-23  1:07     ` NeilBrown
  1 sibling, 0 replies; 29+ messages in thread
From: NeilBrown @ 2017-05-23  1:07 UTC (permalink / raw)
  To: Nix; +Cc: linux-raid


On Mon, May 22 2017, Nix wrote:

> [linux-bcache removed, not a bcache bug]
>
> On 22 May 2017, NeilBrown said this:
>
>> Congratulations.  You have found a bug that dates from 2011.
>
> Oh so not that old then! (My personal record, in a previous job, was a
> bug in allegedly critical functionality that entirely stopped it working
> and had been broken for around fourteen years. Like hell it was
> critical.)
>
>> Commit: 68866e425be2 ("MD: no sync IO while suspended")
>>
>> (I think).
>>
>> A write request gets to raid5_make_request() before mddev_suspend() sets
>> mddev->suspended.
>> raid5_make_request() needs md_write_start() to mark the array "dirty"
>> which it asks the md thread to do, but before the thread gets to the
>> task, mddev->suspended has been set, so md_check_recovery() doesn't update
>> the superblock, so md_write_start() doesn't proceed, so the request
>> never completes, so mddev_suspend() blocks indefinitely.
>
> Ooof. Looks like it's fsync()ing something -- not sure what, though.
> The previous time it was a metadata update...
>
> Hm. I note that ->events is meant to be for things like device additions
> and array starts. RAID5 -> RAID6 reshaping appears to push this up a
> lot:
>
>          Events : 1475948
>
> (it was at around 4000 before this -- itself rather mystifying, given
> that I've only rebooted this machine 27 times, and the array's probably
> been assembled less than 20 times, and was created with the same number
> of devices it has now.)

Originally, Events was for any metadata change, including
  clean <-> dirty
transitions.  This kept waking up spares though, so I changed it to
change by -1 when changing from dirty to clean, and to not bother with
telling the spares.  That was a little problematic too, so now
it always increments, except for dirty->clean transitions where there
are spares.

For a reshape, the reshape is checkpointed quite frequently, and each
checkpoint increases the event counter.

>
> The other array has only 92 events (it was at 30-something before I
> tried this reshape).
>
>> I think that md_check_recovery() needs to test for mddev->suspended
>> somewhere else.
>> It needs to allow superblock updates, and probably needs to reap the
>> recovery thread, but must not start a new recovery thread.
>
> My question would be, is all the intricate stuff md_check_recovery() is
> doing valid on an mddev that's not only suspended but detached? Because
> we detach right after the suspension in level_store()...

mddev_detach() kills the thread.  After it completes,
md_check_recovery() cannot possibly run.

However I now see that my patch was insufficient.
mddev_suspend() is called while mddev_lock() is held, and that prevents
md_check_recovery() from writing the superblock, so it can still
deadlock.
This will require more careful analysis.
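
Schematically, the cycle is now:

	mdadm (level_store):           md thread (md_check_recovery):
	  mddev_lock()                   mddev_trylock() fails, so it
	  mddev_suspend()                returns without ever calling
	    waits for active_io == 0     md_update_sb()

and meanwhile a writer sits in md_write_start() waiting for the
superblock write that never comes, holding active_io up.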

>
> (btw, you probably want to remove the comment above enum array_state
> that says that "suspended" is "not supported yet", too.)

Yeah.... though it isn't supported by all personalities I think.

>
>> Probably something like this:
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index f6ae1d67bcd0..dbca31be22a1 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws)
>>   */
>>  void md_check_recovery(struct mddev *mddev)
>>  {
>> -	if (mddev->suspended)
>> -		return;
>>  
>>  	if (mddev->bitmap)
>>  		bitmap_daemon_work(mddev);
>> @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev)
>>  		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
>>  
>>  		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
>> +		    mddev->suspended ||
>>  		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
>>  			goto not_running;
>>  		/* no recovery is running.
>>
>> though it's late so don't trust anything I write.
>
> This may end up clearing MD_RECOVERY_INTR and MD_RECOVERY_DONE, but I
> guess in this context it's safe to do so... looks OK otherwise to the
> totally ignorant fool reading this patch!
>
> Hm, being picky now, md_check_recovery() might be slightly easier to
> read if that governing mutex_trylock() conditional was inverted:
>
> 	if (!mddev_trylock(mddev))
> 		return;
>
> Then you could de-indent most of the function by one indentation
> level...

You could send a patch if you liked.... but as it looks like I have more
work to do there, I'll probably do it.

>
>> If you try again it will almost certainly succeed.  I suspect this is a
>> hard race to hit - well done!!!
>
> I'll give it a try -- I hit it twice in succession, once with a
> --backup-file, once without. Since mdadm does not warn about the lack of
> a --backup-file, I guess the statement in the manual that it is
> essential to provide one when changing RAID levels is untrue: I suspect
> that it's necessary *if* you're not increasing the number of disks at
> the same time, but since I'm growing into a spare, adding a
> --backup-file only slows it down?
>
> I might run a backup first though. :)

Always a good idea, if you can.

Thanks,
NeilBrown



* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-22 20:43       ` Nix
@ 2017-05-23  1:20         ` NeilBrown
  2017-05-23 10:10           ` Nix
  0 siblings, 1 reply; 29+ messages in thread
From: NeilBrown @ 2017-05-23  1:20 UTC (permalink / raw)
  To: Nix, Wols Lists; +Cc: linux-raid


On Mon, May 22 2017, Nix wrote:

> On 22 May 2017, Wols Lists verbalised:
>
>> aiui, provided you're using a v1 superblock, the data offset means there
>> is spare space on the drives precisely for the purpose (one of then at
>> least) of keeping a backup. So the reshape will start reshaping into the
>> spare space and eliminate the need for the backup - the new version of
>> the stripe will be safely written before the space occupied by the old
>> stripe is required.
>
> But it's only a few KiB by default! The amount of seeking needed to
> reshape with such a small intermediate would be fairly horrific. (It was
> bad enough as it was: the reshape of 7TiB took more than two days,
> running at under 15MiB/s, though the component drives can all handle
> 220MiB/s easily. The extra time was spent seeking to and from the
> backup, it seems.)

If the space before were "only a few KiB", it wouldn't be used.
You need at least 1 full stripe, typically more.
Current mdadm leaves several megabytes I think.

>
>> Typically, the new layout will move the data offset, pushing the data
>> block up or down.
>
> Will it? Why? You'd need to move the data offset quite a long way to
> make enough space for the 500MiB backup file I saw being created during
> my last reshape.
>
> (Though, for me, this is somewhat redundant:
>
>     Data Offset : 262144 sectors
>
> because I set a data offset specifically to allow reshaping without
> having to move anything. Is the idea that reshaping that adds data
> spindles will move the data offset such that it is (still) on a chunk or
> stripe multiple? That's neat, if so, and means I wasted 128MiB on this,
> uh, 12TiB array. OK I'm not terribly blown away by this, particularly
> given that I'm wasting the same again inside the bcache partition for
> the same reason: I'm sure mdadm won't move *that* data offset.)

Data offset is always moved by a multiple of the chunk size.
When I create a 12-device raid5 on 1TB devices, then examine one of them,
it says:

    Data Offset : 262144 sectors
   Unused Space : before=262064 sectors, after=0 sectors
     Chunk Size : 512K

so there is 130 megabytes of space per device, enough for 255 chunks.
When mdadm moves the Data Offset to allow a reshape to happen without a
backup file, it aims to use half the available space.  So it would use
about 60Meg in about 120 chunks or 720Meg total across all devices.
This is more than the 500MiB backup file you saw.

NeilBrown

>
> -- 
> NULL && (void)



* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-22 19:07     ` Wols Lists
  2017-05-22 20:43       ` Nix
@ 2017-05-23  1:39       ` NeilBrown
  2017-05-23 14:47         ` Wols Lists
  1 sibling, 1 reply; 29+ messages in thread
From: NeilBrown @ 2017-05-23  1:39 UTC (permalink / raw)
  To: Wols Lists, Nix; +Cc: linux-raid


On Mon, May 22 2017, Wols Lists wrote:

> On 22/05/17 16:30, Nix wrote:
>> I'll give it a try -- I hit it twice in succession, once with a
>> --backup-file, once without. Since mdadm does not warn about the lack of
>> a --backup-file, I guess the statement in the manual that it is
>> essential to provide one when changing RAID levels is untrue: I suspect
>> that it's necessary *if* you're not increasing the number of disks at
>> the same time, but since I'm growing into a spare, adding a
>> --backup-file only slows it down?
>
> I did discuss this with Neil while I wrote it, so I hope I got it right :-)
>
> https://raid.wiki.kernel.org/index.php/A_guide_to_mdadm#Array_internals_and_how_it_affects_mdadm
>
> aiui, provided you're using a v1 superblock, the data offset means there
> is spare space on the drives precisely for the purpose (one of them at
> least) of keeping a backup. So the reshape will start reshaping into the
> spare space and eliminate the need for the backup - the new version of
> the stripe will be safely written before the space occupied by the old
> stripe is required.
>
> Cheers,
> Wol

Proof-reading time.  I'll be very picky.  You'll ignore places where you
think my pickiness isn't helpful.

> The first arrays did not have a superblock, and were declared in
> mdadm.conf.

The first arrays (linear and raid0 only) did not have a superblock and
were declared in "raidtab" which was managed by the deprecated
"raid-tools" which have been replaced by mdadm.

mdadm does not allow you to declare no-superblock arrays in mdadm.conf.
You can build them with "mdadm --build .....", but that is all.

> and the backup area when reshaping an array.

Not quite.  The whole point of being able to move the data offset
is that a backup isn't needed.  Maybe it could read:

 ... and some extra space so that the data areas can be moved forward or
 backward on the device by several stripes.  This is useful when
 reshaping.

> All operations that involve moving data around are called reshapes,
> and require some form of backup.

 All operations that involve moving data around are called reshapes and
 need to be careful only to write to a region of the device which
 can safely be corrupted.  If a system crash happens during a reshape,
 the region that was being written to must be considered to contain
 garbage.
 The preferred mechanism is to relocate the data by a few stripes,
 reading data from a region of the disk that contains live data, and
 writing it into a sliding window that is otherwise unused.  After a few
 stripes have been copied, the metadata in the superblock is updated, so
 that the newly written data is now "live" and the location it was
 copied from is now "unused" and can have more stripes copied into it.
 This process results in the Data Offset being moved by several
 megabytes, either forward or backward, and it is to allow for this to
 happen that the Data Offset is set to several to many megabytes when an
 array is created.

 If it is not possible to move the Data Offset, either because there is
 no room or because the v0.90 superblock is in use, mdadm will take a
 backup of the region of data being reshaped before it allows the
 reshape to progress.  In the event of a crash, mdadm will restore this
 data over the potentially corrupted region of the array before starting
 the array.

Your current text already covers some of this, and there probably isn't a
need to replace it all.
But I think it is important not to talk about the reserved space in the
devices as a backup space.

Thanks,
NeilBrown



* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-23  1:20         ` NeilBrown
@ 2017-05-23 10:10           ` Nix
  0 siblings, 0 replies; 29+ messages in thread
From: Nix @ 2017-05-23 10:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: Wols Lists, linux-raid

On 23 May 2017, NeilBrown outgrape:

> On Mon, May 22 2017, Nix wrote:
>
>> On 22 May 2017, Wols Lists verbalised:
>>
>> But it's only a few KiB by default! The amount of seeking needed to
>> reshape with such a small intermediate would be fairly horrific. (It was
>> bad enough as it was: the reshape of 7TiB took more than two days,
>> running at under 15MiB/s, though the component drives can all handle
>> 220MiB/s easily. The extra time was spent seeking to and from the
>> backup, it seems.)
>
> If the space before were "only a few KiB", it wouldn't be used.
> You need at least 1 full stripe, typically more.
> Current mdadm leaves several megabytes I think.

I was about to protest and say "oh but it doesn't"... but it helps if
I'm looking at the right machine. It does, but it didn't in 2009 :)

>> spindles will move the data offset such that it is (still) on a chunk or
>> stripe multiple? That's neat, if so, and means I wasted 128MiB on this,
>> uh, 12TiB array. OK I'm not terribly blown away by this, particularly
>> given that I'm wasting the same again inside the bcache partition for
>> the same reason: I'm sure mdadm won't move *that* data offset.)
>
> Data offset is always moved by a multiple of the chunk size.

Right, so the overlying fs might not be doing full-stripe writes after a
reshape, even if it thinks it is, but it will certainly be doing
chunk-multiple writes.

> When I create a 12-device raid5 on 1TB devices, then examine one of them,
> it says:
>
>     Data Offset : 262144 sectors
>    Unused Space : before=262064 sectors, after=0 sectors
>      Chunk Size : 512K

(which is 21.3... stripes.)

> so there is 130 megabytes of space per device, enough for 255 chunks.
> When mdadm moves the Data Offset to allow a reshape to happen without a
> backup file, it aims to use half the available space.  So it would use
> about 60Meg in about 120 chunks or 720Meg total across all devices.
> This is more than the 500MiB backup file you saw.

Right. The message in the manpage saying that backup files are required
for a level change is obsolete, then (and I probably slowed down my last
reshape by specifying one, since seeking to the backup file at the other
end of the disk would have been *much* slower than seeking to that slack
space before the data offset).


* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-22 21:38   ` Nix
@ 2017-05-23 14:16     ` Wols Lists
  2017-05-23 15:00       ` Nix
  2017-05-24  1:24     ` NeilBrown
  1 sibling, 1 reply; 29+ messages in thread
From: Wols Lists @ 2017-05-23 14:16 UTC (permalink / raw)
  To: Nix, NeilBrown; +Cc: linux-raid

On 22/05/17 22:38, Nix wrote:
> Everything else hangs the same way, too. This was surprising enough that
> I double-checked to be sure the patch was applied: it was. I suspect the
> deadlock is somewhat different than you supposed... (and quite possibly
> not a race at all, or I wouldn't be hitting it so consistently, every
> time. I mean, I only need to miss it *once* and I'll have reshaped... :) )
> 
> It seems I can reproduce this on demand, so if you want to throw a patch
> with piles of extra printks my way, feel free.

Could this be the long-standing bug that hangs raid5 and raid6 reshapes?
I don't remember it being found and fixed. iirc (it's too long ago) it
hit people adding a drive to a raid5 (occasionally a raid6), and was
fairly easy to get round. Again, it's too long ago, but iirc stopping
and restarting the reshape usually fixed it.

Cheers,
Wol


* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-23  1:39       ` NeilBrown
@ 2017-05-23 14:47         ` Wols Lists
  2017-05-24  1:50           ` NeilBrown
  0 siblings, 1 reply; 29+ messages in thread
From: Wols Lists @ 2017-05-23 14:47 UTC (permalink / raw)
  To: NeilBrown, Nix; +Cc: linux-raid

On 23/05/17 02:39, NeilBrown wrote:
> On Mon, May 22 2017, Wols Lists wrote:
> 
>> On 22/05/17 16:30, Nix wrote:
>>> I'll give it a try -- I hit it twice in succession, once with a
>>> --backup-file, once without. Since mdadm does not warn about the lack of
>>> a --backup-file, I guess the statement in the manual that it is
>>> essential to provide one when changing RAID levels is untrue: I suspect
>>> that it's necessary *if* you're not increasing the number of disks at
>>> the same time, but since I'm growing into a spare, adding a
>>> --backup-file only slows it down?
>>
>> I did discuss this with Neil while I wrote it, so I hope I got it right :-)
>>
>> https://raid.wiki.kernel.org/index.php/A_guide_to_mdadm#Array_internals_and_how_it_affects_mdadm
>>
>> aiui, provided you're using a v1 superblock, the data offset means there
>>> is spare space on the drives precisely for the purpose (one of them at
>> least) of keeping a backup. So the reshape will start reshaping into the
>> spare space and eliminate the need for the backup - the new version of
>> the stripe will be safely written before the space occupied by the old
>> stripe is required.
>>
>> Cheers,
>> Wol
> 
> Proof-reading time.  I'll be very picky.  You'll ignore places where you
> think my pickiness isn't helpful.
> 
>> The first arrays did not have a superblock, and were declared in
>> mdadm.conf.
> 
> The first arrays (linear and raid0 only) did not have a superblock and
> were declared in "raidtab" which was managed by the deprecated
> "raid-tools" which have been replaced by mdadm.
> 
> mdadm does now allow you to declare no-superblock arrays in mdadm.
> You can build them with "mdadm --build .....", but that is all.
> 
>> and the backup area when reshaping an array.
> 
> Not quite.  The whole point of being able to move the data offset
> is that a backup isn't needed.  Maybe it could read:
> 
>  ... and some extra space so that the data areas can be moved forward or
>  backward on the device by several stripes.  This is useful when
>  reshaping.
> 
>> All operations that involve moving data around are called reshapes,
>> and require some form of backup.
> 
>  All operations that involve moving data around are called reshapes and
>  need to be careful only to write to a region of the device which
>  can safely be corrupted.  If a system crash happens during a reshape,
>  the region that was being written to must be considered to contain
>  garbage.
>  The preferred mechanism is to relocate the data by a few stripes,
>  reading data from a region of the disk that contains live data, and
>  writing it into a sliding window that is otherwise unused.  After a few
>  stripes have been copied, the metadata in the superblock is updated, so
>  that the newly written data is now "live" and the location it was
>  copied from is now "unused" and can have more stripes copied into it.
>  This process results in the Data Offset being moved by several
>  megabytes, either forward or backward, and it is to allow for this to
>  happen that the Data Offset is set to several to many megabytes when an
>  array is created.
> 
>  If it is not possible to move the Data Offset, either because there is
>  no room or because the v0.90 superblock is in use, mdadm will take a
>  backup of the region of data being reshaped before it allows the
>  reshape to progress.  In the event of a crash, mdadm will restore this
>  data over the potentially corrupted region of the array before starting
>  the array.
> 
> Your current text already covers some of this, and there probably isn't a
> need to replace it all.
> But I think it is important not to talk about the reserved space in the
>> devices as a backup space.
> 
Thanks. I've updated the section. I've taken pretty much everything
you've said, but rewritten it in my own words, so that the page remains
consistent in style. I hope I haven't missed anything.

It's thrown up two little points I've noted on the page - if a v1.0
mirror has its offset changed, does this break the linux boot? (that's
the "without an initramfs" boot that doesn't assemble the mirror until
after the kernel is running).

And if you shut down cleanly during a backup-file reshape, is this fact
noted or does mdadm just assume any shutdown is dirty and use the backup
file anyway?

Cheers,
Wol



* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-23 14:16     ` Wols Lists
@ 2017-05-23 15:00       ` Nix
  0 siblings, 0 replies; 29+ messages in thread
From: Nix @ 2017-05-23 15:00 UTC (permalink / raw)
  To: Wols Lists; +Cc: NeilBrown, linux-raid

On 23 May 2017, Wols Lists told this:

> On 22/05/17 22:38, Nix wrote:
>> Everything else hangs the same way, too. This was surprising enough that
>> I double-checked to be sure the patch was applied: it was. I suspect the
>> deadlock is somewhat different than you supposed... (and quite possibly
>> not a race at all, or I wouldn't be hitting it so consistently, every
>> time. I mean, I only need to miss it *once* and I'll have reshaped... :) )
>> 
>> It seems I can reproduce this on demand, so if you want to throw a patch
>> with piles of extra printks my way, feel free.
>
> Could this be the long-standing bug that hangs raid5 and raid6 reshapes?
> I don't remember it being found and fixed. iirc (it's too long ago) it
> hit people adding a drive to a raid5 (occasionally a raid6), and was
> fairly easy to get round. Again it's too long ago but stopping and
> starting the reshape fixed it usually iirc.

It hangs access to the whole drive irretrievably, immediately on
starting the reshape (it can't happen at any later time because it is
associated with writing out the level change itself: thank goodness, it
happens *before* the write, not afterwards). I don't know if the hangs
you refer to have similar symptoms.

-- 
NULL && (void)


* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-22 21:38   ` Nix
  2017-05-23 14:16     ` Wols Lists
@ 2017-05-24  1:24     ` NeilBrown
  2017-05-24 13:28       ` Nix
                         ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: NeilBrown @ 2017-05-24  1:24 UTC (permalink / raw)
  To: Nix; +Cc: linux-raid


On Mon, May 22 2017, Nix wrote:

> On 22 May 2017, NeilBrown told this:
>
>> Probably something like this:
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index f6ae1d67bcd0..dbca31be22a1 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws)
>>   */
>>  void md_check_recovery(struct mddev *mddev)
>>  {
>> -	if (mddev->suspended)
>> -		return;
>>  
>>  	if (mddev->bitmap)
>>  		bitmap_daemon_work(mddev);
>> @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev)
>>  		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
>>  
>>  		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
>> +		    mddev->suspended ||
>>  		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
>>  			goto not_running;
>>  		/* no recovery is running.
>>
>> though it's late so don't trust anything I write.
>>
>> If you try again it will almost certainly succeed.  I suspect this is a
>> hard race to hit - well done!!!
>
> Definitely not a hard race to hit :( I just hit it again with this
> patch.
>
> Absolutely identical hang:
>
> [  495.833520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  495.840618] mdadm           D    0  2700   2537 0x00000000
> [  495.847762] Call Trace:
> [  495.854825]  __schedule+0x290/0x810
> [  495.861905]  schedule+0x36/0x80
> [  495.868934]  mddev_suspend+0xb3/0xe0
> [  495.875926]  ? wake_atomic_t_function+0x60/0x60
> [  495.882976]  level_store+0x1a7/0x6c0
> [  495.889953]  ? md_ioctl+0xb7/0x1c10
> [  495.896901]  ? putname+0x53/0x60
> [  495.903807]  md_attr_store+0x83/0xc0
> [  495.910684]  sysfs_kf_write+0x37/0x40
> [  495.917547]  kernfs_fop_write+0x110/0x1a0
> [  495.924429]  __vfs_write+0x28/0x120
> [  495.931270]  ? kernfs_iop_get_link+0x172/0x1e0
> [  495.938126]  ? __alloc_fd+0x3f/0x170
> [  495.944906]  vfs_write+0xb6/0x1d0
> [  495.951646]  SyS_write+0x46/0xb0
> [  495.958338]  entry_SYSCALL_64_fastpath+0x13/0x94
>
> Everything else hangs the same way, too. This was surprising enough that
> I double-checked to be sure the patch was applied: it was. I suspect the
> deadlock is somewhat different than you supposed... (and quite possibly
> not a race at all, or I wouldn't be hitting it so consistently, every
> time. I mean, I only need to miss it *once* and I'll have reshaped... :) )
>
> It seems I can reproduce this on demand, so if you want to throw a patch
> with piles of extra printks my way, feel free.

Did you have md_write_start being called by syslog-ng again?
I wonder what syslog is logging - presumably something about the reshape
starting.
If you kill syslog-ng, can you start the reshape?

Alternatively, this might do it.
I think the root problem is that it isn't safe to call mddev_suspend()
while holding the reconfig_mutex.
For complete safety I probably need to move the request_module() call
earlier, as that could block if a device was suspended (no memory
allocation allowed while device is suspended).
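
To make the cycle concrete, here is a toy userspace model of what I
think is happening.  The names mirror md.c, but this is only a sketch
of the hypothesis, not kernel code: nothing here ever clears the
pending flag, because the md_check_recovery() analogue bails out once
"suspended" is set, so the two threads wait on each other forever.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int active_io = 1;		/* writer already entered make_request */
static int suspended = 0;
static int sb_change_pending = 1;	/* set by the writer; needs clearing */

static void *writer(void *unused)	/* a thread parked in md_write_start() */
{
	pthread_mutex_lock(&lock);
	/* Only the md_check_recovery() analogue would clear this, and it
	 * bails out once "suspended" is set -- so we sleep forever while
	 * still being counted in active_io. */
	while (sb_change_pending)
		pthread_cond_wait(&cond, &lock);
	active_io--;			/* never reached */
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void *reshaper(void *unused)	/* level_store() -> mddev_suspend() */
{
	pthread_mutex_lock(&lock);
	suspended = 1;
	while (active_io)		/* waits on the writer, who waits on us */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&b, NULL, writer, NULL);
	pthread_create(&a, NULL, reshaper, NULL);
	fprintf(stderr, "both threads are now waiting on each other\n");
	pthread_join(a, NULL);		/* never returns */
	return 0;
}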

Thanks,
NeilBrown



diff --git a/drivers/md/md.c b/drivers/md/md.c
index 10367ffe92e3..a7b9c0576479 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -324,8 +324,12 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 void mddev_suspend(struct mddev *mddev)
 {
 	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
+
 	if (mddev->suspended++)
 		return;
+#ifdef CONFIG_LOCKDEP
+	WARN_ON_ONCE(debug_locks && lockdep_is_held(&mddev->reconfig_mutex));
+#endif
 	synchronize_rcu();
 	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
 	mddev->pers->quiesce(mddev, 1);
@@ -3594,9 +3598,12 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	if (slen == 0 || slen >= sizeof(clevel))
 		return -EINVAL;
 
+	mddev_suspend(mddev);
 	rv = mddev_lock(mddev);
-	if (rv)
+	if (rv) {
+		mddev_resume(mddev);
 		return rv;
+	}
 
 	if (mddev->pers == NULL) {
 		strncpy(mddev->clevel, buf, slen);
@@ -3687,7 +3694,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	}
 
 	/* Looks like we have a winner */
-	mddev_suspend(mddev);
 	mddev_detach(mddev);
 
 	spin_lock(&mddev->lock);
@@ -3771,13 +3777,13 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	blk_set_stacking_limits(&mddev->queue->limits);
 	pers->run(mddev);
 	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
-	mddev_resume(mddev);
 	if (!mddev->thread)
 		md_update_sb(mddev, 1);
 	sysfs_notify(&mddev->kobj, NULL, "level");
 	md_new_event(mddev);
 	rv = len;
 out_unlock:
+	mddev_resume(mddev);
 	mddev_unlock(mddev);
 	return rv;
 }
@@ -4490,6 +4496,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 		int err;
 		if (mddev->pers->start_reshape == NULL)
 			return -EINVAL;
+		mddev_suspend(mddev);
 		err = mddev_lock(mddev);
 		if (!err) {
 			if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
@@ -4500,6 +4507,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 			}
 			mddev_unlock(mddev);
 		}
+		mddev_resume(mddev);
 		if (err)
 			return err;
 		sysfs_notify(&mddev->kobj, NULL, "degraded");

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-23 14:47         ` Wols Lists
@ 2017-05-24  1:50           ` NeilBrown
  0 siblings, 0 replies; 29+ messages in thread
From: NeilBrown @ 2017-05-24  1:50 UTC (permalink / raw)
  To: Wols Lists, Nix; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1239 bytes --]

On Tue, May 23 2017, Wols Lists wrote:

> Thanks. I've updated the section. I've taken pretty much everything
> you've said, but rewritten it in my own words, so that the page remains
> consistent in style. I hope I haven't missed anything.
>
> It's thrown up two little points I've noted on the page: if a v1.0
> mirror has its data offset changed, does this break the Linux boot?
> (That's the "without an initramfs" boot that doesn't assemble the mirror
> until after the kernel is running.)

Yes, if a v1.0 array has a data offset that isn't zero, you won't be able
to boot off a member device.
But there is no good reason that I can think of to change the data
offset of a device in a RAID1.

>
> And if you shut down cleanly during a backup-file reshape, is this fact
> noted or does mdadm just assume any shutdown is dirty and use the backup
> file anyway?

The backup file contains UUID and version info.
When mdadm determines that part of the backup is no longer needed, it
invalidates the backup before enabling writes to that region of the
array.
So it only uses the file if there is good reason to suspect that it is
both necessary and safe. (It may not always be necessary.  It should
always be safe.)
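
Schematically, the ordering is invalidate-then-write.  A sketch, with
all names invented for illustration (this is not mdadm's actual code):

struct backup_region {
	unsigned long long start, len;	/* region of the array backed up */
	int valid;			/* is the on-disk backup usable? */
};

static int flush_backup_metadata(struct backup_region *r)
{
	/* stand-in for fsync()ing the backup file's header */
	(void)r;
	return 0;
}

/* Before the reshape may write into the region covered by 'r', the
 * backup is marked invalid on stable storage, so a crash can never
 * restore stale backup data over freshly-written array data. */
static int enable_writes_to(struct backup_region *r)
{
	r->valid = 0;
	return flush_backup_metadata(r);
}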

NeilBrown


>
> Cheers,
> Wol

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-24  1:24     ` NeilBrown
@ 2017-05-24 13:28       ` Nix
  2017-05-25  1:31         ` NeilBrown
  2017-05-24 19:42       ` Nix
  2017-05-24 22:57       ` Shaohua Li
  2 siblings, 1 reply; 29+ messages in thread
From: Nix @ 2017-05-24 13:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 24 May 2017, NeilBrown uttered the following:

> On Mon, May 22 2017, Nix wrote:
>
>> Everything else hangs the same way, too. This was surprising enough that
>> I double-checked to be sure the patch was applied: it was. I suspect the
>> deadlock is somewhat different than you supposed... (and quite possibly
>> not a race at all, or I wouldn't be hitting it so consistently, every
>> time. I mean, I only need to miss it *once* and I'll have reshaped... :) )
>>
>> It seems I can reproduce this on demand, so if you want to throw a patch
>> with piles of extra printks my way, feel free.
>
> Did you have md_write_start being called by syslog-ng again?

Yeah.

> I wonder what syslog is logging - presumably something about the reshape
> starting.

Almost certainly.

> If you kill syslog-ng, can you start the reshape?

If I kill syslog-ng the entire network gets very unhappy: in very little
time, most of userspace across all of it blocks solid waiting for a
syslog-ng that isn't there. It's the primary log host... :/ I might
switch back to the old log host (primary until three weeks ago) and try
again, but honestly the amount of ongoing traffic on this array is such
that I suspect *something* will creep in no matter what you do. (The
only reason process accounting's not going there is because I'm dumping
it on the RAID-0 array I already reshaped.)

(e.g. this time I also saw write traffic from mysqld. God knows what it
was doing: the database there is about the most idle database ever --
which is why I don't care about its being on RAID-5/6 -- and I know for
sure that it currently has a grand total of zero clients connected.)

Plus there's the usual pile of ongoing "you are still breathing so I'll
do more backlogged stuff" XFS metadata updates, rmap traffic, etc.

> Alternatively, this might do it.
> I think the root problem is that it isn't safe to call mddev_suspend()
> while holding the reconfig_mutex.
> For complete safety I probably need to move the request_module() call
> earlier, as that could block if a device was suspended (no memory
> allocation allowed while device is suspended).

I'll give this a try!

(I'm not sure what to do if it *works* -- how do I test any later
changes? I might reshape back to RAID-5 + spare again just so I can test
later stuff, but that would take ages: the array is over 14TiB...)

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-24  1:24     ` NeilBrown
  2017-05-24 13:28       ` Nix
@ 2017-05-24 19:42       ` Nix
  2017-05-24 22:57       ` Shaohua Li
  2 siblings, 0 replies; 29+ messages in thread
From: Nix @ 2017-05-24 19:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 24 May 2017, NeilBrown said:
> Alternatively, this might do it.

Bingo! Ticking away nicely:

[  147.538274] md/raid:md125: device sda3 operational as raid disk 0
[  147.574356] md/raid:md125: device sdd3 operational as raid disk 3
[  147.586482] md/raid:md125: device sdc3 operational as raid disk 2
[  147.598571] md/raid:md125: device sdb3 operational as raid disk 1
[  147.613949] md/raid:md125: raid level 6 active with 4 out of 5 devices, algorithm 18
[  147.776155] md: reshape of RAID array md125

md125 : active raid6 sda3[0] sdf3[5] sdd3[4] sdc3[2] sdb3[1]
      15391689216 blocks super 1.2 level 6, 512k chunk, algorithm 18 [5/4] [UUUU_]
      [>....................]  reshape =  0.0% (2669056/5130563072) finish=1363.5min speed=62678K/sec

(wow, this is a lot faster: only 1363 min at 70MiB/s, versus the
--backup-file reshape of an array half the size, which took >2500 min at
12MiB/s. I mean, yes, that one was at the slow end of the disk, but they
don't differ in speed *that* much.)
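
(The finish estimate checks out: (5130563072 - 2669056) 1K blocks at
62678 K/sec is ~81813 s, i.e. ~1364 min.)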

Machine still perfectly responsive, as if no reshape was happening at
all. (But then, I've got used to that. The last time I saw any md resync-
induced delays was long ago when I had a sym53c75 in the loop with an
I/O rate of 10MiB/s. I think delays are unavoidable with something
that slow in there.)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-24  1:24     ` NeilBrown
  2017-05-24 13:28       ` Nix
  2017-05-24 19:42       ` Nix
@ 2017-05-24 22:57       ` Shaohua Li
  2017-05-25  1:30         ` NeilBrown
  2017-05-28 23:17         ` NeilBrown
  2 siblings, 2 replies; 29+ messages in thread
From: Shaohua Li @ 2017-05-24 22:57 UTC (permalink / raw)
  To: NeilBrown; +Cc: Nix, linux-raid

On Wed, May 24, 2017 at 11:24:21AM +1000, Neil Brown wrote:
> On Mon, May 22 2017, Nix wrote:
> 
> > On 22 May 2017, NeilBrown told this:
> >
> >> Probably something like this:
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index f6ae1d67bcd0..dbca31be22a1 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws)
> >>   */
> >>  void md_check_recovery(struct mddev *mddev)
> >>  {
> >> -	if (mddev->suspended)
> >> -		return;
> >>  
> >>  	if (mddev->bitmap)
> >>  		bitmap_daemon_work(mddev);
> >> @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev)
> >>  		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
> >>  
> >>  		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> >> +		    mddev->suspended ||
> >>  		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
> >>  			goto not_running;
> >>  		/* no recovery is running.
> >>
> >> though it's late so don't trust anything I write.
> >>
> >> If you try again it will almost certainly succeed.  I suspect this is a
> >> hard race to hit - well done!!!
> >
> > Definitely not a hard race to hit :( I just hit it again with this
> > patch.
> >
> > Absolutely identical hang:
> >
> > [  495.833520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  495.840618] mdadm           D    0  2700   2537 0x00000000
> > [  495.847762] Call Trace:
> > [  495.854825]  __schedule+0x290/0x810
> > [  495.861905]  schedule+0x36/0x80
> > [  495.868934]  mddev_suspend+0xb3/0xe0
> > [  495.875926]  ? wake_atomic_t_function+0x60/0x60
> > [  495.882976]  level_store+0x1a7/0x6c0
> > [  495.889953]  ? md_ioctl+0xb7/0x1c10
> > [  495.896901]  ? putname+0x53/0x60
> > [  495.903807]  md_attr_store+0x83/0xc0
> > [  495.910684]  sysfs_kf_write+0x37/0x40
> > [  495.917547]  kernfs_fop_write+0x110/0x1a0
> > [  495.924429]  __vfs_write+0x28/0x120
> > [  495.931270]  ? kernfs_iop_get_link+0x172/0x1e0
> > [  495.938126]  ? __alloc_fd+0x3f/0x170
> > [  495.944906]  vfs_write+0xb6/0x1d0
> > [  495.951646]  SyS_write+0x46/0xb0
> > [  495.958338]  entry_SYSCALL_64_fastpath+0x13/0x94
> >
> > Everything else hangs the same way, too. This was surprising enough that
> > I double-checked to be sure the patch was applied: it was. I suspect the
> > deadlock is somewhat different than you supposed... (and quite possibly
> > not a race at all, or I wouldn't be hitting it so consistently, every
> > time. I mean, I only need to miss it *once* and I'll have reshaped... :) )
> >
> > It seems I can reproduce this on demand, so if you want to throw a patch
> > with piles of extra printks my way, feel free.
> 
> Did you have md_write_start being called by syslog-ng again?
> I wonder what syslog is logging - presumably something about the reshape
> starting.
> If you kill syslog-ng, can you start the reshape?
> 
> Alternatively, this might do it.
> I think the root problem is that it isn't safe to call mddev_suspend()
> while holding the reconfig_mutex.
> For complete safety I probably need to move the request_module() call
> earlier, as that could block if a device was suspended (no memory
> allocation allowed while device is suspended).
> 
> Thanks,
> NeilBrown
> 
> 
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 10367ffe92e3..a7b9c0576479 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -324,8 +324,12 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>  void mddev_suspend(struct mddev *mddev)
>  {
>  	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
> +
>  	if (mddev->suspended++)
>  		return;
> +#ifdef CONFIG_LOCKDEP
> +	WARN_ON_ONCE(debug_locks && lockdep_is_held(&mddev->reconfig_mutex));
> +#endif
>  	synchronize_rcu();
>  	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
>  	mddev->pers->quiesce(mddev, 1);
> @@ -3594,9 +3598,12 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>  	if (slen == 0 || slen >= sizeof(clevel))
>  		return -EINVAL;
>  
> +	mddev_suspend(mddev);
>  	rv = mddev_lock(mddev);
> -	if (rv)
> +	if (rv) {
> +		mddev_resume(mddev);
>  		return rv;
> +	}
>  
>  	if (mddev->pers == NULL) {
>  		strncpy(mddev->clevel, buf, slen);
> @@ -3687,7 +3694,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>  	}
>  
>  	/* Looks like we have a winner */
> -	mddev_suspend(mddev);
>  	mddev_detach(mddev);
>  
>  	spin_lock(&mddev->lock);
> @@ -3771,13 +3777,13 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>  	blk_set_stacking_limits(&mddev->queue->limits);
>  	pers->run(mddev);
>  	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> -	mddev_resume(mddev);
>  	if (!mddev->thread)
>  		md_update_sb(mddev, 1);
>  	sysfs_notify(&mddev->kobj, NULL, "level");
>  	md_new_event(mddev);
>  	rv = len;
>  out_unlock:
> +	mddev_resume(mddev);
>  	mddev_unlock(mddev);
>  	return rv;
>  }
> @@ -4490,6 +4496,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>  		int err;
>  		if (mddev->pers->start_reshape == NULL)
>  			return -EINVAL;
> +		mddev_suspend(mddev);
>  		err = mddev_lock(mddev);
>  		if (!err) {
>  			if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> @@ -4500,6 +4507,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>  			}
>  			mddev_unlock(mddev);
>  		}
> +		mddev_resume(mddev);
>  		if (err)
>  			return err;
>  		sysfs_notify(&mddev->kobj, NULL, "degraded");

The analysis makes a lot of sense, thanks! The patch doesn't look like it
solves the problem though, because check_recovery will not write the
superblock if suspended isn't 0.

I'm thinking of an alternative way to fix this issue. A request which stalls
in md_write_start really isn't an active IO. We could add another counter in
md_write_start to record stalled requests there. Then in mddev_suspend, the
wait event will wait for atomic_read(&mddev->active_io) == the_new_counter.
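
Roughly this, as a sketch of the idea rather than a tested patch (the
stalled_writes field is invented for illustration):

/* in md_write_start(), around the existing wait: */
	atomic_inc(&mddev->stalled_writes);
	wait_event(mddev->sb_wait,
		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
	atomic_dec(&mddev->stalled_writes);

/* in mddev_suspend(), instead of waiting for active_io == 0: */
	wait_event(mddev->sb_wait,
		   atomic_read(&mddev->active_io) ==
		   atomic_read(&mddev->stalled_writes));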

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-24 22:57       ` Shaohua Li
@ 2017-05-25  1:30         ` NeilBrown
  2017-05-25  1:46           ` Shaohua Li
  2017-05-28 23:17         ` NeilBrown
  1 sibling, 1 reply; 29+ messages in thread
From: NeilBrown @ 2017-05-25  1:30 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Nix, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3449 bytes --]

On Wed, May 24 2017, Shaohua Li wrote:

> On Wed, May 24, 2017 at 11:24:21AM +1000, Neil Brown wrote:
>> 
>> 
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 10367ffe92e3..a7b9c0576479 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -324,8 +324,12 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>>  void mddev_suspend(struct mddev *mddev)
>>  {
>>  	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
>> +
>>  	if (mddev->suspended++)
>>  		return;
>> +#ifdef CONFIG_LOCKDEP
>> +	WARN_ON_ONCE(debug_locks && lockdep_is_held(&mddev->reconfig_mutex));
>> +#endif
>>  	synchronize_rcu();
>>  	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
>>  	mddev->pers->quiesce(mddev, 1);
>> @@ -3594,9 +3598,12 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>>  	if (slen == 0 || slen >= sizeof(clevel))
>>  		return -EINVAL;
>>  
>> +	mddev_suspend(mddev);
>>  	rv = mddev_lock(mddev);
>> -	if (rv)
>> +	if (rv) {
>> +		mddev_resume(mddev);
>>  		return rv;
>> +	}
>>  
>>  	if (mddev->pers == NULL) {
>>  		strncpy(mddev->clevel, buf, slen);
>> @@ -3687,7 +3694,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>>  	}
>>  
>>  	/* Looks like we have a winner */
>> -	mddev_suspend(mddev);
>>  	mddev_detach(mddev);
>>  
>>  	spin_lock(&mddev->lock);
>> @@ -3771,13 +3777,13 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>>  	blk_set_stacking_limits(&mddev->queue->limits);
>>  	pers->run(mddev);
>>  	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
>> -	mddev_resume(mddev);
>>  	if (!mddev->thread)
>>  		md_update_sb(mddev, 1);
>>  	sysfs_notify(&mddev->kobj, NULL, "level");
>>  	md_new_event(mddev);
>>  	rv = len;
>>  out_unlock:
>> +	mddev_resume(mddev);
>>  	mddev_unlock(mddev);
>>  	return rv;
>>  }
>> @@ -4490,6 +4496,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>  		int err;
>>  		if (mddev->pers->start_reshape == NULL)
>>  			return -EINVAL;
>> +		mddev_suspend(mddev);
>>  		err = mddev_lock(mddev);
>>  		if (!err) {
>>  			if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>> @@ -4500,6 +4507,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>  			}
>>  			mddev_unlock(mddev);
>>  		}
>> +		mddev_resume(mddev);
>>  		if (err)
>>  			return err;
>>  		sysfs_notify(&mddev->kobj, NULL, "degraded");
>
> The analysis makes a lot of sense, thanks! The patch doesn't look like it
> solves the problem though, because check_recovery will not write the
> superblock if suspended isn't 0.

Why does that matter?  In what case do you need the superblock to be
written, but it doesn't happen?

check_recovery won't write the superblock while the mddev is locked
either, and it is locked for most of level_store().
When level_store() finishes, it unlocks the device and that will trigger
md_check_recovery() to be run, and the metadata will be written then.
I don't think there is a particular need to update it before then.

Thanks,
NeilBrown


>
> I'm thinking of an alternative way to fix this issue. A request which stalls
> in md_write_start really isn't an active IO. We could add another counter in
> md_write_start to record stalled requests there. Then in mddev_suspend, the
> wait event will wait for atomic_read(&mddev->active_io) == the_new_counter.
>
> Thanks,
> Shaohua

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-24 13:28       ` Nix
@ 2017-05-25  1:31         ` NeilBrown
  2017-05-25 12:14           ` Nix
  0 siblings, 1 reply; 29+ messages in thread
From: NeilBrown @ 2017-05-25  1:31 UTC (permalink / raw)
  To: Nix; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 476 bytes --]

On Wed, May 24 2017, Nix wrote:
>
> (I'm not sure what to do if it *works* -- how do I test any later
> changes? I might reshape back to RAID-5 + spare again just so I can test
> later stuff, but that would take ages: the array is over 14TiB...)
>

If it works (which it did), then we'll have enough understanding that it
should be easy to simulate the problem accurately in a VM.
I don't think there is any need for you to experiment further on your
data.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-25  1:30         ` NeilBrown
@ 2017-05-25  1:46           ` Shaohua Li
  2017-05-26  3:23             ` NeilBrown
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2017-05-25  1:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: Nix, linux-raid

On Thu, May 25, 2017 at 11:30:12AM +1000, Neil Brown wrote:
> On Wed, May 24 2017, Shaohua Li wrote:
> 
> > On Wed, May 24, 2017 at 11:24:21AM +1000, Neil Brown wrote:
> >> 
> >> 
> >> 
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index 10367ffe92e3..a7b9c0576479 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -324,8 +324,12 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
> >>  void mddev_suspend(struct mddev *mddev)
> >>  {
> >>  	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
> >> +
> >>  	if (mddev->suspended++)
> >>  		return;
> >> +#ifdef CONFIG_LOCKDEP
> >> +	WARN_ON_ONCE(debug_locks && lockdep_is_held(&mddev->reconfig_mutex));
> >> +#endif
> >>  	synchronize_rcu();
> >>  	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
> >>  	mddev->pers->quiesce(mddev, 1);
> >> @@ -3594,9 +3598,12 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
> >>  	if (slen == 0 || slen >= sizeof(clevel))
> >>  		return -EINVAL;
> >>  
> >> +	mddev_suspend(mddev);
> >>  	rv = mddev_lock(mddev);
> >> -	if (rv)
> >> +	if (rv) {
> >> +		mddev_resume(mddev);
> >>  		return rv;
> >> +	}
> >>  
> >>  	if (mddev->pers == NULL) {
> >>  		strncpy(mddev->clevel, buf, slen);
> >> @@ -3687,7 +3694,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
> >>  	}
> >>  
> >>  	/* Looks like we have a winner */
> >> -	mddev_suspend(mddev);
> >>  	mddev_detach(mddev);
> >>  
> >>  	spin_lock(&mddev->lock);
> >> @@ -3771,13 +3777,13 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
> >>  	blk_set_stacking_limits(&mddev->queue->limits);
> >>  	pers->run(mddev);
> >>  	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> >> -	mddev_resume(mddev);
> >>  	if (!mddev->thread)
> >>  		md_update_sb(mddev, 1);
> >>  	sysfs_notify(&mddev->kobj, NULL, "level");
> >>  	md_new_event(mddev);
> >>  	rv = len;
> >>  out_unlock:
> >> +	mddev_resume(mddev);
> >>  	mddev_unlock(mddev);
> >>  	return rv;
> >>  }
> >> @@ -4490,6 +4496,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
> >>  		int err;
> >>  		if (mddev->pers->start_reshape == NULL)
> >>  			return -EINVAL;
> >> +		mddev_suspend(mddev);
> >>  		err = mddev_lock(mddev);
> >>  		if (!err) {
> >>  			if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> >> @@ -4500,6 +4507,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
> >>  			}
> >>  			mddev_unlock(mddev);
> >>  		}
> >> +		mddev_resume(mddev);
> >>  		if (err)
> >>  			return err;
> >>  		sysfs_notify(&mddev->kobj, NULL, "degraded");
> >
> > The analysis makes a lot of sense, thanks! The patch doesn't look like it
> > solves the problem though, because check_recovery will not write the
> > superblock if suspended isn't 0.
> 
> Why does that matter?  In what case do you need the superblock to be
> written, but it doesn't happen?
> 
> check_recovery won't write the superblock while the mddev is locked
> either, and it is locked for most of level_store().
> When level_store() finishes, it unlocks the device and that will trigger
> md_check_recovery() to be run, and the metadata will be written then.
> I don't think there is a particular need to update it before then.

I'm confused: md_write_start is waiting for MD_SB_CHANGE_PENDING to be
cleared, which is done in check_recovery. Your previous email describes this
too.
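
For reference, the wait in question, at the tail of md_write_start() in
4.11-era md.c:

	wait_event(mddev->sb_wait,
		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));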

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-25  1:31         ` NeilBrown
@ 2017-05-25 12:14           ` Nix
  0 siblings, 0 replies; 29+ messages in thread
From: Nix @ 2017-05-25 12:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 25 May 2017, NeilBrown spake thusly:

> On Wed, May 24 2017, Nix wrote:
>>
>> (I'm not sure what to do if it *works* -- how do I test any later
>> changes? I might reshape back to RAID-5 + spare again just so I can test
>> later stuff, but that would take ages: the array is over 14TiB...)
>
> If it works (which it did), then we'll have enough understanding that it
> should be easy to simulate the problem accurately in a VM.
> I don't think there is any need for you to experiment further on your
> data.

If you need me to, I can -- this machine isn't commissioned yet (though
the plan *is* to do so next weekend, I can put it off: I've put it off
before...)

Plus I have plenty of backups. :)

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-25  1:46           ` Shaohua Li
@ 2017-05-26  3:23             ` NeilBrown
  2017-05-26 16:40               ` Shaohua Li
  0 siblings, 1 reply; 29+ messages in thread
From: NeilBrown @ 2017-05-26  3:23 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Nix, linux-raid

[-- Attachment #1: Type: text/plain, Size: 7908 bytes --]

On Wed, May 24 2017, Shaohua Li wrote:

> On Thu, May 25, 2017 at 11:30:12AM +1000, Neil Brown wrote:
>> On Wed, May 24 2017, Shaohua Li wrote:
>> 
>> > On Wed, May 24, 2017 at 11:24:21AM +1000, Neil Brown wrote:
>> >> 
>> >> 
>> >> 
>> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> >> index 10367ffe92e3..a7b9c0576479 100644
>> >> --- a/drivers/md/md.c
>> >> +++ b/drivers/md/md.c
>> >> @@ -324,8 +324,12 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>> >>  void mddev_suspend(struct mddev *mddev)
>> >>  {
>> >>  	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
>> >> +
>> >>  	if (mddev->suspended++)
>> >>  		return;
>> >> +#ifdef CONFIG_LOCKDEP
>> >> +	WARN_ON_ONCE(debug_locks && lockdep_is_held(&mddev->reconfig_mutex));
>> >> +#endif
>> >>  	synchronize_rcu();
>> >>  	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
>> >>  	mddev->pers->quiesce(mddev, 1);
>> >> @@ -3594,9 +3598,12 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>> >>  	if (slen == 0 || slen >= sizeof(clevel))
>> >>  		return -EINVAL;
>> >>  
>> >> +	mddev_suspend(mddev);
>> >>  	rv = mddev_lock(mddev);
>> >> -	if (rv)
>> >> +	if (rv) {
>> >> +		mddev_resume(mddev);
>> >>  		return rv;
>> >> +	}
>> >>  
>> >>  	if (mddev->pers == NULL) {
>> >>  		strncpy(mddev->clevel, buf, slen);
>> >> @@ -3687,7 +3694,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>> >>  	}
>> >>  
>> >>  	/* Looks like we have a winner */
>> >> -	mddev_suspend(mddev);
>> >>  	mddev_detach(mddev);
>> >>  
>> >>  	spin_lock(&mddev->lock);
>> >> @@ -3771,13 +3777,13 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>> >>  	blk_set_stacking_limits(&mddev->queue->limits);
>> >>  	pers->run(mddev);
>> >>  	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
>> >> -	mddev_resume(mddev);
>> >>  	if (!mddev->thread)
>> >>  		md_update_sb(mddev, 1);
>> >>  	sysfs_notify(&mddev->kobj, NULL, "level");
>> >>  	md_new_event(mddev);
>> >>  	rv = len;
>> >>  out_unlock:
>> >> +	mddev_resume(mddev);
>> >>  	mddev_unlock(mddev);
>> >>  	return rv;
>> >>  }
>> >> @@ -4490,6 +4496,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>> >>  		int err;
>> >>  		if (mddev->pers->start_reshape == NULL)
>> >>  			return -EINVAL;
>> >> +		mddev_suspend(mddev);
>> >>  		err = mddev_lock(mddev);
>> >>  		if (!err) {
>> >>  			if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>> >> @@ -4500,6 +4507,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>> >>  			}
>> >>  			mddev_unlock(mddev);
>> >>  		}
>> >> +		mddev_resume(mddev);
>> >>  		if (err)
>> >>  			return err;
>> >>  		sysfs_notify(&mddev->kobj, NULL, "degraded");
>> >
>> > The analysis makes a lot of sense, thanks! The patch doesn't look like it
>> > solves the problem though, because check_recovery will not write the
>> > superblock if suspended isn't 0.
>> 
>> Why does that matter?  In what case do you need the superblock to be
>> written, but it doesn't happen?
>> 
>> check_recovery won't write the superblock while the mddev is locked
>> either, and it is locked for most of level_store().
>> When level_store() finishes, it unlocks the device and that will trigger
>> md_check_recovery() to be run, and the metadata will be written then.
>> I don't think there is a particular need to update it before then.
>
> > I'm confused: md_write_start is waiting for MD_SB_CHANGE_PENDING to be
> > cleared, which is done in check_recovery. Your previous email describes
> > this too.

Uhmm.... yes.  I see your point now.  Maybe we need my first patch as
well, so md_check_recovery() can still update the superblock after
->suspended is set.

However, I started looking at making sure mddev_suspend() was never
called with mddev_lock() held, and that isn't straightforward.
Most callers of mddev_suspend() do hold the lock.
The only exceptions I found were:
  dm-raid.c in raid_postsuspend()
  raid5-cache.c in r5c_disable_writeback_async() and
  r5c_journal_mode_set().

It might be easiest to change all of those to lock the mddev, then
do the md_update_sb in mddev_suspend, when needed.

i.e. something like the following.  Thoughts?

Thanks,
NeilBrown


diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 7d893228c40f..db79bd22418b 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3607,8 +3607,11 @@ static void raid_postsuspend(struct dm_target *ti)
 {
 	struct raid_set *rs = ti->private;
 
-	if (!rs->md.suspended)
+	if (!rs->md.suspended) {
+		mddev_lock_nointr(&rs->md);
 		mddev_suspend(&rs->md);
+		mddev_unlock(&rs->md);
+	}
 
 	rs->md.ro = 1;
 }
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 10367ffe92e3..6cbb37a7d445 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -320,14 +320,22 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
  * are completely handled.
  * Once mddev_detach() is called and completes, the module will be
  * completely unused.
+ * The caller must hold the mddev_lock.
+ * mddev_suspend() will update the superblock if it
+ * turns out that something is waiting in md_write_start().
  */
 void mddev_suspend(struct mddev *mddev)
 {
 	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
 	if (mddev->suspended++)
 		return;
+	lockdep_assert_held(&mddev->reconfig_mutex);
+
 	synchronize_rcu();
-	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
+	wait_event_cmd(mddev->sb_wait, atomic_read(&mddev->active_io) == 0,
+		       if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+			       md_update_sb(mddev, 0),
+		);
 	mddev->pers->quiesce(mddev, 1);
 
 	del_timer_sync(&mddev->safemode_timer);
@@ -7971,6 +7979,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 			set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
 			md_wakeup_thread(mddev->thread);
+			wake_up(&mddev->sb_wait);
 			did_change = 1;
 		}
 		spin_unlock(&mddev->lock);
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 4c00bc248287..c231c4a29903 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -682,13 +682,11 @@ static void r5c_disable_writeback_async(struct work_struct *work)
 	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
 		mdname(mddev));
 
-	/* wait superblock change before suspend */
-	wait_event(mddev->sb_wait,
-		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
-
+	mddev_lock_nointr(mddev);
 	mddev_suspend(mddev);
 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
 	mddev_resume(mddev);
+	mddev_unlock(mddev);
 }
 
 static void r5l_submit_current_io(struct r5l_log *log)
@@ -2542,6 +2540,7 @@ int r5c_journal_mode_set(struct mddev *mddev, int mode)
 {
 	struct r5conf *conf = mddev->private;
 	struct r5l_log *log = conf->log;
+	int ret = 0;
 
 	if (!log)
 		return -ENODEV;
@@ -2550,17 +2549,20 @@ int r5c_journal_mode_set(struct mddev *mddev, int mode)
 	    mode > R5C_JOURNAL_MODE_WRITE_BACK)
 		return -EINVAL;
 
+	mddev_lock_nointr(mddev);
 	if (raid5_calc_degraded(conf) > 0 &&
 	    mode == R5C_JOURNAL_MODE_WRITE_BACK)
-		return -EINVAL;
-
-	mddev_suspend(mddev);
-	conf->log->r5c_journal_mode = mode;
-	mddev_resume(mddev);
+		ret = -EINVAL;
+	else {
+		mddev_suspend(mddev);
+		conf->log->r5c_journal_mode = mode;
+		mddev_resume(mddev);
 
-	pr_debug("md/raid:%s: setting r5c cache mode to %d: %s\n",
-		 mdname(mddev), mode, r5c_journal_mode_str[mode]);
-	return 0;
+		pr_debug("md/raid:%s: setting r5c cache mode to %d: %s\n",
+			 mdname(mddev), mode, r5c_journal_mode_str[mode]);
+	}
+	mddev_unlock(mddev);
+	return ret;
 }
 EXPORT_SYMBOL(r5c_journal_mode_set);
 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-26  3:23             ` NeilBrown
@ 2017-05-26 16:40               ` Shaohua Li
  0 siblings, 0 replies; 29+ messages in thread
From: Shaohua Li @ 2017-05-26 16:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: Nix, linux-raid

On Fri, May 26, 2017 at 01:23:06PM +1000, Neil Brown wrote:
> On Wed, May 24 2017, Shaohua Li wrote:
> 
> > On Thu, May 25, 2017 at 11:30:12AM +1000, Neil Brown wrote:
> >> On Wed, May 24 2017, Shaohua Li wrote:
> >> 
> >> > On Wed, May 24, 2017 at 11:24:21AM +1000, Neil Brown wrote:
> >> >> 
> >> >> 
> >> >> 
> >> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> >> index 10367ffe92e3..a7b9c0576479 100644
> >> >> --- a/drivers/md/md.c
> >> >> +++ b/drivers/md/md.c
> >> >> @@ -324,8 +324,12 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
> >> >>  void mddev_suspend(struct mddev *mddev)
> >> >>  {
> >> >>  	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
> >> >> +
> >> >>  	if (mddev->suspended++)
> >> >>  		return;
> >> >> +#ifdef CONFIG_LOCKDEP
> >> >> +	WARN_ON_ONCE(debug_locks && lockdep_is_held(&mddev->reconfig_mutex));
> >> >> +#endif
> >> >>  	synchronize_rcu();
> >> >>  	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
> >> >>  	mddev->pers->quiesce(mddev, 1);
> >> >> @@ -3594,9 +3598,12 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
> >> >>  	if (slen == 0 || slen >= sizeof(clevel))
> >> >>  		return -EINVAL;
> >> >>  
> >> >> +	mddev_suspend(mddev);
> >> >>  	rv = mddev_lock(mddev);
> >> >> -	if (rv)
> >> >> +	if (rv) {
> >> >> +		mddev_resume(mddev);
> >> >>  		return rv;
> >> >> +	}
> >> >>  
> >> >>  	if (mddev->pers == NULL) {
> >> >>  		strncpy(mddev->clevel, buf, slen);
> >> >> @@ -3687,7 +3694,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
> >> >>  	}
> >> >>  
> >> >>  	/* Looks like we have a winner */
> >> >> -	mddev_suspend(mddev);
> >> >>  	mddev_detach(mddev);
> >> >>  
> >> >>  	spin_lock(&mddev->lock);
> >> >> @@ -3771,13 +3777,13 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
> >> >>  	blk_set_stacking_limits(&mddev->queue->limits);
> >> >>  	pers->run(mddev);
> >> >>  	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> >> >> -	mddev_resume(mddev);
> >> >>  	if (!mddev->thread)
> >> >>  		md_update_sb(mddev, 1);
> >> >>  	sysfs_notify(&mddev->kobj, NULL, "level");
> >> >>  	md_new_event(mddev);
> >> >>  	rv = len;
> >> >>  out_unlock:
> >> >> +	mddev_resume(mddev);
> >> >>  	mddev_unlock(mddev);
> >> >>  	return rv;
> >> >>  }
> >> >> @@ -4490,6 +4496,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
> >> >>  		int err;
> >> >>  		if (mddev->pers->start_reshape == NULL)
> >> >>  			return -EINVAL;
> >> >> +		mddev_suspend(mddev);
> >> >>  		err = mddev_lock(mddev);
> >> >>  		if (!err) {
> >> >>  			if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> >> >> @@ -4500,6 +4507,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
> >> >>  			}
> >> >>  			mddev_unlock(mddev);
> >> >>  		}
> >> >> +		mddev_resume(mddev);
> >> >>  		if (err)
> >> >>  			return err;
> >> >>  		sysfs_notify(&mddev->kobj, NULL, "degraded");
> >> >
> >> > The analysis makes a lot of sense, thanks! The patch doesn't look like it
> >> > solves the problem though, because check_recovery will not write the
> >> > superblock if suspended isn't 0.
> >> 
> >> Why does that matter?  In what case do you need the superblock to be
> >> written, but it doesn't happen?
> >> 
> >> check_recovery won't write the superblock while the mddev is locked
> >> either, and it is locked for most of level_store().
> >> When level_store() finishes, it unlocks the device and that will trigger
> >> md_check_recovery() to be run, and the metadata will be written then.
> >> I don't think there is a particular need to update it before then.
> >
> > I'm confused: md_write_start is waiting for MD_SB_CHANGE_PENDING to be
> > cleared, which is done in check_recovery. Your previous email describes
> > this too.
> 
> Uhmm.... yes.  I see your point now.  Maybe we need my first patch as
> well, so md_check_recovery() can still update the superblock after
> ->suspended is set.
> 
> However, I started looking at making sure mddev_suspend() was never
> called with mddev_lock() held, and that isn't straightforward.
> Most callers of mddev_suspend() do hold the lock.
> The only exceptions I found were:
>   dm-raid.c in raid_postsuspend()
>   raid5-cache.c in r5c_disable_writeback_async() and
>   r5c_journal_mode_set().
> 
> It might be easiest to change all of those to lock the mddev, then
> do the md_update_sb in mddev_suspend, when needed.
> 
> i.e. something like the following.  Thoughts?

Looks reasonable. Why not hold the lock for mddev_resume too in dm-raid.c?
At least it would protect ->suspended. dm-raid.c is a little nonconforming
though: everywhere else the pattern is lock, suspend, resume, unlock, and
dm-raid is the exception. Not quite sure if this is a problem though.

While this fix should work well, did you consider my previous proposal? I
think it should work too, and it looks simpler.

Thanks,
Shaohua

> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 7d893228c40f..db79bd22418b 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -3607,8 +3607,11 @@ static void raid_postsuspend(struct dm_target *ti)
>  {
>  	struct raid_set *rs = ti->private;
>  
> -	if (!rs->md.suspended)
> +	if (!rs->md.suspended) {
> +		mddev_lock_nointr(&rs->md);
>  		mddev_suspend(&rs->md);
> +		mddev_unlock(&rs->md);
> +	}
>  
>  	rs->md.ro = 1;
>  }
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 10367ffe92e3..6cbb37a7d445 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -320,14 +320,22 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>   * are completely handled.
>   * Once mddev_detach() is called and completes, the module will be
>   * completely unused.
> + * The caller must hold the mddev_lock.
> + * mddev_suspend() will update the superblock if it
> + * turns out that something is waiting in md_write_start().
>   */
>  void mddev_suspend(struct mddev *mddev)
>  {
>  	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
>  	if (mddev->suspended++)
>  		return;
> +	lockdep_assert_held(&mddev->reconfig_mutex);
> +
>  	synchronize_rcu();
> -	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
> +	wait_event_cmd(mddev->sb_wait, atomic_read(&mddev->active_io) == 0,
> +		       if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> +			       md_update_sb(mddev, 0),
> +		);
>  	mddev->pers->quiesce(mddev, 1);
>  
>  	del_timer_sync(&mddev->safemode_timer);
> @@ -7971,6 +7979,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
>  			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
>  			set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
>  			md_wakeup_thread(mddev->thread);
> +			wake_up(&mddev->sb_wait);
>  			did_change = 1;
>  		}
>  		spin_unlock(&mddev->lock);
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 4c00bc248287..c231c4a29903 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -682,13 +682,11 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>  	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
>  		mdname(mddev));
>  
> -	/* wait superblock change before suspend */
> -	wait_event(mddev->sb_wait,
> -		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
> -
> +	mddev_lock_nointr(mddev);
>  	mddev_suspend(mddev);
>  	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>  	mddev_resume(mddev);
> +	mddev_unlock(mddev);
>  }
>  
>  static void r5l_submit_current_io(struct r5l_log *log)
> @@ -2542,6 +2540,7 @@ int r5c_journal_mode_set(struct mddev *mddev, int mode)
>  {
>  	struct r5conf *conf = mddev->private;
>  	struct r5l_log *log = conf->log;
> +	int ret = 0;
>  
>  	if (!log)
>  		return -ENODEV;
> @@ -2550,17 +2549,20 @@ int r5c_journal_mode_set(struct mddev *mddev, int mode)
>  	    mode > R5C_JOURNAL_MODE_WRITE_BACK)
>  		return -EINVAL;
>  
> +	mddev_lock_nointr(mddev);
>  	if (raid5_calc_degraded(conf) > 0 &&
>  	    mode == R5C_JOURNAL_MODE_WRITE_BACK)
> -		return -EINVAL;
> -
> -	mddev_suspend(mddev);
> -	conf->log->r5c_journal_mode = mode;
> -	mddev_resume(mddev);
> +		ret = -EINVAL;
> +	else {
> +		mddev_suspend(mddev);
> +		conf->log->r5c_journal_mode = mode;
> +		mddev_resume(mddev);
>  
> -	pr_debug("md/raid:%s: setting r5c cache mode to %d: %s\n",
> -		 mdname(mddev), mode, r5c_journal_mode_str[mode]);
> -	return 0;
> +		pr_debug("md/raid:%s: setting r5c cache mode to %d: %s\n",
> +			 mdname(mddev), mode, r5c_journal_mode_str[mode]);
> +	}
> +	mddev_unlock(mddev);
> +	return ret;
>  }
>  EXPORT_SYMBOL(r5c_journal_mode_set);
>  



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-24 22:57       ` Shaohua Li
  2017-05-25  1:30         ` NeilBrown
@ 2017-05-28 23:17         ` NeilBrown
  2017-05-30 17:41           ` Shaohua Li
  1 sibling, 1 reply; 29+ messages in thread
From: NeilBrown @ 2017-05-28 23:17 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Nix, linux-raid

[-- Attachment #1: Type: text/plain, Size: 8801 bytes --]

On Wed, May 24 2017, Shaohua Li wrote:

> On Wed, May 24, 2017 at 11:24:21AM +1000, Neil Brown wrote:
>> On Mon, May 22 2017, Nix wrote:
>> 
>> > On 22 May 2017, NeilBrown told this:
>> >
>> >> Probably something like this:
>> >>
>> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> >> index f6ae1d67bcd0..dbca31be22a1 100644
>> >> --- a/drivers/md/md.c
>> >> +++ b/drivers/md/md.c
>> >> @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws)
>> >>   */
>> >>  void md_check_recovery(struct mddev *mddev)
>> >>  {
>> >> -	if (mddev->suspended)
>> >> -		return;
>> >>  
>> >>  	if (mddev->bitmap)
>> >>  		bitmap_daemon_work(mddev);
>> >> @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev)
>> >>  		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
>> >>  
>> >>  		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
>> >> +		    mddev->suspended ||
>> >>  		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
>> >>  			goto not_running;
>> >>  		/* no recovery is running.
>> >>
>> >> though it's late so don't trust anything I write.
>> >>
>> >> If you try again it will almost certainly succeed.  I suspect this is a
>> >> hard race to hit - well done!!!
>> >
>> > Definitely not a hard race to hit :( I just hit it again with this
>> > patch.
>> >
>> > Absolutely identical hang:
>> >
>> > [  495.833520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > [  495.840618] mdadm           D    0  2700   2537 0x00000000
>> > [  495.847762] Call Trace:
>> > [  495.854825]  __schedule+0x290/0x810
>> > [  495.861905]  schedule+0x36/0x80
>> > [  495.868934]  mddev_suspend+0xb3/0xe0
>> > [  495.875926]  ? wake_atomic_t_function+0x60/0x60
>> > [  495.882976]  level_store+0x1a7/0x6c0
>> > [  495.889953]  ? md_ioctl+0xb7/0x1c10
>> > [  495.896901]  ? putname+0x53/0x60
>> > [  495.903807]  md_attr_store+0x83/0xc0
>> > [  495.910684]  sysfs_kf_write+0x37/0x40
>> > [  495.917547]  kernfs_fop_write+0x110/0x1a0
>> > [  495.924429]  __vfs_write+0x28/0x120
>> > [  495.931270]  ? kernfs_iop_get_link+0x172/0x1e0
>> > [  495.938126]  ? __alloc_fd+0x3f/0x170
>> > [  495.944906]  vfs_write+0xb6/0x1d0
>> > [  495.951646]  SyS_write+0x46/0xb0
>> > [  495.958338]  entry_SYSCALL_64_fastpath+0x13/0x94
>> >
>> > Everything else hangs the same way, too. This was surprising enough that
>> > I double-checked to be sure the patch was applied: it was. I suspect the
>> > deadlock is somewhat different than you supposed... (and quite possibly
>> > not a race at all, or I wouldn't be hitting it so consistently, every
>> > time. I mean, I only need to miss it *once* and I'll have reshaped... :) )
>> >
>> > It seems I can reproduce this on demand, so if you want to throw a patch
>> > with piles of extra printks my way, feel free.
>> 
>> Did you have md_write_start being called by syslog-ng again?
>> I wonder what syslog is logging - presumably something about the reshape
>> starting.
>> If you kill syslog-ng, can you start the reshape?
>> 
>> Alternatively, this might do it.
>> I think the root problem is that it isn't safe to call mddev_suspend()
>> while holding the reconfig_mutex.
>> For complete safety I probably need to move the request_module() call
>> earlier, as that could block if a device was suspended (no memory
>> allocation allowed while device is suspended).
>> 
>> Thanks,
>> NeilBrown
>> 
>> 
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 10367ffe92e3..a7b9c0576479 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -324,8 +324,12 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>>  void mddev_suspend(struct mddev *mddev)
>>  {
>>  	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
>> +
>>  	if (mddev->suspended++)
>>  		return;
>> +#ifdef CONFIG_LOCKDEP
>> +	WARN_ON_ONCE(debug_locks && lockdep_is_held(&mddev->reconfig_mutex));
>> +#endif
>>  	synchronize_rcu();
>>  	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
>>  	mddev->pers->quiesce(mddev, 1);
>> @@ -3594,9 +3598,12 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>>  	if (slen == 0 || slen >= sizeof(clevel))
>>  		return -EINVAL;
>>  
>> +	mddev_suspend(mddev);
>>  	rv = mddev_lock(mddev);
>> -	if (rv)
>> +	if (rv) {
>> +		mddev_resume(mddev);
>>  		return rv;
>> +	}
>>  
>>  	if (mddev->pers == NULL) {
>>  		strncpy(mddev->clevel, buf, slen);
>> @@ -3687,7 +3694,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>>  	}
>>  
>>  	/* Looks like we have a winner */
>> -	mddev_suspend(mddev);
>>  	mddev_detach(mddev);
>>  
>>  	spin_lock(&mddev->lock);
>> @@ -3771,13 +3777,13 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>>  	blk_set_stacking_limits(&mddev->queue->limits);
>>  	pers->run(mddev);
>>  	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
>> -	mddev_resume(mddev);
>>  	if (!mddev->thread)
>>  		md_update_sb(mddev, 1);
>>  	sysfs_notify(&mddev->kobj, NULL, "level");
>>  	md_new_event(mddev);
>>  	rv = len;
>>  out_unlock:
>> +	mddev_resume(mddev);
>>  	mddev_unlock(mddev);
>>  	return rv;
>>  }
>> @@ -4490,6 +4496,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>  		int err;
>>  		if (mddev->pers->start_reshape == NULL)
>>  			return -EINVAL;
>> +		mddev_suspend(mddev);
>>  		err = mddev_lock(mddev);
>>  		if (!err) {
>>  			if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>> @@ -4500,6 +4507,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>  			}
>>  			mddev_unlock(mddev);
>>  		}
>> +		mddev_resume(mddev);
>>  		if (err)
>>  			return err;
>>  		sysfs_notify(&mddev->kobj, NULL, "degraded");
>
> The analysis makes a lot of sense, thanks! The patch doesn't look like it
> solves the problem though, because check_recovery will not write the
> superblock if suspended isn't 0.
>
> I'm thinking of an alternative way to fix this issue. A request which stalls
> in md_write_start really isn't an active IO. We could add another counter in
> md_write_start to record stalled requests there. Then in mddev_suspend, the
> wait event will wait for atomic_read(&mddev->active_io) == the_new_counter.
>

Sorry I didn't respond to this suggestion earlier.

It is important that
   mddev_suspend(mddev)
   mddev_detach(mddev)

results in the personality module being unused so it can be removed, as
module_put() will soon be called.
If a thread is stuck in md_write_start(), then when it is released it
will return to the personality's make_request function and will execute
code in the module.  So we cannot safely call module_put() while
code is blocked in md_write_start().
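
Schematically (simplified, not the verbatim raid5 code):

/* The stuck thread sleeps *inside* the personality module: */
static void raid5_make_request(struct mddev *mddev, struct bio *bi)
{
	md_write_start(mddev, bi);	/* may block here, in raid456.ko */
	/* ... stripe handling ... */
}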

We could possibly move the md_write_start() call into md_make_request(),
so the personality code will never block.
Maybe something like this?
We probably also need to change all md_write_start() calls in modules to
md_write_inc().

Thanks,
NeilBrown


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 10367ffe92e3..3e45bb429a22 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -277,6 +277,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 		bio_endio(bio);
 		return BLK_QC_T_NONE;
 	}
+check_suspended:
 	smp_rmb(); /* Ensure implications of  'active' are visible */
 	rcu_read_lock();
 	if (mddev->suspended) {
@@ -294,6 +295,13 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 	}
 	atomic_inc(&mddev->active_io);
 	rcu_read_unlock();
+	if (rw == WRITE) {
+		md_write_start(mddev, bio);
+		if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
+			md_write_end(mddev);
+			goto check_suspended;
+		}
+	}
 
 	/*
 	 * save the sectors now since our bio can
@@ -309,6 +317,8 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 	part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw], sectors);
 	part_stat_unlock();
 
+	md_write_end(mddev);
+
 	if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
 		wake_up(&mddev->sb_wait);
 
@@ -327,6 +337,7 @@ void mddev_suspend(struct mddev *mddev)
 	if (mddev->suspended++)
 		return;
 	synchronize_rcu();
+	wake_up(&mddev->sb_wait);
 	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
 	mddev->pers->quiesce(mddev, 1);
 
@@ -7979,7 +7990,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 	if (did_change)
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
 	wait_event(mddev->sb_wait,
-		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
+		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
 }
 EXPORT_SYMBOL(md_write_start);
 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
  2017-05-28 23:17         ` NeilBrown
@ 2017-05-30 17:41           ` Shaohua Li
  2017-06-05  6:49             ` [PATCH] md: fix deadlock between mddev_suspend() and md_write_start() NeilBrown
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2017-05-30 17:41 UTC (permalink / raw)
  To: NeilBrown; +Cc: Nix, linux-raid

On Mon, May 29, 2017 at 09:17:10AM +1000, Neil Brown wrote:
> On Wed, May 24 2017, Shaohua Li wrote:
> 
> > On Wed, May 24, 2017 at 11:24:21AM +1000, Neil Brown wrote:
> >> On Mon, May 22 2017, Nix wrote:
> >> 
> >> > On 22 May 2017, NeilBrown told this:
> >> >
> >> >> Probably something like this:
> >> >>
> >> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> >> index f6ae1d67bcd0..dbca31be22a1 100644
> >> >> --- a/drivers/md/md.c
> >> >> +++ b/drivers/md/md.c
> >> >> @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws)
> >> >>   */
> >> >>  void md_check_recovery(struct mddev *mddev)
> >> >>  {
> >> >> -	if (mddev->suspended)
> >> >> -		return;
> >> >>  
> >> >>  	if (mddev->bitmap)
> >> >>  		bitmap_daemon_work(mddev);
> >> >> @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev)
> >> >>  		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
> >> >>  
> >> >>  		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> >> >> +		    mddev->suspended ||
> >> >>  		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
> >> >>  			goto not_running;
> >> >>  		/* no recovery is running.
> >> >>
> >> >> though it's late so don't trust anything I write.
> >> >>
> >> >> If you try again it will almost certainly succeed.  I suspect this is a
> >> >> hard race to hit - well done!!!
> >> >
> >> > Definitely not a hard race to hit :( I just hit it again with this
> >> > patch.
> >> >
> >> > Absolutely identical hang:
> >> >
> >> > [  495.833520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> > [  495.840618] mdadm           D    0  2700   2537 0x00000000
> >> > [  495.847762] Call Trace:
> >> > [  495.854825]  __schedule+0x290/0x810
> >> > [  495.861905]  schedule+0x36/0x80
> >> > [  495.868934]  mddev_suspend+0xb3/0xe0
> >> > [  495.875926]  ? wake_atomic_t_function+0x60/0x60
> >> > [  495.882976]  level_store+0x1a7/0x6c0
> >> > [  495.889953]  ? md_ioctl+0xb7/0x1c10
> >> > [  495.896901]  ? putname+0x53/0x60
> >> > [  495.903807]  md_attr_store+0x83/0xc0
> >> > [  495.910684]  sysfs_kf_write+0x37/0x40
> >> > [  495.917547]  kernfs_fop_write+0x110/0x1a0
> >> > [  495.924429]  __vfs_write+0x28/0x120
> >> > [  495.931270]  ? kernfs_iop_get_link+0x172/0x1e0
> >> > [  495.938126]  ? __alloc_fd+0x3f/0x170
> >> > [  495.944906]  vfs_write+0xb6/0x1d0
> >> > [  495.951646]  SyS_write+0x46/0xb0
> >> > [  495.958338]  entry_SYSCALL_64_fastpath+0x13/0x94
> >> >
> >> > Everything else hangs the same way, too. This was surprising enough that
> >> > I double-checked to be sure the patch was applied: it was. I suspect the
> >> > deadlock is somewhat different than you supposed... (and quite possibly
> >> > not a race at all, or I wouldn't be hitting it so consistently, every
> >> > time. I mean, I only need to miss it *once* and I'll have reshaped... :) )
> >> >
> >> > It seems I can reproduce this on demand, so if you want to throw a patch
> >> > with piles of extra printks my way, feel free.
> >> 
> >> Did you have md_write_start being called by syslog-ng again?
> >> I wonder what syslog is logging - presumably something about the reshape
> >> starting.
> >> If you kill syslog-ng, can you start the reshape?
> >> 
> >> Alternately, this might do it.
> >> I think the root problem is that it isn't safe to call mddev_suspend()
> >> while holding the reconfig_mutex.
> >> For complete safety I probably need to move the request_module() call
> >> earlier, as that could block if a device was suspended (no memory
> >> allocation allowed while device is suspended).
> >> 
> >> Thanks,
> >> NeilBrown
> >> 
> >> 
> >> 
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index 10367ffe92e3..a7b9c0576479 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -324,8 +324,12 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
> >>  void mddev_suspend(struct mddev *mddev)
> >>  {
> >>  	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
> >> +
> >>  	if (mddev->suspended++)
> >>  		return;
> >> +#ifdef CONFIG_LOCKDEP
> >> +	WARN_ON_ONCE(debug_locks && lockdep_is_held(&mddev->reconfig_mutex));
> >> +#endif
> >>  	synchronize_rcu();
> >>  	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
> >>  	mddev->pers->quiesce(mddev, 1);
> >> @@ -3594,9 +3598,12 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
> >>  	if (slen == 0 || slen >= sizeof(clevel))
> >>  		return -EINVAL;
> >>  
> >> +	mddev_suspend(mddev);
> >>  	rv = mddev_lock(mddev);
> >> -	if (rv)
> >> +	if (rv) {
> >> +		mddev_resume(mddev);
> >>  		return rv;
> >> +	}
> >>  
> >>  	if (mddev->pers == NULL) {
> >>  		strncpy(mddev->clevel, buf, slen);
> >> @@ -3687,7 +3694,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
> >>  	}
> >>  
> >>  	/* Looks like we have a winner */
> >> -	mddev_suspend(mddev);
> >>  	mddev_detach(mddev);
> >>  
> >>  	spin_lock(&mddev->lock);
> >> @@ -3771,13 +3777,13 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
> >>  	blk_set_stacking_limits(&mddev->queue->limits);
> >>  	pers->run(mddev);
> >>  	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> >> -	mddev_resume(mddev);
> >>  	if (!mddev->thread)
> >>  		md_update_sb(mddev, 1);
> >>  	sysfs_notify(&mddev->kobj, NULL, "level");
> >>  	md_new_event(mddev);
> >>  	rv = len;
> >>  out_unlock:
> >> +	mddev_resume(mddev);
> >>  	mddev_unlock(mddev);
> >>  	return rv;
> >>  }
> >> @@ -4490,6 +4496,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
> >>  		int err;
> >>  		if (mddev->pers->start_reshape == NULL)
> >>  			return -EINVAL;
> >> +		mddev_suspend(mddev);
> >>  		err = mddev_lock(mddev);
> >>  		if (!err) {
> >>  			if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> >> @@ -4500,6 +4507,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
> >>  			}
> >>  			mddev_unlock(mddev);
> >>  		}
> >> +		mddev_resume(mddev);
> >>  		if (err)
> >>  			return err;
> >>  		sysfs_notify(&mddev->kobj, NULL, "degraded");
> >
> > The analysis makes a lot of sense, thanks! The patch doesn't look like it
> > solves the problem, though, because md_check_recovery() will not write the
> > superblock while 'suspended' isn't 0.
> >
> > I'm thinking of an alternative way to fix this issue. A request which stalls
> > in md_write_start() really isn't an active IO. We could add another counter in
> > md_write_start() to record stalled requests there. Then in mddev_suspend(), the
> > wait_event() would wait for atomic_read(&mddev->active_io) == the_new_counter.
> >
> 
> Sorry I didn't respond to this suggestion earlier.
> 
> It is important that
>    mddev_suspend(mddev)
>    mddev_detach(mddev)
> 
> results in the personality module being unused so it can be removed, as
> module_put() will soon be called.
> If a thread is stuck in md_write_start(), then when it is released it
> will return to the personality's make_request function and will execute
> code in the module.  So we cannot safely call module_put() while
> code is blocked in md_write_start().

Ok, I thought mddev_suspend() only suspended the array. This makes sense,
thanks!
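
[Ed: to make the wait cycle concrete, here is a minimal userspace model of
the deadlock under discussion -- plain C with pthreads, not kernel code;
the names merely mirror their md counterparts. One thread parks where
md_write_start() would, the other where mddev_suspend() would, and the
superblock update that would release the first can never run because it
needs the mutex the second holds.]

        /* deadlock-model.c: build with  cc -pthread deadlock-model.c */
        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <unistd.h>

        static pthread_mutex_t reconfig_mutex = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t sb_wait = PTHREAD_COND_INITIALIZER;
        static int active_io;
        static bool sb_change_pending = true; /* superblock update outstanding */

        static void *writer(void *arg)        /* models md_write_start() */
        {
                (void)arg;
                pthread_mutex_lock(&lk);
                active_io++;                  /* bio entered md_make_request() */
                while (sb_change_pending)     /* needs md_check_recovery()... */
                        pthread_cond_wait(&sb_wait, &lk);
                active_io--;
                pthread_cond_broadcast(&sb_wait);
                pthread_mutex_unlock(&lk);
                return NULL;
        }

        static void *suspender(void *arg)     /* models level_store() */
        {
                (void)arg;
                pthread_mutex_lock(&reconfig_mutex); /* held across suspend */
                pthread_mutex_lock(&lk);
                while (active_io != 0)        /* mddev_suspend() drains I/O */
                        pthread_cond_wait(&sb_wait, &lk);
                pthread_mutex_unlock(&lk);
                /* ...but clearing sb_change_pending (the superblock write in
                 * md_check_recovery()) needs reconfig_mutex, held above. */
                pthread_mutex_unlock(&reconfig_mutex);
                return NULL;
        }

        int main(void)
        {
                pthread_t t1, t2;
                pthread_create(&t1, NULL, writer, NULL);
                sleep(1);                     /* let the writer park first */
                pthread_create(&t2, NULL, suspender, NULL);
                sleep(2);
                puts("writer and suspender both still waiting: deadlocked");
                return 0;                     /* exiting kills the stuck threads */
        }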

> We could possibly move the md_write_start() call into md_make_request(),
> so the personality code will never block.
> Maybe something like this?
> We probably also need to change all md_write_start() calls in modules to
> md_write_inc().

I like this idea very much.

Thanks,
Shaohua
 
> Thanks,
> NeilBrown
> 
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 10367ffe92e3..3e45bb429a22 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -277,6 +277,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>  		bio_endio(bio);
>  		return BLK_QC_T_NONE;
>  	}
> +check_suspended:
>  	smp_rmb(); /* Ensure implications of  'active' are visible */
>  	rcu_read_lock();
>  	if (mddev->suspended) {
> @@ -294,6 +295,13 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>  	}
>  	atomic_inc(&mddev->active_io);
>  	rcu_read_unlock();
> +	if (rw == WRITE) {
> +		md_write_start(mddev, bio);
> +		if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
> +			md_write_end(mddev);
> +			goto check_suspended;
> +		}
> +	}
>  
>  	/*
>  	 * save the sectors now since our bio can
> @@ -309,6 +317,8 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>  	part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw], sectors);
>  	part_stat_unlock();
>  
> +	md_write_end(mddev);
> +
>  	if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
>  		wake_up(&mddev->sb_wait);
>  
> @@ -327,6 +337,7 @@ void mddev_suspend(struct mddev *mddev)
>  	if (mddev->suspended++)
>  		return;
>  	synchronize_rcu();
> +	wake_up(&mddev->sb_wait);
>  	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
>  	mddev->pers->quiesce(mddev, 1);
>  
> @@ -7979,7 +7990,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
>  	if (did_change)
>  		sysfs_notify_dirent_safe(mddev->sysfs_state);
>  	wait_event(mddev->sb_wait,
> -		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
> +		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
>  }
>  EXPORT_SYMBOL(md_write_start);
>  



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] md: fix deadlock between mddev_suspend() and md_write_start()
  2017-05-30 17:41           ` Shaohua Li
@ 2017-06-05  6:49             ` NeilBrown
  2017-06-06  0:01               ` Shaohua Li
  0 siblings, 1 reply; 29+ messages in thread
From: NeilBrown @ 2017-06-05  6:49 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Nix, linux-raid

[-- Attachment #1: Type: text/plain, Size: 14893 bytes --]


If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the ->make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen when
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.

We fix this by having md_write_start() abort if mddev_suspend()
is happening, and ->make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the ->active_io
count, and wait for mddev_suspend().

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/faulty.c    |  5 +++--
 drivers/md/linear.c    |  7 ++++---
 drivers/md/md.c        | 23 +++++++++++++++++------
 drivers/md/md.h        |  4 ++--
 drivers/md/multipath.c |  8 ++++----
 drivers/md/raid0.c     |  7 ++++---
 drivers/md/raid1.c     | 11 +++++++----
 drivers/md/raid10.c    | 10 ++++++----
 drivers/md/raid5.c     | 17 +++++++++--------
 9 files changed, 56 insertions(+), 36 deletions(-)

diff --git a/drivers/md/faulty.c b/drivers/md/faulty.c
index b0536cfd8e17..06a64d5d8c6c 100644
--- a/drivers/md/faulty.c
+++ b/drivers/md/faulty.c
@@ -170,7 +170,7 @@ static void add_sector(struct faulty_conf *conf, sector_t start, int mode)
 		conf->nfaults = n+1;
 }
 
-static void faulty_make_request(struct mddev *mddev, struct bio *bio)
+static bool faulty_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct faulty_conf *conf = mddev->private;
 	int failit = 0;
@@ -182,7 +182,7 @@ static void faulty_make_request(struct mddev *mddev, struct bio *bio)
 			 * just fail immediately
 			 */
 			bio_io_error(bio);
-			return;
+			return true;
 		}
 
 		if (check_sector(conf, bio->bi_iter.bi_sector,
@@ -224,6 +224,7 @@ static void faulty_make_request(struct mddev *mddev, struct bio *bio)
 		bio->bi_bdev = conf->rdev->bdev;
 
 	generic_make_request(bio);
+	return true;
 }
 
 static void faulty_status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index df6f2c98eca7..5f1eb9189542 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -245,7 +245,7 @@ static void linear_free(struct mddev *mddev, void *priv)
 	kfree(conf);
 }
 
-static void linear_make_request(struct mddev *mddev, struct bio *bio)
+static bool linear_make_request(struct mddev *mddev, struct bio *bio)
 {
 	char b[BDEVNAME_SIZE];
 	struct dev_info *tmp_dev;
@@ -254,7 +254,7 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
 
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
 		md_flush_request(mddev, bio);
-		return;
+		return true;
 	}
 
 	tmp_dev = which_dev(mddev, bio_sector);
@@ -292,7 +292,7 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
 		mddev_check_write_zeroes(mddev, bio);
 		generic_make_request(bio);
 	}
-	return;
+	return true;
 
 out_of_bounds:
 	pr_err("md/linear:%s: make_request: Sector %llu out of bounds on dev %s: %llu sectors, offset %llu\n",
@@ -302,6 +302,7 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
 	       (unsigned long long)tmp_dev->rdev->sectors,
 	       (unsigned long long)start_sector);
 	bio_io_error(bio);
+	return true;
 }
 
 static void linear_status (struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 23f4adf3a8cc..da59051545a2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -277,7 +277,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 		bio_endio(bio);
 		return BLK_QC_T_NONE;
 	}
-	smp_rmb(); /* Ensure implications of  'active' are visible */
+check_suspended:
 	rcu_read_lock();
 	if (mddev->suspended) {
 		DEFINE_WAIT(__wait);
@@ -302,7 +302,11 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 	sectors = bio_sectors(bio);
 	/* bio could be mergeable after passing to underlayer */
 	bio->bi_opf &= ~REQ_NOMERGE;
-	mddev->pers->make_request(mddev, bio);
+	if (!mddev->pers->make_request(mddev, bio)) {
+		atomic_dec(&mddev->active_io);
+		wake_up(&mddev->sb_wait);
+		goto check_suspended;
+	}
 
 	cpu = part_stat_lock();
 	part_stat_inc(cpu, &mddev->gendisk->part0, ios[rw]);
@@ -327,6 +331,7 @@ void mddev_suspend(struct mddev *mddev)
 	if (mddev->suspended++)
 		return;
 	synchronize_rcu();
+	wake_up(&mddev->sb_wait);
 	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
 	mddev->pers->quiesce(mddev, 1);
 
@@ -7950,12 +7955,14 @@ EXPORT_SYMBOL(md_done_sync);
  * If we need to update some array metadata (e.g. 'active' flag
  * in superblock) before writing, schedule a superblock update
  * and wait for it to complete.
+ * A return value of 'false' means that the write wasn't recorded
+ * and cannot proceed as the array is being suspended.
  */
-void md_write_start(struct mddev *mddev, struct bio *bi)
+bool md_write_start(struct mddev *mddev, struct bio *bi)
 {
 	int did_change = 0;
 	if (bio_data_dir(bi) != WRITE)
-		return;
+		return true;
 
 	BUG_ON(mddev->ro == 1);
 	if (mddev->ro == 2) {
@@ -7987,9 +7994,13 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 	if (did_change)
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
 	wait_event(mddev->sb_wait,
-		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
+		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
+	if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
+		percpu_ref_put(&mddev->writes_pending);
+		return false;
+	}
+	return true;
 }
-EXPORT_SYMBOL(md_write_start);
 
 /* md_write_inc can only be called when md_write_start() has
 * already been called at least once for the current request.
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 0fa1de42c42b..63d342d560b8 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -510,7 +510,7 @@ struct md_personality
 	int level;
 	struct list_head list;
 	struct module *owner;
-	void (*make_request)(struct mddev *mddev, struct bio *bio);
+	bool (*make_request)(struct mddev *mddev, struct bio *bio);
 	int (*run)(struct mddev *mddev);
 	void (*free)(struct mddev *mddev, void *priv);
 	void (*status)(struct seq_file *seq, struct mddev *mddev);
@@ -649,7 +649,7 @@ extern void md_wakeup_thread(struct md_thread *thread);
 extern void md_check_recovery(struct mddev *mddev);
 extern void md_reap_sync_thread(struct mddev *mddev);
 extern int mddev_init_writes_pending(struct mddev *mddev);
-extern void md_write_start(struct mddev *mddev, struct bio *bi);
+extern bool md_write_start(struct mddev *mddev, struct bio *bi);
 extern void md_write_inc(struct mddev *mddev, struct bio *bi);
 extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index e95d521d93e9..c8d985ba008d 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -106,7 +106,7 @@ static void multipath_end_request(struct bio *bio)
 	rdev_dec_pending(rdev, conf->mddev);
 }
 
-static void multipath_make_request(struct mddev *mddev, struct bio * bio)
+static bool multipath_make_request(struct mddev *mddev, struct bio * bio)
 {
 	struct mpconf *conf = mddev->private;
 	struct multipath_bh * mp_bh;
@@ -114,7 +114,7 @@ static void multipath_make_request(struct mddev *mddev, struct bio * bio)
 
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
 		md_flush_request(mddev, bio);
-		return;
+		return true;
 	}
 
 	mp_bh = mempool_alloc(conf->pool, GFP_NOIO);
@@ -126,7 +126,7 @@ static void multipath_make_request(struct mddev *mddev, struct bio * bio)
 	if (mp_bh->path < 0) {
 		bio_io_error(bio);
 		mempool_free(mp_bh, conf->pool);
-		return;
+		return true;
 	}
 	multipath = conf->multipaths + mp_bh->path;
 
@@ -141,7 +141,7 @@ static void multipath_make_request(struct mddev *mddev, struct bio * bio)
 	mddev_check_writesame(mddev, &mp_bh->bio);
 	mddev_check_write_zeroes(mddev, &mp_bh->bio);
 	generic_make_request(&mp_bh->bio);
-	return;
+	return true;
 }
 
 static void multipath_status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index d6c0bc76e837..94d9ae9b0fd0 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -548,7 +548,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 	bio_endio(bio);
 }
 
-static void raid0_make_request(struct mddev *mddev, struct bio *bio)
+static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
 	struct md_rdev *tmp_dev;
@@ -559,12 +559,12 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
 		md_flush_request(mddev, bio);
-		return;
+		return true;
 	}
 
 	if (unlikely((bio_op(bio) == REQ_OP_DISCARD))) {
 		raid0_handle_discard(mddev, bio);
-		return;
+		return true;
 	}
 
 	bio_sector = bio->bi_iter.bi_sector;
@@ -599,6 +599,7 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 	mddev_check_writesame(mddev, bio);
 	mddev_check_write_zeroes(mddev, bio);
 	generic_make_request(bio);
+	return true;
 }
 
 static void raid0_status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e1a7e3d4c5e4..c71739b87ab7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1321,7 +1321,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	 * Continue immediately if no resync is active currently.
 	 */
 
-	md_write_start(mddev, bio); /* wait on superblock update early */
 
 	if ((bio_end_sector(bio) > mddev->suspend_lo &&
 	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
@@ -1550,13 +1549,13 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	wake_up(&conf->wait_barrier);
 }
 
-static void raid1_make_request(struct mddev *mddev, struct bio *bio)
+static bool raid1_make_request(struct mddev *mddev, struct bio *bio)
 {
 	sector_t sectors;
 
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
 		md_flush_request(mddev, bio);
-		return;
+		return true;
 	}
 
 	/*
@@ -1571,8 +1570,12 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 
 	if (bio_data_dir(bio) == READ)
 		raid1_read_request(mddev, bio, sectors, NULL);
-	else
+	else {
+		if (!md_write_start(mddev,bio))
+			return false;
 		raid1_write_request(mddev, bio, sectors);
+	}
+	return true;
 }
 
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 797ed60abd5e..52acffa7a06a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1303,8 +1303,6 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	sector_t sectors;
 	int max_sectors;
 
-	md_write_start(mddev, bio);
-
 	/*
 	 * Register the new request and wait if the reconstruction
 	 * thread has put up a bar for new requests.
@@ -1525,7 +1523,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 		raid10_write_request(mddev, bio, r10_bio);
 }
 
-static void raid10_make_request(struct mddev *mddev, struct bio *bio)
+static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r10conf *conf = mddev->private;
 	sector_t chunk_mask = (conf->geo.chunk_mask & conf->prev.chunk_mask);
@@ -1534,9 +1532,12 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
 
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
 		md_flush_request(mddev, bio);
-		return;
+		return true;
 	}
 
+	if (!md_write_start(mddev, bio))
+		return false;
+
 	/*
 	 * If this request crosses a chunk boundary, we need to split
 	 * it.
@@ -1553,6 +1554,7 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
 
 	/* In case raid10d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
+	return true;
 }
 
 static void raid10_status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 861fc9a8d1be..ad1a644a17bc 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5469,7 +5469,6 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size>>9);
 
 	bi->bi_next = NULL;
-	md_write_start(mddev, bi);
 
 	stripe_sectors = conf->chunk_sectors *
 		(conf->raid_disks - conf->max_degraded);
@@ -5539,11 +5538,10 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		release_stripe_plug(mddev, sh);
 	}
 
-	md_write_end(mddev);
 	bio_endio(bi);
 }
 
-static void raid5_make_request(struct mddev *mddev, struct bio * bi)
+static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 {
 	struct r5conf *conf = mddev->private;
 	int dd_idx;
@@ -5559,10 +5557,10 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		int ret = r5l_handle_flush_request(conf->log, bi);
 
 		if (ret == 0)
-			return;
+			return true;
 		if (ret == -ENODEV) {
 			md_flush_request(mddev, bi);
-			return;
+			return true;
 		}
 		/* ret == -EAGAIN, fallback */
 		/*
@@ -5572,6 +5570,8 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		do_flush = bi->bi_opf & REQ_PREFLUSH;
 	}
 
+	if (!md_write_start(mddev, bi))
+		return false;
 	/*
 	 * If array is degraded, better not do chunk aligned read because
 	 * later we might have to read it again in order to reconstruct
@@ -5581,18 +5581,18 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	    mddev->reshape_position == MaxSector) {
 		bi = chunk_aligned_read(mddev, bi);
 		if (!bi)
-			return;
+			return true;
 	}
 
 	if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
 		make_discard_request(mddev, bi);
-		return;
+		md_write_end(mddev);
+		return true;
 	}
 
 	logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
-	md_write_start(mddev, bi);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
@@ -5730,6 +5730,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	if (rw == WRITE)
 		md_write_end(mddev);
 	bio_endio(bi);
+	return true;
 }
 
 static sector_t raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks);
-- 
2.12.2
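
[Ed: the net effect of this patch on the personality API: ->make_request()
now returns bool; write paths must call md_write_start() first and
propagate a 'false' return, which tells md_make_request() to drop
->active_io and retry once the suspend completes. As a summary only --
this is a sketch, not code from the patch, and the example_* names are
invented -- a personality following the new convention is shaped like:]

        static bool example_make_request(struct mddev *mddev, struct bio *bio)
        {
                if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
                        md_flush_request(mddev, bio);
                        return true;            /* handled; no retry needed */
                }

                if (bio_data_dir(bio) == WRITE &&
                    !md_write_start(mddev, bio))
                        return false;           /* suspending: md_make_request()
                                                 * drops active_io and retries */

                example_submit_bio(mddev, bio); /* hypothetical I/O path */

                if (bio_data_dir(bio) == WRITE)
                        md_write_end(mddev);    /* pair the successful start */
                return true;
        }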


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] md: fix deadlock between mddev_suspend() and md_write_start()
  2017-06-05  6:49             ` [PATCH] md: fix deadlock between mddev_suspend() and md_write_start() NeilBrown
@ 2017-06-06  0:01               ` Shaohua Li
  2017-06-07  1:45                 ` NeilBrown
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2017-06-06  0:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: Nix, linux-raid

On Mon, Jun 05, 2017 at 04:49:39PM +1000, Neil Brown wrote:
> 
> If mddev_suspend() races with md_write_start() we can deadlock
> with mddev_suspend() waiting for the request that is currently
> in md_write_start() to complete the ->make_request() call,
> and md_write_start() waiting for the metadata to be updated
> to mark the array as 'dirty'.
> As metadata updates done by md_check_recovery() only happen when
> the mddev_lock() can be claimed, and as mddev_suspend() is often
> called with the lock held, these threads wait indefinitely for each
> other.
> 
> We fix this by having md_write_start() abort if mddev_suspend()
> is happening, and ->make_request() aborts if md_write_start()
> aborted.
> md_make_request() can detect this abort, decrease the ->active_io
> count, and wait for mddev_suspend().

Applied the two patches, thanks! For this one, I added the md_write_start symbol back. Will
push these to Linus in the 4.12 cycle.
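
[Ed: the fixup is presumably just re-adding the export the patch dropped,
i.e. restoring, after md_write_start():

        EXPORT_SYMBOL(md_write_start);

NeilBrown confirms below that dropping it was an accident.]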


> [... the full patch, quoted verbatim, trimmed; see NeilBrown's message above ...]



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] md: fix deadlock between mddev_suspend() and md_write_start()
  2017-06-06  0:01               ` Shaohua Li
@ 2017-06-07  1:45                 ` NeilBrown
  0 siblings, 0 replies; 29+ messages in thread
From: NeilBrown @ 2017-06-07  1:45 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Nix, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1201 bytes --]

On Mon, Jun 05 2017, Shaohua Li wrote:

> On Mon, Jun 05, 2017 at 04:49:39PM +1000, Neil Brown wrote:
>> [... commit message, quoted verbatim, trimmed ...]
>
> Applied the two patches, thanks! For this one, I added the md_write_start symbol back. Will
> push these to Linus in the 4.12 cycle.
>


Thanks for fixing the md_write_start export!  At one stage in
development I made it "static", but then forgot to restore the export
when I put it back :-(

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2017-06-07  1:45 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-22  9:13 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request Nix
2017-05-22 11:35 ` NeilBrown
2017-05-22 15:30   ` Nix
2017-05-22 19:07     ` Wols Lists
2017-05-22 20:43       ` Nix
2017-05-23  1:20         ` NeilBrown
2017-05-23 10:10           ` Nix
2017-05-23  1:39       ` NeilBrown
2017-05-23 14:47         ` Wols Lists
2017-05-24  1:50           ` NeilBrown
2017-05-23  1:07     ` NeilBrown
2017-05-22 21:38   ` Nix
2017-05-23 14:16     ` Wols Lists
2017-05-23 15:00       ` Nix
2017-05-24  1:24     ` NeilBrown
2017-05-24 13:28       ` Nix
2017-05-25  1:31         ` NeilBrown
2017-05-25 12:14           ` Nix
2017-05-24 19:42       ` Nix
2017-05-24 22:57       ` Shaohua Li
2017-05-25  1:30         ` NeilBrown
2017-05-25  1:46           ` Shaohua Li
2017-05-26  3:23             ` NeilBrown
2017-05-26 16:40               ` Shaohua Li
2017-05-28 23:17         ` NeilBrown
2017-05-30 17:41           ` Shaohua Li
2017-06-05  6:49             ` [PATCH] md: fix deadlock between mddev_suspend() and md_write_start() NeilBrown
2017-06-06  0:01               ` Shaohua Li
2017-06-07  1:45                 ` NeilBrown
