From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nix
Subject: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
Date: Mon, 22 May 2017 10:13:58 +0100
Message-ID: <87lgppz221.fsf@esperi.org.uk>
Mime-Version: 1.0
Content-Type: text/plain
Return-path:
Sender: linux-raid-owner@vger.kernel.org
To: linux-raid@vger.kernel.org, linux-bcache
List-Id: linux-raid.ids

[I think bcache is blameless here, but there may be some interaction,
so I've Cc:ed both lists. I'm fairly sure the problem is
mddev_suspend() being called from level_store() and scheduling after
suspension has happened, but I'm not sure what the fix might be.]

So I have a box with three RAID arrays on the same physical
rotating-rust devices:

 - a RAID-0 /dev/md/transient atop /dev/sd*2
 - a RAID-5 /dev/md/fast, a bcache backing device (with disabled cache
   for now) atop /dev/sd*3
 - a RAID-5 /dev/md/slow atop /dev/sd*4.

The rootfs and everything the system needs to run, including mdadm, is
on /dev/md/fast. fs-wise, everything is using xfs except for
/dev/md/transient, which houses an ext4 (unjournalled, for speed over
robustness).

I just got a new disk and grew /dev/md/transient to use it without
incident, then grew /dev/md/slow into a RAID-6:

  mdadm --add /dev/md/slow /dev/sdf4
  mdadm --grow /dev/md/slow --level=6 --backup-file=/.transient/backup-slow

This worked fine. But the *exact same command* applied to /dev/md/fast
hangs:

  mdadm --add /dev/md/fast /dev/sdf3
  mdadm --grow /dev/md/fast --level=6 --backup-file=/.transient/backup-fast
  [hang here]

At this point, all disk I/O to /dev/md/fast appears suspended: because
/ is on it, this leaves me rather unable to do anything but reboot.
Thankfully the deadlock seemingly happens before anything crucial is
done, and the machine reboots and brings up /dev/md/fast as an
unreshaped raid5 array with a spare.

Is this workflow impossible? Do you have to reshape the device the
rootfs is on from early userspace or something like that? If so...
this leaves a crucial mdadm running across the early userspace
transition: what do we do about the fs the backup file is on (since
the transition usually deletes everything)? Do we even *need* a backup
file for a reshape that's growing onto a spare, as the manpage
suggests we do? Do we simply have to leave the machine unavailable for
days while the reshape completes? (I hope not.)

What hangcheck output I was able to capture by eye suggests a deadlock
in block I/O during a brief instant of suspended I/O to the device.
The first hang we see comes from bcache, but since it's in "none" mode
this is pretty clearly just a writeback (note the md_make_request at
the top there):

[ 248.877539] Workqueue: bcache cached_dev_nodata
[ 248.888590] Call Trace:
[ 248.899337]  __schedule+0x290/0x810
[ 248.909745]  schedule+0x36/0x80
[ 248.920113]  md_make_request+0x9c/0x220
[ 248.930229]  ? wake_atomic_t_function+0x60/0x60
[ 248.940063]  generic_make_request+0xfc/0x260
[ 248.949803]  cached_dev_nodata+0x28/0x80
[ 248.959393]  ? cached_dev_nodata+0x28/0x80
[ 248.968998]  process_one_work+0x1e3/0x440
[ 248.978414]  worker_thread+0x48/0x4d0
[ 248.987596]  kthread+0x108/0x140
[ 248.996747]  ? process_one_work+0x440/0x440
[ 249.005965]  ? __kthread_create_on_node+0x150/0x150
[ 249.015264]  ret_from_fork+0x29/0x40
[ 249.024544] INFO: task kworker/u40:5:245 blocked for more than 120 seconds.

This is fairly obviously

  /* If it's a flush, we send the flush to the backing device too */
  closure_bio_submit(bio, cl);

in drivers/md/bcache/request.c, which is fine.
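The md_make_request() frame at the top of that trace is, as far as I
can tell, the wait-until-resumed loop in drivers/md/md.c. From my
(possibly slightly off) reading of 4.11, that looks roughly like this,
so treat it as a sketch rather than a quotation:

  static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
  {
          struct mddev *mddev = q->queuedata;
          ...
          rcu_read_lock();
          if (mddev->suspended) {
                  /* park on mddev->sb_wait until the array is resumed */
                  DEFINE_WAIT(__wait);
                  for (;;) {
                          prepare_to_wait(&mddev->sb_wait, &__wait,
                                          TASK_UNINTERRUPTIBLE);
                          if (!mddev->suspended)
                                  break;
                          rcu_read_unlock();
                          schedule();
                          rcu_read_lock();
                  }
                  finish_wait(&mddev->sb_wait, &__wait);
          }
          /* bios that get past the check pin the array via active_io */
          atomic_inc(&mddev->active_io);
          rcu_read_unlock();
          ...
  }

so every bio submitted while mddev->suspended is set just parks there
until something resumes the array.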
After that we start to see xfs writeback hangs, all hanging on
md_make_request() or I/O scheduling:

[ 249.062946] Workqueue: writeback wb_workfn (flush-252:5)
[ 249.072733] Call Trace:
[ 249.082390]  __schedule+0x290/0x810
[ 249.092053]  schedule+0x36/0x80
[ 249.101568]  md_make_request+0x9c/0x220
[ 249.110956]  ? wake_atomic_t_function+0x60/0x60
[ 249.120202]  generic_make_request+0xfc/0x260
[ 249.129219]  submit_bio+0x64/0x120
[ 249.137968]  ? submit_bio+0x64/0x120
[ 249.146840]  ? xfs_setfilesize_trans_alloc.isra.4+0x2f/0x70
[ 249.155822]  xfs_submit_ioend+0x70/0x190
[ 249.164805]  xfs_vm_writepages+0x65/0x70
[ 249.173620]  do_writepages+0x1e/0x30
[ 249.182239]  __writeback_single_inode+0x45/0x320
[ 249.190838]  writeback_sb_inodes+0x191/0x490
[ 249.199459]  __writeback_inodes_wb+0x92/0xc0
[ 249.207988]  wb_writeback+0x243/0x2d0
[ 249.216395]  wb_workfn+0x299/0x360
[ 249.224851]  ? wb_workfn+0x299/0x360
[ 249.233393]  process_one_work+0x1e3/0x440
[ 249.242049]  worker_thread+0x48/0x4d0
[ 249.250404]  kthread+0x108/0x140
[ 249.258408]  ? process_one_work+0x440/0x440
[ 249.266148]  ? __kthread_create_on_node+0x150/0x150
[ 249.273652]  ret_from_fork+0x29/0x40

[ 249.310928] Workqueue: xfs-cil/dm-5 xlog_cil_push_work
[ 249.318453] Call Trace:
[ 249.325929]  __schedule+0x290/0x810
[ 249.333456]  schedule+0x36/0x80
[ 249.340940]  schedule_timeout+0x207/0x360
[ 249.348461]  ? cached_dev_make_request+0x2bc/0xd00
[ 249.356018]  io_schedule_timeout+0x1e/0x50
[ 249.363533]  ? io_schedule_timeout+0x1e/0x50
[ 249.371078]  wait_for_common_io.constprop.0+0x92/0x110
[ 249.378711]  ? try_to_wake_up+0x400/0x400
[ 249.386339]  wait_for_completion_io+0x18/0x20
[ 249.394026]  submit_bio_wait+0x59/0x70
[ 249.401685]  blkdev_issue_flush+0x5c/0x90
[ 249.409321]  xfs_blkdev_issue_flush+0x19/0x20
[ 249.416951]  xlog_sync+0x2a1/0x3b0
[ 249.424439]  xlog_state_release_iclog+0x6d/0xd0
[ 249.431857]  xlog_write+0x64a/0x6e0
[ 249.439092]  xlog_cil_push+0x230/0x450
[ 249.446169]  xlog_cil_push_work+0x15/0x20
[ 249.453128]  process_one_work+0x1e3/0x440
[ 249.460030]  worker_thread+0x221/0x4d0
[ 249.466942]  kthread+0x108/0x140
[ 249.473843]  ? process_one_work+0x440/0x440
[ 249.480762]  ? __kthread_create_on_node+0x150/0x150
[ 249.487743]  ret_from_fork+0x29/0x40

[ 249.524000] Workqueue: xfs-log/dm-5 xfs_log_worker
[ 249.531490] Call Trace:
[ 249.538923]  __schedule+0x290/0x810
[ 249.546366]  schedule+0x36/0x80
[ 249.553756]  schedule_timeout+0x207/0x360
[ 249.561132]  ? check_preempt_curr+0x79/0x90
[ 249.568514]  ? ttwu_do_wakeup.isra.15+0x1e/0x160
[ 249.575957]  wait_for_common+0xaa/0x160
[ 249.583432]  ? wait_for_common+0xaa/0x160
[ 249.590943]  ? try_to_wake_up+0x400/0x400
[ 249.598447]  wait_for_completion+0x1d/0x20
[ 249.605984]  flush_work+0xfb/0x1b0
[ 249.613509]  ? flush_workqueue_prep_pwqs+0x1a0/0x1a0
[ 249.621093]  xlog_cil_force_lsn+0x75/0x1d0
[ 249.628640]  ? dequeue_task_fair+0x757/0x1920
[ 249.636169]  _xfs_log_force+0x76/0x280
[ 249.643524]  ? xfs_log_worker+0x32/0x100
[ 249.650715]  xfs_log_force+0x2c/0x80
[ 249.657719]  xfs_log_worker+0x32/0x100
[ 249.664653]  process_one_work+0x1e3/0x440
[ 249.671498]  worker_thread+0x48/0x4d0
[ 249.678354]  kthread+0x108/0x140
[ 249.685218]  ? process_one_work+0x440/0x440
[ 249.692123]  ? __kthread_create_on_node+0x150/0x150
[ 249.699057]  ret_from_fork+0x29/0x40

[ 249.727654] syslog-ng D 0 1658 1657 0x00000000
[ 249.735066] Call Trace:
[ 249.742333]  __schedule+0x290/0x810
[ 249.749624]  schedule+0x36/0x80
[ 249.756915]  md_write_start+0x8f/0x160
[ 249.764185]  ? wake_atomic_t_function+0x60/0x60
[ 249.771469]  raid5_make_request+0x7a/0xd40
[ 249.778762]  ? bch_data_insert_start+0x4da/0x5e0
[ 249.786143]  ? wake_atomic_t_function+0x60/0x60
[ 249.793571]  md_make_request+0xe3/0x220
[ 249.800976]  generic_make_request+0xfc/0x260
[ 249.808383]  submit_bio+0x64/0x120
[ 249.815838]  ? submit_bio+0x64/0x120
[ 249.823253]  ? xfs_setfilesize_trans_alloc.isra.4+0x2f/0x70
[ 249.830757]  xfs_submit_ioend+0x70/0x190
[ 249.838269]  xfs_vm_writepages+0x65/0x70
[ 249.845668]  do_writepages+0x1e/0x30
[ 249.852899]  __filemap_fdatawrite_range+0x71/0x90
[ 249.860023]  filemap_write_and_wait_range+0x2a/0x70
[ 249.867061]  xfs_file_fsync+0x54/0x190
[ 249.874007]  vfs_fsync_range+0x49/0xa0
[ 249.880934]  ? __fdget_pos+0x17/0x50
[ 249.887793]  ? vfs_writev+0x3c/0x50
[ 249.894700]  do_fsync+0x3d/0x70
[ 249.901565]  SyS_fsync+0x10/0x20

I think we are stuck inside the if (mddev->suspended) { block in
md_make_request() in most of these cases; syslog-ng looks to have got
past that point and is waiting in md_write_start(), called from
raid5_make_request(), instead.

... and finally it becomes clear that mdadm is blocked and we are in
real trouble:

[ 249.965701] INFO: task mdadm:2712 blocked for more than 120 seconds.
[ 249.973003] Not tainted 4.11.2-00005-g87b2117c3309-dirty #2
[ 249.980381] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 249.987999] mdadm D 0 2712 2020 0x00000000
[ 249.995681] Call Trace:
[ 250.003273]  __schedule+0x290/0x810
[ 250.010885]  schedule+0x36/0x80
[ 250.018503]  mddev_suspend+0xb3/0xe0
[ 250.026158]  ? wake_atomic_t_function+0x60/0x60
[ 250.033856]  level_store+0x1a7/0x6c0
[ 250.041478]  ? md_ioctl+0xb7/0x1c10
[ 250.049075]  ? putname+0x53/0x60
[ 250.056554]  md_attr_store+0x83/0xc0
[ 250.063922]  sysfs_kf_write+0x37/0x40
[ 250.071082]  kernfs_fop_write+0x110/0x1a0
[ 250.078127]  __vfs_write+0x28/0x120
[ 250.085166]  ? kernfs_iop_get_link+0x172/0x1e0
[ 250.092199]  ? __alloc_fd+0x3f/0x170
[ 250.099271]  vfs_write+0xb6/0x1d0
[ 250.106307]  SyS_write+0x46/0xb0
[ 250.113227]  entry_SYSCALL_64_fastpath+0x13/0x94

I'm not sure what this is waiting on. It could be *any* of

  synchronize_rcu();
  wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
  mddev->pers->quiesce(mddev, 1);

since they all can probably schedule, and if quiescing involves
talking to the device, we have already suspended it, so we're doomed.

I'm not sure how we can quiesce before we suspend and yet prevent
things from trying to issue more crucial requests (paging in mdadm?)
before we resume again, but I'm fairly sure that *something* different
needs to be done.
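For reference, mddev_suspend() itself is, again from my possibly-faulty
reading of 4.11 (a sketch, not a quotation -- the exact form of the
flag-setting may differ), roughly:

  void mddev_suspend(struct mddev *mddev)
  {
          /* mark the array suspended first... */
          mddev->suspended = 1;
          synchronize_rcu();
          /* ...then wait for every bio that already got past the
             suspended check in md_make_request() to drop active_io... */
          wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
          /* ...and only then ask the personality to quiesce */
          mddev->pers->quiesce(mddev, 1);

          del_timer_sync(&mddev->safemode_timer);
  }

i.e. the suspend flag goes up before we wait for in-flight I/O to drain
and before the personality is quiesced, which is exactly the ordering
I'm worried about above.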