linux-xfs.vger.kernel.org archive mirror
* Write-hanging on XFS 5.1.15-5.16 - xfsaild/dm blocked - xlog_cli_push_work - xfs_log_worker
@ 2019-09-21  2:25 James Harvey
  0 siblings, 0 replies; only message in thread
From: James Harvey @ 2019-09-21  2:25 UTC (permalink / raw)
  To: linux-xfs

This is about XFS, bear with me...  In QEMU, I was having trouble with a
Btrfs filesystem that, after a few hours of heavy I/O, would go into a
state where, until rebooted, anything writing to it entered
uninterruptible sleep, though reads usually still worked.

I tried XFS as an alternative to Btrfs, to determine whether this was
the fault of Btrfs or something lower-level like QEMU.  XFS showed the
exact same symptoms, going into this bad state within a few hours of
heavy I/O.  That led me to conclude it was probably a QEMU bug.

Turns out it wasn't.  XFS and Btrfs seem to have had similar-looking
bugs.  Others hit the same Btrfs problem, and it wound up being
discussed, with a patch linked, here:
https://lore.kernel.org/linux-btrfs/CAL3q7H4peDv_bQa5vGJeOM=V--yq1a1=aHat5qcsXjbnDoSkDQ@mail.gmail.com/

I've been running the Btrfs patch for several days without a lockup,
which is way longer than I could go before.

I'm therefore concluding there was no QEMU bug, which in turn leads me
to conclude the XFS hangs I experienced must have been an XFS bug.


I want to be upfront that although I will be happy to answer questions
as well as I can, I won't be able to spend time trying proposed
patches or performing further diagnostics.  If that means this bug
report never gets looked into, that's fine.  I'm not sending this to
get it fixed for me, but for everyone else.


I did find someone else with the same problem here:
https://superuser.com/questions/1458253/hanging-xfs-filesystem-on-encrypted-usb-device

You'll see I was able to replicate this within a couple of hours of
booting.  Gaps (e.g., Jul 4 - Jul 23) were when I was working on
something else or using Btrfs again, and do not indicate periods of it
working.  It never worked for more than a few hours under heavy I/O.

By heavy I/O, I mean saturating a Samsung 970 EVO 1TB with random
access.
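[The original report does not include a reproduction script.  For
anyone wanting to approximate the load described above, a hypothetical
fio job file of that shape might look like this; the directory, sizes,
queue depth, and runtime are all assumptions, not from the report:]

```ini
; reproduce.fio -- sketch of a random-access saturation load (assumed values)
[global]
directory=/mnt/xfs-test   ; mount point of the XFS filesystem under test (assumed path)
ioengine=libaio
direct=1                  ; bypass the page cache so the device itself is saturated
rw=randrw                 ; mixed random reads and writes
bs=4k
iodepth=32
time_based=1
runtime=4h                ; the report says hangs appeared within a few hours

[randio]
numjobs=4
size=8G
```

[Run with `fio reproduce.fio` and leave it going; per the report, the
hang typically appeared within a few hours.]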

You'll see at the end, the filesystem eventually got I/O errors.
There's definitely no hardware issue: I was able to replicate this on
an identical system built from entirely different physical hardware.

Like the superuser.com poster that I linked to above, I was using
LUKS.  I, of course, wasn't using a USB drive.


You can see all the relevant portions of journalctl with the
backtraces here: http://ix.io/1W7s

But, for searchability, I've included a portion of it here:

INFO: task xfsaild/dm-8:3642 blocked for more than 122 seconds.
      Not tainted 5.1.15.a-1-hardened #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xfsaild/dm-8    D    0  3642      2 0x80000080
Call Trace:
 ? __schedule+0x27c/0x8d0
 schedule+0x3c/0x80
 xfs_log_force+0x18d/0x310 [xfs]
 ? wake_up_q+0x70/0x70
 xfsaild+0x1c6/0x810 [xfs]
 ? sched_clock_cpu+0x10/0xd0
 kthread+0xfd/0x130
 ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
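[Editor's note, not from the original report: the "blocked for more
than 122 seconds" message above comes from the kernel's hung-task
detector, which is tunable via sysctl.  A sketch of a sysctl.d
fragment; the file name and values are illustrative assumptions:]

```ini
# /etc/sysctl.d/90-hung-task.conf  (hypothetical file name)
# Threshold for the hung-task detector that logged the trace above.
kernel.hung_task_timeout_secs = 120
# Set to 1 to panic on detection (useful for capturing a crash dump).
kernel.hung_task_panic = 0
```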

