first mount(s) after unclean shutdown always fail

* first mount(s) after unclean shutdown always fail
@ 2020-07-01  0:51 Marc Lehmann
  2020-07-01  1:30 ` Qu Wenruo
  2020-07-02 18:31 ` Zygo Blaxell
  0 siblings, 2 replies; 11+ messages in thread
From: Marc Lehmann @ 2020-07-01  0:51 UTC (permalink / raw)
  To: linux-btrfs

Hi!

I have a server with multiple btrfs filesystems and some moderate-sized
dmcache caches (a few million blocks/100s of GBs).

When the server has an unclean shutdown, dmcache treats all cached blocks
as dirty. This has the effect of extremely slow I/O, as dmcache basically
caches a lot of random I/O, and writing these blocks back to the rotating
disk backing store can take hours. This, I think, is related to the
problem.

When the server is in this condition, then all btrfs filesystems on slow
stores (regardless of whether they use dmcache or not) fail their first
mount attempt(s) like this:

   [  173.243117] BTRFS info (device dm-7): has skinny extents
   [  864.982108] BTRFS error (device dm-7): open_ctree failed

Recent kernels sometimes additionally fail like this (super_total_bytes):

   [  867.721885] BTRFS info (device dm-7): turning on sync discard
   [  867.722341] BTRFS info (device dm-7): disk space caching is enabled
   [  867.722691] BTRFS info (device dm-7): has skinny extents
   [  871.257020] BTRFS error (device dm-7): super_total_bytes 858976681984 mismatch with fs_devices total_rw_bytes 1717953363968
   [  871.257487] BTRFS error (device dm-7): failed to read chunk tree: -22
   [  871.269989] BTRFS error (device dm-7): open_ctree failed

all the filesystems in question are mounted twice during normal boots,
with diferent subvolumes, and systemd parallelises these mounts. This might
play a role in these failures.

Simply trying to mount the filesystems again then (usually) succeeds with
seemingly no issues, so these are spurious mount failures. These repeated
mount attewmpts are also much faster, presumably because a lot of the data
is already in memory.

As far as I am concerned, this is 100% reproducible (i.e. it happens on every
unclean shutdown). It also happens on "old" (4.19 era) filesystems as well as
on filesystems that have never seen anything older than 5.4 kernels.

It does _not_ happen with filesystems on SSDs, regardless of whether they
are mounted multiple times or not. It does happen to all filesystems that
are on rotating disks affected by dm-cache writes, regardless of whether
the filesystem itself uses dmcache or not.

The system in question is currently running 5.6.17, but the same thing
happens with 5.4 and 5.2 kernels, and it might have happened with much
earlier kernels as well, but I didn't have time to report this (as I
secretly hoped newer kernels would fix this, and unclean shutdowns are
rare).

Example btrfs kernel messages for one such unclean boot. This involved
normal boot, followed by unsuccessfull "mount -va" in the emergency shell
(i.e. a second mount fasilure for the same filesystem), followed by a
successfull "mount -va" in the shell.

[  122.856787] BTRFS: device label LOCALVOL devid 1 transid 152865 /dev/mapper/cryptlocalvol scanned by btrfs (727)
[  173.242545] BTRFS info (device dm-7): disk space caching is enabled
[  173.243117] BTRFS info (device dm-7): has skinny extents
[  363.573875] INFO: task mount:1103 blocked for more than 120 seconds.
the above message repeats multiple times, backtrace &c has been removed for clarity
[  484.405875] INFO: task mount:1103 blocked for more than 241 seconds.
[  605.237859] INFO: task mount:1103 blocked for more than 362 seconds.
[  605.252478] INFO: task mount:1211 blocked for more than 120 seconds.
[  726.069900] INFO: task mount:1103 blocked for more than 483 seconds.
[  726.084415] INFO: task mount:1211 blocked for more than 241 seconds.
[  846.901874] INFO: task mount:1103 blocked for more than 604 seconds.
[  846.916431] INFO: task mount:1211 blocked for more than 362 seconds.
[  864.982108] BTRFS error (device dm-7): open_ctree failed
[  867.551400] BTRFS info (device dm-7): turning on sync discard
[  867.551875] BTRFS info (device dm-7): disk space caching is enabled
[  867.552242] BTRFS info (device dm-7): has skinny extents
[  867.565896] BTRFS error (device dm-7): open_ctree failed
[  867.721885] BTRFS info (device dm-7): turning on sync discard
[  867.722341] BTRFS info (device dm-7): disk space caching is enabled
[  867.722691] BTRFS info (device dm-7): has skinny extents
[  871.257020] BTRFS error (device dm-7): super_total_bytes 858976681984 mismatch with fs_devices total_rw_bytes 1717953363968
[  871.257487] BTRFS error (device dm-7): failed to read chunk tree: -22
[  871.269989] BTRFS error (device dm-7): open_ctree failed
[  872.535935] BTRFS info (device dm-7): disk space caching is enabled
[  872.536438] BTRFS info (device dm-7): has skinny extents

Example fstab entries for the mounts above:

/dev/mapper/cryptlocalvol       /localvol       btrfs           defaults,nossd,discard                  0       0
/dev/mapper/cryptlocalvol       /cryptlocalvol  btrfs           defaults,nossd,subvol=/                 0       0

I don't need assistance, I merely write this in the hope of btrfs being
improved by this information.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

^ permalink raw reply	[flat|nested] 11+ messages in thread