all(!) btrfs filesystems stuck + CPU soft lockup after btrfs mv

From: Christoph Anton Mitterer @ 2022-10-23 13:06 UTC
  To: linux-btrfs

Hey.
I just encountered a really weird issue on kernel 6.0.3.

The system's root-fs runs on btrfs and I've attached an external USB
HDD, also with btrfs.
All the filesystems are on top of dm-crypt/LUKS, and all of them
still use space cache v1.

On the latter, I had something like:
 backups/...
 snapshots/_external-fs/heisenberg.scientia.net/.../2022-03-06_1
 snapshots/_external-fs/.../2022-06-26_1
and so on, where 2022-03-06_1 etc. were ro-snapshots (incrementally
created via send|receive).

Now I wanted to move all of heisenberg.scientia.net to backups/, but
the mv gave a Read-only filesystem error for each of the snapshot
directories (I'm pretty sure the fs *was* mounted rw).

So without much thinking I remembered that one couldn't always mv ro-
snapshots because their .. would need to change, so I did the following
instead:
 rmdir backups/
 mv snapshots/_external-fs/ .
 mv _external-fs/ backups

That got stuck already and the kernel printed out some:
 rcu: INFO: rcu_preempt self-detected stall on CPU
 ...
 (see photos)

The mv couldn't be killed (not even with SIGKILL) and, seemingly
forever, every so many seconds I got a speaker beep followed by another
call trace in the kernel log with:
 watchdog: BUG: soft lockup - CPU#2 stuck for xxs! ...

I figured just unplugging the USB HDD might help the kernel to recover
- it did not.

It seemed as if I couldn't even write (or read?) anymore on the
system's root-fs (also btrfs).
The desktop environment and the windows/terminals I had already
opened continued to work (to some extent), and I luckily had one open
already that did dmesg | tail -f ... but anything like opening a new
terminal got stuck.
Even ssh-ing to another system didn't work anymore (tried to copy the
logs away).

So unfortunately, I only have photos of the error messages:
https://drive.google.com/drive/folders/1iC_LVcvXUxuZEvCU33gFXP3nEExngNKY?usp=sharing

I had to hard-power-off the system.

Next I booted it from a rescue USB (though with a slightly older
kernel/btrfs-progs) and ran btrfs check with --mode=original and
--mode=lowmem on the system's root-fs... no error was found there.
Nor did Debian's debsums find any errors. A btrfs scrub is still
running.

Back in the system (thus kernel 6.0.3 again and progs 6.0), a
--mode=original check on the external HDD got:
# btrfs check /dev/mapper/data-b-1 ; echo $?
Opening filesystem to check...
Checking filesystem on /dev/mapper/data-b-1
UUID: fb93f31e-b0e6-4254-bb2a-482d79309725
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
block group 2130334187520 has wrong amount of free space, free space cache has 212041728 block group has 231915520
failed to load free space cache for block group 2130334187520
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 5167160913920 bytes used, no error found
total csum bytes: 5037130492
total tree bytes: 8741093376
total fs tree bytes: 2812231680
total extent tree bytes: 229113856
btree space waste bytes: 1123300531
file data blocks allocated: 8774038773760
 referenced 6790890844160
0

No idea whether the cache corruption is from that incident or just a
coincidence.
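
For scale, the mismatch reported in step [3/7] can be worked out with a
bit of shell arithmetic. This is only a minimal sketch: the two byte
counts are copied from the btrfs check output above, the variable names
are purely illustrative, and the 4 KiB unit is an assumption (the usual
btrfs sector size), not something the output states.

```shell
#!/bin/sh
# Byte counts copied from the "wrong amount of free space" line above.
cache_free=212041728   # free space according to the free space cache
bg_free=231915520      # free space according to the block group

# How far apart the two accountings are.
diff=$((bg_free - cache_free))
echo "difference: $diff bytes"

# Assumption: 4 KiB sectors; show the difference in those units.
echo "4 KiB units: $((diff / 4096)), remainder: $((diff % 4096))"

# Whole MiB (integer division, so this rounds down).
echo "whole MiB: $((diff / 1048576))"
```

So the cache and the block group disagree by a bit under 19 MiB for
that single block group, which by itself says nothing about when the
mismatch was introduced.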

Mounting it shows a filesystem in the following state:

- empty backups/
 which is, if you closely follow my photos, not quite the "original"
 state, as it previously contained a few empty directories
- _external-fs/ back in snapshots/

After that, I tried to do what I wanted originally, that is
send|receive a current backup from the system's root-fs to the external
HDD.

Nearly immediately after I ran the send | receive ... the system got
completely stuck (not even the mouse was moving anymore) and after
a while I heard the beeps again (so presumably a soft CPU lockup again).

I repeated the game with booting from the rescue USB and
original/lowmem fsck, again no errors on the system fs.

I next booted a slightly older 6.0.2, went into systemd rescue mode (so
no desktop environment running) and did the send | receive there.
That finished (seemingly) correctly.

Then I booted 6.0.3 again, also in rescue mode, and tried yet another
send | receive... though I Ctrl-C'ed it after a while... it seemed to
run - at least the system didn't freeze.

So no idea what has happened there.

Any ideas?

Thanks,
Chris.
