* Repeating csum errors on block group with increasing inode id
@ 2022-06-16  6:35 Zachary Bischof
From: Zachary Bischof @ 2022-06-16  6:35 UTC
  To: linux-btrfs

Greetings,

I'm running into some issues when I try to delete or resize a device
in a RAID6 data volume (RAID1C4 metadata). Whenever I run a delete or
resize command against this device, it tries to relocate a particular
block group but hits a csum error. If I re-run the command, it hits a
csum error in the same block group, but with the inode number
increased by one. The csum errors are reported against root -9,
which, based on the btrfs wiki and earlier mailing list threads, I've
figured out is the relocation (reloc) tree.
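
For reference, the commands I've been running look roughly like this
(the device path, target size, and mount point below are
placeholders, not the exact values I used):

  # shrink devid 8, which relocates block groups past the new size
  btrfs filesystem resize 8:-1T /mnt/pool

  # or remove the device from the array outright
  btrfs device delete /dev/sdX /mnt/pool

Here is a sample of the relevant logs from dmesg: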

[  203.248225] BTRFS info (device sdo): resizing devid 8
...
[  809.703041] BTRFS info (device sdo): relocating block group 185056628310016 flags data|raid6
[  833.688667] btrfs_print_data_csum_error: 337 callbacks suppressed
[  833.688669] btrfs_dev_stat_print_on_error: 15 callbacks suppressed
[  833.688681] BTRFS warning (device sdo): csum failed root -9 ino 280 off 9034416128 csum 0x63066980 expected csum 0x5764eda5 mirror 1
[  833.688681] BTRFS error (device sdo): bdev /dev/sdj errs: wr 0, rd 0, flush 0, corrupt 150, gen 0
[  833.688749] BTRFS error (device sdo): bdev /dev/sdj errs: wr 0, rd 0, flush 0, corrupt 151, gen 0
[  833.689485] BTRFS warning (device sdo): csum failed root -9 ino 280 off 9034420224 csum 0x5d9a36af expected csum 0x5b59723b mirror 1
[  833.689497] BTRFS error (device sdo): bdev /dev/sdj errs: wr 0, rd 0, flush 0, corrupt 152, gen 0
...
[ 1387.923903] BTRFS info (device sdo): resizing devid 8
[ 1388.094073] BTRFS info (device sdo): relocating block group 185056628310016 flags data|raid6
[ 1410.117987] btrfs_print_data_csum_error: 337 callbacks suppressed
[ 1410.117996] BTRFS warning (device sdo): csum failed root -9 ino 281 off 9033695232 csum 0x406573b2 expected csum 0x286f7488 mirror 1
[ 1410.118008] btrfs_dev_stat_print_on_error: 15 callbacks suppressed
[ 1410.118011] BTRFS error (device sdo): bdev /dev/sdj errs: wr 0, rd 0, flush 0, corrupt 157, gen 0
[ 1410.118020] BTRFS warning (device sdo): csum failed root -9 ino 281 off 9034416128 csum 0x63066980 expected csum 0x5764eda5 mirror 1
[ 1410.121966] BTRFS error (device sdo): bdev /dev/sdj errs: wr 0, rd 0, flush 0, corrupt 158, gen 0
[ 1410.122623] BTRFS warning (device sdo): csum failed root -9 ino 281 off 9033699328 csum 0x08248cb7 expected csum 0xf7931ed7 mirror 1
[ 1410.125626] BTRFS error (device sdo): bdev /dev/sdj errs: wr 0, rd 0, flush 0, corrupt 159, gen 0

This continued for the next dozen or so consecutive inodes. I've used
inspect-internal to check a few inodes at random; for the inodes that
resolve to files (a number of them resolve to directories), I've run
a diff against copies on a backup and found no differences (though
now I'm wondering whether a corrupted file was simply copied to that
backup as well). As a side note, I did switch to space_cache=v2 after
this volume was created. In a previous mailing list thread about
negative root numbers (-9), Qu suggested running scrub to find the
offending file, but scrub on this device completes without finding
any errors. I'm still running scrub on the other devices (one at a
time) and have yet to find any errors there either. The high
"corrupt" counter in the error messages above is from my repeated,
unsuccessful attempts to see whether any other command would get past
this problematic block group.
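
For completeness, the checks I mentioned above were along these lines
(the inode number is one taken from the dmesg output, and the device
and mount point are placeholders):

  # map an inode number from the warnings back to a path
  btrfs inspect-internal inode-resolve 280 /mnt/pool

  # scrub a single device in the foreground
  btrfs scrub start -B /dev/sdX

  # per-device error counters (the "corrupt" values quoted above)
  btrfs device stats /mnt/pool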

Anyway, I'd appreciate any suggestions on next steps for deleting the
device. Is there a way to clear out the reloc tree (if that even
makes sense)? I was also wondering whether the inode in the error
messages advances to the next one because the previous inode was
repaired when it was read. Would it make sense to keep retrying the
delete/resize until it has worked through all the inodes in this
block group? At this point I've seen dozens of errors like this and
wasn't sure whether I should stop. If that approach does make sense,
is there a way to monitor progress (e.g., a list of all the inodes in
this block group)? Alternatively, should I consider it a lost cause
and reformat?
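
To be concrete about what I mean by repeatedly retrying, the rough
idea would be something like this (the devid, size, and mount point
are placeholders, and I'm assuming the resize command keeps exiting
with an error until relocation of this block group finally succeeds):

  # retry the shrink until a pass completes without error,
  # checking the error counters between attempts
  until btrfs filesystem resize 8:-1T /mnt/pool; do
      btrfs device stats /mnt/pool | grep corrupt
      sleep 10
  done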

Apologies if any of these questions have already been answered
elsewhere.

Cheers,
Zachary
