* USB reset + raid6 = majority of files unreadable
@ 2020-02-25 20:39 Jonathan H
  2020-02-25 23:58 ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan H @ 2020-02-25 20:39 UTC (permalink / raw)
  To: linux-btrfs

Hello everyone,

Previously, I was running an array with six disks all connected via
USB. I am running raid1c3 for metadata and raid6 for data, kernel
5.5.4-arch1-1 and btrfs --version v5.4, and I use bees for
deduplication. Four of the six drives are stored in a single four-bay
enclosure. Due to my oversight, TLER was not enabled for any of the
drives, so when one of them started failing, the enclosure was reset
and all four drives were disconnected.

After rebooting, the file system was still mountable. I saw some
transid errors in dmesg, but I didn't really pay attention to them
because I was trying to get rid of the now failed drive. I tried to
"btrfs replace" the drive with a different one, but the replace
stopped making progress because all reads to the dead drive in a
certain location were failing (even with the "-r" flag). So I tried
mounting degraded without the dead drive and doing "btrfs dev delete
missing" instead. The deletion failed with the following kernel
message:

[  +2.697798] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
[  +0.003381] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
[  +0.002514] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
[  +0.000543] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
[  +0.001170] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
[  +0.001151] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4

I noticed that almost all of the files give an I/O error when read,
and similar kernel messages are generated, but with positive roots. I
also see "read error corrected" messages, but
if I try to read the files again, the exact same messages are
printed again, which seems to suggest that the errors haven't really
been corrected? (But maybe this is intended behavior.)

I also attempted to use "btrfs restore" to recover the files, but
almost all of the files produce "ERROR: zstd decompress failed Unknown
frame descriptor" and the recovery does not succeed.

Since then, I have been scrubbing the file system. The first scrub
produced lots of uncorrectable read errors and several hundred csum
errors. I'm assuming the read errors are due to the missing drive. The
puzzling thing is, the scrub can "complete" (actually, it is aborted
after it completes on all drives but the missing one) and I can delete
all of the files with unrecoverable csum errors, but all of the issues
above persist. I can then turn around and scrub again, and the scrub will
find new csum errors, which seems bizarre to me, since I would have
expected them all to be fixed. However, all transid related errors
have disappeared after the first scrub.

I have also tried deleting the file referenced in the device deletion
error and restarting the deletion. This seems to be working, but
progress has been very slow and I fear I'll have to delete all of the
I/O error-producing files above, which I would like to avoid if
possible.

What should I do in this situation and how can I avoid this in the future?

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-25 20:39 USB reset + raid6 = majority of files unreadable Jonathan H
@ 2020-02-25 23:58 ` Qu Wenruo
  2020-02-26  0:51   ` Steven Fosdick
  2020-02-26  3:38   ` Jonathan H
  0 siblings, 2 replies; 24+ messages in thread
From: Qu Wenruo @ 2020-02-25 23:58 UTC (permalink / raw)
  To: Jonathan H, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4371 bytes --]



On 2020/2/26 4:39 AM, Jonathan H wrote:
> Hello everyone,
> 
> Previously, I was running an array with six disks all connected via
> USB. I am running raid1c3 for metadata and raid6 for data, kernel
> 5.5.4-arch1-1 and btrfs --version v5.4, and I use bees for
> deduplication. Four of the six drives are stored in a single four-bay
> enclosure. Due to my oversight, TLER was not enabled for any of the
> drives, so when one of them started failing, the enclosure was reset
> and all four drives were disconnected.
> 
> After rebooting, the file system was still mountable. I saw some
> transid errors in dmesg,

This means the fs is already corrupted.
If btrfs check is run before mount, it may provide some pretty good
debugging info.
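
For example, something like this against any one member device while the
fs is unmounted (the device name is only a placeholder):

  btrfs check --readonly /dev/sdX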

Also the exact message for the transid error and some context would help
us to determine how serious the corruption is.

> but I didn't really pay attention to them
> because I was trying to get rid of the now failed drive. I tried to
> "btrfs replace" the drive with a different one, but the replace
> stopped making progress because all reads to the dead drive in a
> certain location were failing (even with the "-r" flag). So I tried
> mounting degraded without the dead drive and doing "btrfs dev delete
> missing" instead. The deletion failed with the following kernel
> message:
> 
> [  +2.697798] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
> [  +0.003381] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
> [  +0.002514] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
> [  +0.000543] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
> [  +0.001170] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
> [  +0.001151] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4

This is a different error.
It means the data reloc tree is corrupted.
This somewhat looks like an existing bug, especially when every rebuild
results in the same csum.

> 
> I noticed that almost all of the files give an I/O error when read,
> and similar kernel messages are generated, but with positive roots.

Please give the exact dmesg.
Including all the messages for the same bytenr.

> I
> also see "read error corrected" messages, but
> if I try to read the files again, the exact same messages are
> printed again, which seems to suggest that the errors haven't really
> been corrected? (But maybe this is intended behavior.)
> 
> I also attempted to use "btrfs restore" to recover the files, but
> almost all of the files produce "ERROR: zstd decompress failed Unknown
> frame descriptor" and the recovery does not succeed.
> 
> Since then, I have been scrubbing the file system. The first scrub
> produced lots of uncorrectable read errors and several hundred csum
> errors. I'm assuming the read errors are due to the missing drive. The
> puzzling thing is, the scrub can "complete" (actually, it is aborted
> after it completes on all drives but the missing one) and I can delete
> all of the files with unrecoverable csum errors, but all of the issues
> above persist. I can then turn around and scrub again, and the scrub will
> find new csum errors, which seems bizarre to me, since I would have
> expected them all to be fixed. However, all transid related errors
> have disappeared after the first scrub.
> 
> I have also tried deleting the file referenced in the device deletion
> error and restarting the deletion. This seems to be working, but
> progress has been very slow and I fear I'll have to delete all of the
> I/O error-producing files above, which I would like to avoid if
> possible.
> 
> What should I do in this situation and how can I avoid this in the future?

Although I don't believe the hardware is to blame, you can still try
disabling the write cache on all related devices, as an experiment to
rule out bad disk flush/FUA behavior.
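
One way to do that (device names are placeholders, the setting does not
survive a power cycle, and some USB bridges may not pass the command
through):

  hdparm -W 0 /dev/sdX   # turn off the drive's volatile write cache
  hdparm -W /dev/sdX     # verify the current write-cache setting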

Thanks,
Qu

> 
> Thanks,
> Jonathan
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-25 23:58 ` Qu Wenruo
@ 2020-02-26  0:51   ` Steven Fosdick
  2020-02-26  0:55     ` Qu Wenruo
  2020-02-26  3:38   ` Jonathan H
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Fosdick @ 2020-02-26  0:51 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Jonathan H

On Tue, 25 Feb 2020 at 23:59, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> On 2020/2/26 4:39 AM, Jonathan H wrote:
> > [  +2.697798] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
> > [  +0.003381] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
> > [  +0.002514] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
> > [  +0.000543] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
> > [  +0.001170] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
> > [  +0.001151] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
>
> This is a different error.
> It means the data reloc tree is corrupted.
> This somewhat looks like an existing bug, especially when every rebuild
> results in the same csum.

This looks remarkably similar to the errors I am getting that prevent
removing a failed device from the array, except in my case only the
found value is the same across the whole set of messages; the expected
value varies.  Do you know if/when this bug was/will be corrected?
Can the reloc tree be repaired/rebuilt?

Here are the messages I am getting:

Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494745088 csum 0x8941f998 expected csum
0x99726972 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494749184 csum 0x8941f998 expected csum
0x4c946d24 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494753280 csum 0x8941f998 expected csum
0x3cacfa54 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494757376 csum 0x8941f998 expected csum
0x453f4f60 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494761472 csum 0x8941f998 expected csum
0x5630f6fa mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494765568 csum 0x8941f998 expected csum
0xbf215c7a mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494769664 csum 0x8941f998 expected csum
0x242df5b3 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494773760 csum 0x8941f998 expected csum
0x84d8643c mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494777856 csum 0x8941f998 expected csum
0xcd4799e3 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494781952 csum 0x8941f998 expected csum
0x84e72065 mirror 2


Regards,
Steve.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26  0:51   ` Steven Fosdick
@ 2020-02-26  0:55     ` Qu Wenruo
  0 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2020-02-26  0:55 UTC (permalink / raw)
  To: Steven Fosdick, linux-btrfs; +Cc: Jonathan H


[-- Attachment #1.1: Type: text/plain, Size: 3557 bytes --]



On 2020/2/26 8:51 AM, Steven Fosdick wrote:
> On Tue, 25 Feb 2020 at 23:59, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> On 2020/2/26 4:39 AM, Jonathan H wrote:
>>> [  +2.697798] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
>>> [  +0.003381] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
>>> [  +0.002514] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
>>> [  +0.000543] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
>>> [  +0.001170] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
>>> [  +0.001151] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
>>
>> This is a different error.
>> It means the data reloc tree is corrupted.
>> This somewhat looks like an existing bug, especially when every rebuild
>> results in the same csum.
> 
> This looks remarkably similar to the errors I am getting that prevent
> removing a failed device from the array, except in my case only the
> found value is the same across the whole set of messages; the expected
> value varies.  Do you know if/when this bug was/will be corrected?
> Can the reloc tree be repaired/rebuilt?

The data reloc tree is just a temporary tree used during relocation.

Since the data failed to pass csum verification, relocation will abort
and clean up the data reloc tree.

> 
> Here are the messages I am getting:
> 
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494745088 csum 0x8941f998 expected csum
> 0x99726972 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494749184 csum 0x8941f998 expected csum
> 0x4c946d24 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494753280 csum 0x8941f998 expected csum
> 0x3cacfa54 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494757376 csum 0x8941f998 expected csum
> 0x453f4f60 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494761472 csum 0x8941f998 expected csum
> 0x5630f6fa mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494765568 csum 0x8941f998 expected csum
> 0xbf215c7a mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494769664 csum 0x8941f998 expected csum
> 0x242df5b3 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494773760 csum 0x8941f998 expected csum
> 0x84d8643c mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494777856 csum 0x8941f998 expected csum
> 0xcd4799e3 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494781952 csum 0x8941f998 expected csum
> 0x84e72065 mirror 2

Since the root is still -9, it's still the data reloc tree, so these
messages don't tell us much on their own.

BTW, full dmesg and btrfs check output are appreciated.

Thanks,
Qu
> 
> 
> Regards,
> Steve.
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-25 23:58 ` Qu Wenruo
  2020-02-26  0:51   ` Steven Fosdick
@ 2020-02-26  3:38   ` Jonathan H
  2020-02-26  3:44     ` Jonathan H
  2020-02-26  4:37     ` Qu Wenruo
  1 sibling, 2 replies; 24+ messages in thread
From: Jonathan H @ 2020-02-26  3:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 25, 2020 at 3:58 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/2/26 4:39 AM, Jonathan H wrote:
> > Hello everyone,
> >
> > Previously, I was running an array with six disks all connected via
> > USB. I am running raid1c3 for metadata and raid6 for data, kernel
> > 5.5.4-arch1-1 and btrfs --version v5.4, and I use bees for
> > deduplication. Four of the six drives are stored in a single four-bay
> > enclosure. Due to my oversight, TLER was not enabled for any of the
> > drives, so when one of them started failing, the enclosure was reset
> > and all four drives were disconnected.
> >
> > After rebooting, the file system was still mountable. I saw some
> > transid errors in dmesg,
>
> This means the fs is already corrupted.
> If btrfs check is run before mount, it may provide some pretty good
> debugging

Here is the output from "btrfs check --check-data-csum": http://ix.io/2cH7

> Also the exact message for the transid error and some context would help
> us to determine how serious the corruption is.

Unfortunately, I can't reproduce the transid errors, since they seem
to have been fixed. However, I do recall that the wanted and found
generation numbers differed by less than a hundred?

> > I noticed that almost all of the files give an I/O error when read,
> > and similar kernel messages are generated, but with positive roots.
>
> Please give the exact dmesg.
> Including all the messages for the same bytenr.

Here is the dmesg: http://ix.io/2cH7

> Although I don't believe the hardware is to blame, you can still try
> disabling the write cache on all related devices, as an experiment to
> rule out bad disk flush/FUA behavior.

I've disabled the write caches for now.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26  3:38   ` Jonathan H
@ 2020-02-26  3:44     ` Jonathan H
  2020-02-26  4:37     ` Qu Wenruo
  1 sibling, 0 replies; 24+ messages in thread
From: Jonathan H @ 2020-02-26  3:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 25, 2020 at 7:38 PM Jonathan H <pythonnut@gmail.com> wrote:

> Here is the output from "btrfs check --check-data-csum": http://ix.io/2cH7

Whoops, it's actually at http://ix.io/2cGZ

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26  3:38   ` Jonathan H
  2020-02-26  3:44     ` Jonathan H
@ 2020-02-26  4:37     ` Qu Wenruo
  2020-02-26 16:45       ` Jonathan H
  1 sibling, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2020-02-26  4:37 UTC (permalink / raw)
  To: Jonathan H; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3211 bytes --]



On 2020/2/26 11:38 AM, Jonathan H wrote:
> On Tue, Feb 25, 2020 at 3:58 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/2/26 4:39 AM, Jonathan H wrote:
>>> Hello everyone,
>>>
>>> Previously, I was running an array with six disks all connected via
>>> USB. I am running raid1c3 for metadata and raid6 for data, kernel
>>> 5.5.4-arch1-1 and btrfs --version v5.4, and I use bees for
>>> deduplication. Four of the six drives are stored in a single four-bay
>>> enclosure. Due to my oversight, TLER was not enabled for any of the
>>> drives, so when one of them started failing, the enclosure was reset
>>> and all four drives were disconnected.
>>>
>>> After rebooting, the file system was still mountable. I saw some
>>> transid errors in dmesg,
>>
>> This means the fs is already corrupted.
>> If btrfs check is run before mount, it may provide some pretty good
>> debugging
> 
> Here is the output from "btrfs check --check-data-csum": http://ix.io/2cH7

It's great that your metadata is safe.

> 
>> Also the exact message for the transid error and some context would help
>> us to determine how serious the corruption is.
> 
> Unfortunately, I can't reproduce the transid errors, since they seem
> to have been fixed. However, I do recall that the wanted and found
> generation numbers differed by less than a hundred?

So your previous transid errors may have come from a bad copy on your
failing disk, and btrfs found its way to the next good copy.

The biggest concern is no longer a concern now.

> 
>>> I noticed that almost all of the files give an I/O error when read,
>>> and similar kernel messages are generated, but with positive roots.
>>
>> Please give the exact dmesg.
>> Including all the messages for the same bytenr.
> 
> Here is the dmesg: http://ix.io/2cH7

More context would be welcomed.

Anyway, even with more context, it may still lack the needed info, as
such csum failure messages are rate limited.

The mirror num 2 means the first rebuild attempt failed.

Since only the first rebuild attempt failed, and there are some corrected
data reads, it looks like btrfs can still rebuild the data.

Since you have already observed some EIO, it looks like a write hole is
involved, screwing up the rebuild process.
But it's still very strange, as I'm expecting mirror numbers other
than 2.
For your 6 disks with 1 bad disk, there are still 5 ways to rebuild the
data, so seeing only mirror num 2 doesn't look correct to me.

> 
>> Although I don't believe the hardware is to blame, you can still try
>> disabling the write cache on all related devices, as an experiment to
>> rule out bad disk flush/FUA behavior.
> 
> I've disabled the write caches for now.
> 
BTW, since your free space cache is already corrupted, it's recommended
to clear the space cache.
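
For example, either a one-time mount option or, with the fs unmounted,
btrfs check (device, mount point and the v1 cache version are placeholders
for your setup):

  mount -o clear_cache /dev/sdX /mnt
  # or, on the unmounted filesystem:
  btrfs check --clear-space-cache v1 /dev/sdX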

For now, since it looks like write hole is involved, the only way to
solve the problem is to remove all offending files (including a super
large file in root 5).

You can use `btrfs inspect logical-resolve <bytenr> <mnt>` to see all
the involved files.

The <bytenr> values are the bytenrs shown in the btrfs check
--check-data-csum output.
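
For example (the bytenr and mount point are placeholders):

  # run once for each bytenr reported by btrfs check --check-data-csum
  btrfs inspect-internal logical-resolve 123456789012 /mnt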

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26  4:37     ` Qu Wenruo
@ 2020-02-26 16:45       ` Jonathan H
  2020-02-26 22:36         ` Steven Fosdick
                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Jonathan H @ 2020-02-26 16:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 25, 2020 at 8:37 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> It's great that your metadata is safe.
>
> The biggest concern is no longer a concern now.

Glad to hear.

> More context would be welcomed.

Here's a string of uncorrectable errors detected by the scrub: http://ix.io/2cJM

Here is another attempt to read a file giving an I/O error: http://ix.io/2cJS
The last two lines are produced when trying to read the file a second time.

Here's the state of the currently running scrub: http://ix.io/2cJU
I had to cancel and resume the scrub to run `btrfs check` earlier, but
otherwise it has been uninterrupted.

> Anyway, even with more context, it may still lack the needed info, as
> such csum failure messages are rate limited.
>
> The mirror num 2 means the first rebuild attempt failed.
>
> Since only the first rebuild attempt failed, and there are some corrected
> data reads, it looks like btrfs can still rebuild the data.
>
> Since you have already observed some EIO, it looks like a write hole is
> involved, screwing up the rebuild process.
> But it's still very strange, as I'm expecting mirror numbers other
> than 2.
> For your 6 disks with 1 bad disk, there are still 5 ways to rebuild the
> data, so seeing only mirror num 2 doesn't look correct to me.

I'm sort of curious why so many files have been affected. It seems
like most of the file system has become unreadable, but I was under
the impression that if the write hole occurred it would at least not
damage too much data at once. Is that incorrect?

> BTW, since your free space cache is already corrupted, it's recommended
> to clear the space cache.

It's strange to me that the free space cache is giving an error, since
I cleared it previously and the most recent unmount was clean.

> For now, since it looks like write hole is involved, the only way to
> solve the problem is to remove all offending files (including a super
> large file in root 5).
>
> You can use `btrfs inspect logical-resolve <bytenr> <mnt>` to see all
> the involved files.
>
> The <bytenr> values are the bytenrs shown in the btrfs check
> --check-data-csum output.

The strange thing is that when I run `btrfs inspect logical-resolve` on
all of the bytenrs mentioned in the check output, all of the corruption
resolves to the same file (see http://ix.io/2cJP), but this does
not seem consistent with the uncorrectable csum errors the scrub is
detecting.

I've been calculating the offsets of the files mentioned in the
relocation csum errors (by adding the block group and offset),
resolving the files with `btrfs inspect logical-resolve` and deleting
them. But it seems like the set of files I'm deleting is also totally
unrelated to the set of files the scrub is detecting errors in. Given
the frequency of relocation errors, I fear I will need to delete
almost everything on the file system for the deletion to complete. I
can't tell if I should expect these errors to be fixable since the
relocation isn't making any attempt to correct them as far as I can
tell.
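
For reference, what I'm doing looks roughly like this (the numbers and
mount point below are placeholders, not values from my logs):

  bg=10000000000000   # block group start from the "relocating block group" line
  off=1234567890      # "off" value from the "csum failed root -9" warning
  btrfs inspect-internal logical-resolve $((bg + off)) /mnt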

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 16:45       ` Jonathan H
@ 2020-02-26 22:36         ` Steven Fosdick
  2020-02-26 23:14           ` Steven Fosdick
  2020-02-27  0:39           ` Chris Murphy
  2020-03-03 15:42         ` Jonathan H
  2020-03-11  4:41         ` Zygo Blaxell
  2 siblings, 2 replies; 24+ messages in thread
From: Steven Fosdick @ 2020-02-26 22:36 UTC (permalink / raw)
  To: Jonathan H; +Cc: Qu Wenruo, linux-btrfs

Ok, so to expand on my previous message, I have what was a 3-drive
filesystem with RAID1 metadata and RAID5 data.  One drive failed so I
mounted degraded, added a replacement and tried to remove the missing
(failed) drive.  It won't remove - the remove aborts with an I/O error
after checksum errors have been logged as reported in my last e-mail.

I have run a btrfs check on the filesystem and this gives the following output:

WARNING: filesystem mounted, continuing because of --force
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sda
UUID: a3d38933-ee90-4b84-8f24-3a5c36dfd9be
found 9834224820224 bytes used, no error found
total csum bytes: 9588337304
total tree bytes: 13656375296
total fs tree bytes: 2760966144
total extent tree bytes: 388759552
btree space waste bytes: 1321640764
file data blocks allocated: 9820591190016
 referenced 9820501786624

The filesystem was mounted r/o to avoid any changes upsetting the
check.  I have now started a scrub to see what that finds, but the ETA
is Sat Feb 29 07:57:49 2020, so I will report back then.

Regarding kernel messages I have found a few of these in the log
starting before the disc failure:

Sep 27 15:16:08 meije kernel: INFO: task kworker/u128:1:18864 blocked
for more than 122 seconds.
Sep 27 15:16:08 meije kernel:       Not tainted 5.1.10-arch1-1-ARCH #1
Sep 27 15:16:08 meije kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 27 15:16:08 meije kernel: kworker/u128:1  D    0 18864      2 0x80000080
Sep 27 15:16:08 meije kernel: Workqueue: events_unbound
btrfs_async_reclaim_metadata_space [btrfs]
Sep 27 15:16:08 meije kernel: Call Trace:
Sep 27 15:16:08 meije kernel:  ? __schedule+0x26c/0x8c0
Sep 27 15:16:08 meije kernel:  ? preempt_count_add+0x79/0xb0
Sep 27 15:16:08 meije kernel:  schedule+0x3c/0x80
Sep 27 15:16:08 meije kernel:  wait_for_commit+0x53/0x80 [btrfs]
Sep 27 15:16:08 meije kernel:  ? wait_woken+0x70/0x70
Sep 27 15:16:08 meije kernel:  btrfs_commit_transaction+0x145/0x930 [btrfs]
Sep 27 15:16:08 meije kernel:  ? _raw_spin_lock_irqsave+0x26/0x50
Sep 27 15:16:08 meije kernel:  ? _raw_spin_unlock_irqrestore+0x20/0x40
Sep 27 15:16:08 meije kernel:  ? __percpu_counter_sum+0x52/0x60
Sep 27 15:16:08 meije kernel:  flush_space+0x163/0x6a0 [btrfs]
Sep 27 15:16:08 meije kernel:  ? __switch_to_asm+0x41/0x70
Sep 27 15:16:08 meije kernel:  ? __switch_to_asm+0x35/0x70
Sep 27 15:16:08 meije kernel:  ? __switch_to_asm+0x41/0x70
Sep 27 15:16:08 meije kernel:
btrfs_async_reclaim_metadata_space+0xc4/0x4a0 [btrfs]
Sep 27 15:16:08 meije kernel:  ? __schedule+0x274/0x8c0
Sep 27 15:16:08 meije kernel:  process_one_work+0x1d1/0x3e0
Sep 27 15:16:08 meije kernel:  worker_thread+0x4a/0x3d0
Sep 27 15:16:08 meije kernel:  kthread+0xfb/0x130
Sep 27 15:16:08 meije kernel:  ? process_one_work+0x3e0/0x3e0
Sep 27 15:16:08 meije kernel:  ? kthread_park+0x90/0x90
Sep 27 15:16:08 meije kernel:  ret_from_fork+0x35/0x40

but I think these may have nothing to do with it - they may be another
filesystem (root) and the timeout may be because that is a USB stick
which is rather slow.  My reason for thinking that is that the process
that gave rise to the timeout appears to be pacman, the Arch package
manager, which primarily writes to the root filesystem.

It looks like the disc started to fail here:

Jan 30 13:41:04 meije kernel: scsi_io_completion_action: 806 callbacks
suppressed
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#18 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#18 CDB: Read(16)
88 00 00 00 00 00 a2 d3 3b 00 00 00 00 40 00 00
Jan 30 13:41:04 meije kernel: print_req_error: 806 callbacks suppressed
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752192 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#15 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#15 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 00 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752192 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: btrfs_dev_stat_print_on_error: 732
callbacks suppressed
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 743, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#19 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#19 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 08 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752200 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#21 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#21 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 18 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752216 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#20 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#20 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 10 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752208 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 744, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 745, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#22 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#22 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 20 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752224 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 746, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#23 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#23 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 28 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752232 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 747, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 748, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#21 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#21 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 30 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752240 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#22 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#22 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 38 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752248 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 749, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 750, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:42:37 meije rwhod[451]: sending on interface eno1
...

This goes on for pages and quite a few days, I can extract more if it
is of interest.  Next is a reboot:

Feb 10 16:19:50 meije kernel: sd 3:0:0:0: [sde] tag#0 CDB: Write(16)
8a 00 00 00 00 02 18 c2 76 00 00 00 00 80 00 00
Feb 10 16:19:50 meije kernel: blk_update_request: I/O error, dev sde,
sector 9005331968 op 0x1:(WRITE) flags 0x0 phys_seg 16 prio class 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857068, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857069, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857070, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857071, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857072, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857073, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857074, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857075, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857076, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857077, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS warning (device sda): lost page
write due to IO error on /dev/sde
Feb 10 16:19:50 meije kernel: BTRFS warning (device sda): lost page
write due to IO error on /dev/sde
Feb 10 16:19:50 meije kernel: BTRFS warning (device sda): lost page
write due to IO error on /dev/sde
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): error writing
primary super block to device 3
Feb 10 16:19:50 meije systemd[1]: data.mount: Succeeded.
Feb 10 16:19:50 meije systemd[1]: Unmounted /data.
Feb 10 16:19:50 meije systemd[1]: Stopped target Local File Systems (Pre).

then on the way back up:

Feb 10 16:52:44 meije kernel: Btrfs loaded, crc32c=crc32c-intel
Feb 10 16:52:44 meije kernel: BTRFS: device fsid
f8ffa6cb-ab56-466a-99f1-438e705609c6 devid 1 transid 2007 /dev/sdb
scanned by systemd-udevd (199)
Feb 10 16:52:44 meije kernel: usb 3-1: New USB device found,
idVendor=051d, idProduct=0002, bcdDevice= 1.06
Feb 10 16:52:44 meije kernel: usb 3-1: New USB device strings: Mfr=3,
Product=1, SerialNumber=2
Feb 10 16:52:44 meije kernel: usb 3-1: Product: Back-UPS ES 550
FW:828.D3 .I USB FW:D3
Feb 10 16:52:44 meije kernel: usb 3-1: Manufacturer: APC
Feb 10 16:52:44 meije kernel: usb 3-1: SerialNumber: 3B0918X18511
Feb 10 16:52:44 meije kernel: BTRFS: device label data devid 2 transid
389104 /dev/sdc scanned by systemd-udevd (191)
Feb 10 16:52:44 meije kernel: hid: raw HID events driver (C) Jiri Kosina
Feb 10 16:52:44 meije kernel: BTRFS: device fsid
aae38dd1-9303-4169-97b1-35a7fc951e02 devid 1 transid 2954881 /dev/sdd1
scanned by systemd-udevd (199)
Feb 10 16:52:44 meije kernel: usbcore: registered new interface driver usbhid
Feb 10 16:52:44 meije kernel: usbhid: USB HID core driver
Feb 10 16:52:44 meije kernel: hid-generic 0003:051D:0002.0001:
hiddev0,hidraw0: USB HID v1.10 Device [APC Back-UPS ES 550 FW:828.D3
.I USB FW:D3 ] on usb-0000:0>
Feb 10 16:52:44 meije kernel: ata4: SATA link up 3.0 Gbps (SStatus 123
SControl 300)
Feb 10 16:52:44 meije kernel: ata4.00: failed to IDENTIFY
(INIT_DEV_PARAMS failed, err_mask=0x80)
Feb 10 16:52:44 meije kernel: BTRFS info (device sdd1): disk space
caching is enabled
Feb 10 16:52:44 meije kernel: BTRFS info (device sdd1): has skinny extents
...
Feb 10 17:01:46 meije kernel: BTRFS info (device sda): disk space
caching is enabled
Feb 10 17:01:46 meije kernel: BTRFS info (device sda): has skinny extents
Feb 10 17:01:46 meije kernel: BTRFS error (device sda): devid 3 uuid
443cd484-9ae5-49cf-b630-3f886f523302 is missing
Feb 10 17:01:46 meije kernel: BTRFS error (device sda): failed to read
the system array: -2
Feb 10 17:01:46 meije kernel: BTRFS error (device sda): open_ctree failed

then after mounting degraded, add a new device and attempt to remove
the missing one:

Feb 10 19:38:36 meije kernel: BTRFS info (device sda): disk added /dev/sdb
Feb 10 19:39:18 meije kernel: BTRFS info (device sda): relocating
block group 10045992468480 flags data|raid5
Feb 10 19:39:27 meije kernel: BTRFS info (device sda): found 19 extents
Feb 10 19:39:34 meije kernel: BTRFS info (device sda): found 19 extents
Feb 10 19:39:39 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:39:39 meije kernel: BTRFS info (device sda): relocating
block group 10043844984832 flags data|raid5
Feb 10 19:39:41 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:39:52 meije kernel: BTRFS info (device sda): found 38 extents
Feb 10 19:39:57 meije kernel: BTRFS info (device sda): found 38 extents
Feb 10 19:40:01 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:40:01 meije kernel: BTRFS info (device sda): relocating
block group 10041697501184 flags data|raid5
Feb 10 19:40:21 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:40:32 meije kernel: BTRFS info (device sda): found 155 extents
Feb 10 19:40:37 meije kernel: BTRFS info (device sda): found 155 extents
Feb 10 19:40:42 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:40:42 meije kernel: BTRFS info (device sda): relocating
block group 10039550017536 flags data|raid5
Feb 10 19:41:01 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:41:13 meije kernel: BTRFS info (device sda): found 166 extents
Feb 10 19:41:18 meije kernel: BTRFS info (device sda): found 166 extents
Feb 10 19:41:23 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:41:23 meije kernel: BTRFS info (device sda): relocating
block group 10037402533888 flags data|raid5
Feb 10 19:41:45 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:41:59 meije kernel: BTRFS info (device sda): found 518 extents
Feb 10 19:42:04 meije kernel: BTRFS info (device sda): found 518 extents
Feb 10 19:42:09 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:42:09 meije kernel: BTRFS info (device sda): relocating
block group 10035255050240 flags data|raid5
Feb 10 19:42:21 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:42:40 meije kernel: BTRFS info (device sda): found 46 extents
Feb 10 19:42:45 meije kernel: BTRFS info (device sda): found 46 extents
Feb 10 19:42:47 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:42:47 meije kernel: BTRFS info (device sda): relocating
block group 10033107566592 flags data|raid5
Feb 10 19:42:55 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:43:18 meije kernel: BTRFS info (device sda): found 45 extents
Feb 10 19:43:21 meije kernel: BTRFS info (device sda): found 45 extents
Feb 10 19:43:26 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:43:26 meije kernel: BTRFS info (device sda): relocating
block group 10030960082944 flags data|raid5
Feb 10 19:43:38 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:43:58 meije kernel: BTRFS info (device sda): found 66 extents
Feb 10 19:44:00 meije rwhod[422]: sending on interface eno1
Feb 10 19:44:02 meije kernel: BTRFS info (device sda): found 66 extents
Feb 10 19:44:06 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:44:06 meije kernel: BTRFS info (device sda): relocating
block group 10028812599296 flags data|raid5
Feb 10 19:44:21 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:44:36 meije kernel: BTRFS info (device sda): found 64 extents
Feb 10 19:44:41 meije kernel: BTRFS info (device sda): found 64 extents
Feb 10 19:44:44 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:44:44 meije kernel: BTRFS info (device sda): relocating
block group 10026665115648 flags data|raid5
Feb 10 19:44:55 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:45:13 meije kernel: BTRFS info (device sda): found 51 extents
Feb 10 19:45:17 meije kernel: BTRFS info (device sda): found 51 extents
Feb 10 19:45:22 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:45:22 meije kernel: BTRFS info (device sda): relocating
block group 10024517632000 flags data|raid5
Feb 10 19:45:36 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:45:52 meije kernel: BTRFS info (device sda): found 46 extents
Feb 10 19:45:56 meije kernel: BTRFS info (device sda): found 46 extents
Feb 10 19:46:00 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:46:00 meije kernel: BTRFS info (device sda): relocating
block group 10022370148352 flags data|raid5
Feb 10 19:46:17 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:46:31 meije kernel: BTRFS info (device sda): found 49 extents
Feb 10 19:46:36 meije kernel: BTRFS info (device sda): found 49 extents
Feb 10 19:46:41 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:46:41 meije kernel: BTRFS info (device sda): relocating
block group 10020222664704 flags data|raid5
Feb 10 19:47:00 meije rwhod[422]: sending on interface eno1
Feb 10 19:47:00 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:47:14 meije kernel: BTRFS info (device sda): found 51 extents
Feb 10 19:47:19 meije kernel: BTRFS info (device sda): found 51 extents
Feb 10 19:47:24 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:47:24 meije kernel: BTRFS info (device sda): relocating
block group 10018075181056 flags data|raid5
Feb 10 19:47:44 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:48:04 meije kernel: BTRFS info (device sda): found 1060 extents
Feb 10 19:48:11 meije kernel: BTRFS info (device sda): found 1060 extents
Feb 10 19:48:22 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:48:25 meije kernel: BTRFS info (device sda): relocating
block group 10015927697408 flags data|raid5
Feb 10 19:48:44 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494749184 csum 0x8941f998 expected csum
0x4c946d24 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494753280 csum 0x8941f998 expected csum
0x3cacfa54 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494757376 csum 0x8941f998 expected csum
0x453f4f60 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494761472 csum 0x8941f998 expected csum
0x5630f6fa mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494765568 csum 0x8941f998 expected csum
0xbf215c7a mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494769664 csum 0x8941f998 expected csum
0x242df5b3 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494773760 csum 0x8941f998 expected csum
0x84d8643c mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494777856 csum 0x8941f998 expected csum
0xcd4799e3 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494781952 csum 0x8941f998 expected csum
0x84e72065 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494786048 csum 0x8941f998 expected csum
0xa1a55d97 mirror 2

and at that point the device remove aborted with an I/O error.

I did discover I could use balance with a filter to balance much of
the data onto the three working discs, away from the missing one but I also
discovered that whenever the checksum error appears the space cache
seems to get corrupted.  Any further balance attempt results in
getting stuck in a loop.  Mounting with clear_cache resolves that.

Regards.
Steve.

On Wed, 26 Feb 2020 at 16:49, Jonathan H <pythonnut@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 8:37 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > It's great that your metadata is safe.
> >
> > The biggest concern is no longer a concern now.
>
> Glad to hear.
>
> > More context would be welcomed.
>
> Here's a string of uncorrectable errors detected by the scrub: http://ix.io/2cJM
>
> Here is another attempt to read a file giving an I/O error: http://ix.io/2cJS
> The last two lines are produced when trying to read the file a second time.
>
> Here's the state of the currently running scrub: http://ix.io/2cJU
> I had to cancel and resume the scrub to run `btrfs check` earlier, but
> otherwise it has been uninterrupted.
>
> > Anyway, even with more context, it may still lack the needed info, as
> > such csum failure messages are rate limited.
> >
> > The mirror num 2 means the first rebuild attempt failed.
> >
> > Since only the first rebuild attempt failed, and there are some corrected
> > data reads, it looks like btrfs can still rebuild the data.
> >
> > Since you have already observed some EIO, it looks like a write hole is
> > involved, screwing up the rebuild process.
> > But it's still very strange, as I'm expecting mirror numbers other
> > than 2.
> > For your 6 disks with 1 bad disk, there are still 5 ways to rebuild the
> > data, so seeing only mirror num 2 doesn't look correct to me.
>
> I'm sort of curious why so many files have been affected. It seems
> like most of the file system has become unreadable, but I was under
> the impression that if the write hole occurred it would at least not
> damage too much data at once. Is that incorrect?
>
> > BTW, since your free space cache is already corrupted, it's recommended
> > to clear the space cache.
>
> It's strange to me that the free space cache is giving an error, since
> I cleared it previously and the most recent unmount was clean.
>
> > For now, since it looks like write hole is involved, the only way to
> > solve the problem is to remove all offending files (including a super
> > large file in root 5).
> >
> > You can use `btrfs inspect logical-resolve <bytenr> <mnt>` to see all
> > the involved files.
> >
> > The <bytenr> values are the bytenrs shown in the btrfs check
> > --check-data-csum output.
>
> The strange thing is that when I run `btrfs inspect logical-resolve` on
> all of the bytenrs mentioned in the check output, all of the corruption
> resolves to the same file (see http://ix.io/2cJP), but this does
> not seem consistent with the uncorrectable csum errors the scrub is
> detecting.
>
> I've been calculating the offsets of the files mentioned in the
> relocation csum errors (by adding the block group and offset),
> resolving the files with `btrfs inspect logical-resolve` and deleting
> them. But it seems like the set of files I'm deleting is also totally
> unrelated to the set of files the scrub is detecting errors in. Given
> the frequency of relocation errors, I fear I will need to delete
> almost everything on the file system for the deletion to complete. I
> can't tell if I should expect these errors to be fixable since the
> relocation isn't making any attempt to correct them as far as I can
> tell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 22:36         ` Steven Fosdick
@ 2020-02-26 23:14           ` Steven Fosdick
  2020-02-27  0:42             ` Chris Murphy
  2020-02-27  0:39           ` Chris Murphy
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Fosdick @ 2020-02-26 23:14 UTC (permalink / raw)
  To: Jonathan H; +Cc: Qu Wenruo, linux-btrfs

Ok so the last message wasn't so easy to read due to line wrap so here
it is again with the log output replaced by ix.io links.

To expand on my previous message, I have what was a 3-drive
filesystem with RAID1 metadata and RAID5 data.  One drive failed so I
mounted degraded, added a replacement and tried to remove the missing
(failed) drive.  It won't remove - the remove aborts with an I/O error
after checksum errors have been logged as reported in my last e-mail.

I have run a btrfs check on the filesystem and this gives the following output:

WARNING: filesystem mounted, continuing because of --force
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sda
UUID: a3d38933-ee90-4b84-8f24-3a5c36dfd9be
found 9834224820224 bytes used, no error found
total csum bytes: 9588337304
total tree bytes: 13656375296
total fs tree bytes: 2760966144
total extent tree bytes: 388759552
btree space waste bytes: 1321640764
file data blocks allocated: 9820591190016
 referenced 9820501786624

The filesystem was mounted r/o to avoid any changes upsetting the
check.  I have now started a scrub to see what that finds, but the ETA
is Sat Feb 29 07:57:49 2020, so I will report back then.

Regarding kernel messages I have found a few of these in the log
starting before the disc failure:

http://ix.io/2cLX

but I think these may have nothing to do with it - they may be another
filesystem (root) and the timeout may be because that is a USB stick
which is rather slow.  My reason for thinking that is that the process
that gave rise to the timeout appears to be pacman, the Arch package
manager, which primarily writes to the root filesystem.

It looks like the disc started to fail here:

http://ix.io/2cM1

This goes on for pages and quite a few days, I can extract more if it
is of interest.  Next is a reboot - this is the shutdown part:

http://ix.io/2cM2

then on the way back up:

http://ix.io/2cM3

then after mounting degraded, add a new device and attempt to remove
the missing one:

http://ix.io/2cM4

and at that point the device remove aborted with an I/O error.

I did discover I could use balance with a filter to balance much of
the data onto the three working discs, away from the missing one but I also
discovered that whenever the checksum error appears the space cache
seems to get corrupted.  Any further balance attempt results in
getting stuck in a loop.  Mounting with clear_cache resolves that.

Regards.
Steve.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 22:36         ` Steven Fosdick
  2020-02-26 23:14           ` Steven Fosdick
@ 2020-02-27  0:39           ` Chris Murphy
  2020-02-27  0:47             ` Chris Murphy
  2020-02-27 12:20             ` Steven Fosdick
  1 sibling, 2 replies; 24+ messages in thread
From: Chris Murphy @ 2020-02-27  0:39 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Jonathan H, Qu Wenruo, Btrfs BTRFS

On Wed, Feb 26, 2020 at 3:37 PM Steven Fosdick <stevenfosdick@gmail.com> wrote:

> It looks like the disc started to fail here:
>
> Jan 30 13:41:04 meije kernel: scsi_io_completion_action: 806 callbacks
> suppressed
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#18 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#18 CDB: Read(16)
> 88 00 00 00 00 00 a2 d3 3b 00 00 00 00 40 00 00
> Jan 30 13:41:04 meije kernel: print_req_error: 806 callbacks suppressed
> Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
> sector 2731752192 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#15 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#15 CDB: Write(16)
> 8a 00 00 00 00 00 a2 d3 3b 00 00 00 00 08 00 00
> Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
> sector 2731752192 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
> Jan 30 13:41:04 meije kernel: btrfs_dev_stat_print_on_error: 732
> callbacks suppressed

Both read and write errors reported by the hardware. These aren't the
typical UNC error though. I'm not sure what DID_BAD_TARGET means. Some
errors might be suppressed.

Write errors are generally fatal. Read errors, if they include sector
LBA, Btrfs can fix if there's an extra copy (dup, raid1, raid56, etc),
otherwise, it may or may not be fatal depending on what's missing, and
what's affected by it being missing.

Btrfs might survive the write errors though with metadata raid1c3. But
later you get more concerning messages...


> This goes on for pages and quite a few days, I can extract more if it
> is of interest.

Ahhh yeah. So for what it's worth, in an md driver backed world, this
drive would have been ejected (faulty) upon the first write error. md
retries reads, but on write errors it pretty much considers the drive
written off, which means the array is degraded.

As btrfs doesn't have such a concept of faulty drives, ejected drives,
yet; you kinda have to keep an eye on this, and setup monitoring so
you know when the array goes degraded like this.
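
One simple way to do that (the mount point is a placeholder) is to poll
the per-device error counters, e.g. from cron:

  btrfs device stats /mnt           # write/read/flush/corruption/generation counters
  btrfs device stats --check /mnt   # exits non-zero if any counter is non-zero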

It's vaguely possible to get the array into a kind of split brain
situation, if two drives experience transient write errors. And in
that case, right now, there's no recovery. Btrfs just gets too
confused.

You need to replace the bad drive, and do a scrub to fix things up.
And double check with 'btrfs fi us /mountpoint/' that all block groups
have one profile set, and that it's the correct one.
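
Roughly, with placeholder names (use the devid of the dead drive if
it's no longer present):

    btrfs replace start <devid-or-/dev/bad> /dev/sdX /mountpoint
    btrfs scrub start -Bd /mountpoint
    btrfs filesystem usage /mountpoint    # each chunk type should show a single profile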


> then after mounting degraded, add a new device and attempt to remove
> the missing one:

That's not a good idea in my opinion... you really need to replace the
drive. Otherwise you're effectively doing a really expensive full
rebalance while degraded. That means nothing else can go wrong or
you're in much bigger trouble. In particular it's really common for
there to be a mismatch between physical drive SCT ERC timeouts and the
kernel's command timer. Mismatches can cause a lot of confusion
because when the kernel timer is reached, it resets the block device
that contains the "late" command, which then blows away that drive's
entire command queue.

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
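
Checking and fixing that mismatch looks roughly like this (sdX is a
placeholder; 180 seconds is the value the wiki suggests for the kernel
timer when ERC can't be enabled on the drive):

    smartctl -l scterc /dev/sdX               # query SCT ERC
    smartctl -l scterc,70,70 /dev/sdX         # set 7.0s read/write ERC, if supported
    cat /sys/block/sdX/device/timeout         # kernel command timer, 30s by default
    echo 180 > /sys/block/sdX/device/timeout  # raise it if ERC can't be set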


> Feb 10 19:38:36 meije kernel: BTRFS info (device sda): disk added /dev/sdb
> Feb 10 19:39:18 meije kernel: BTRFS info (device sda): relocating
> block group 10045992468480 flags data|raid5
> Feb 10 19:39:27 meije kernel: BTRFS info (device sda): found 19 extents
> Feb 10 19:39:34 meije kernel: BTRFS info (device sda): found 19 extents
> Feb 10 19:39:39 meije kernel: BTRFS info (device sda): clearing
> incompat feature flag for RAID56 (0x80)
> Feb 10 19:39:39 meije kernel: BTRFS info (device sda): relocating
> block group 10043844984832 flags data|raid5

I'm not sure what's going on here. This is a raid6 volume and the
raid56 flag is being cleared? That's unexpected, and I don't know why
you have raid5 block groups on a raid6 array.


> and at that point the device remove aborted with an I/O error.

OK well you didn't include that, so we have no idea whether this I/O
error is about the same failed device or another device. If it's
another device, what can happen to the array is more complicated.
Hence why timeout mismatches are important. And why it's important to
have monitoring so you aren't running a degraded array for three days.

>
> I did discover I could use balance with a filter to balance much of
> the onto the three working discs, away from the missing one but I also
> discovered that whenever the checksum error appears the space cache
> seems to get corrupted.  Any further balance attempt results in
> getting stuck in a loop.  Mounting with clear_cache resolves that.

This sounds like a bug. The default space cache is stored in the data
block group, which for you should be raid6; with a missing device it's
effectively raid5. But there's some kind of conversion happening
during the balance/missing device removal, hence the clearing of the
raid56 flag per block group, and maybe this corruption is related to
that removal.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 23:14           ` Steven Fosdick
@ 2020-02-27  0:42             ` Chris Murphy
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Murphy @ 2020-02-27  0:42 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Jonathan H, Qu Wenruo, Btrfs BTRFS

On Wed, Feb 26, 2020 at 4:15 PM Steven Fosdick <stevenfosdick@gmail.com> wrote:
>
> then after mounting degraded, add a new device and attempt to remove
> the missing one:

Sorry. I misread this as just removing the missing device. I didn't
catch that you had added a new device first. That's OK.

But it's still better to use the 'btrfs replace' command for this,
because it has quite a lot of shortcuts to do the rebuild much faster.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-27  0:39           ` Chris Murphy
@ 2020-02-27  0:47             ` Chris Murphy
  2020-02-27 12:20             ` Steven Fosdick
  1 sibling, 0 replies; 24+ messages in thread
From: Chris Murphy @ 2020-02-27  0:47 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Steven Fosdick, Jonathan H, Qu Wenruo, Btrfs BTRFS

On Wed, Feb 26, 2020 at 5:39 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> > Feb 10 19:38:36 meije kernel: BTRFS info (device sda): disk added /dev/sdb
> > Feb 10 19:39:18 meije kernel: BTRFS info (device sda): relocating
> > block group 10045992468480 flags data|raid5
> > Feb 10 19:39:27 meije kernel: BTRFS info (device sda): found 19 extents
> > Feb 10 19:39:34 meije kernel: BTRFS info (device sda): found 19 extents
> > Feb 10 19:39:39 meije kernel: BTRFS info (device sda): clearing
> > incompat feature flag for RAID56 (0x80)
> > Feb 10 19:39:39 meije kernel: BTRFS info (device sda): relocating
> > block group 10043844984832 flags data|raid5
>
> I'm not sure what's going on here. This is a raid6 volume and raid56
> flag is being cleared? That's unexpected and I dn't know why you have
> raid5 block groups on a raid6 array.


OK, part of my confusion is that you sorta threadjacked, while still
being on topic, and I didn't realize you weren't the original poster.
So you started out with a raid5 from the get-go. The original poster
had a raid6.

I still don't know why you're getting messages:

clearing incompat feature flag for RAID56 (0x80)

I think that's confusing if you haven't asked for a conversion from
raid5 to a non-raid56 profile.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-27  0:39           ` Chris Murphy
  2020-02-27  0:47             ` Chris Murphy
@ 2020-02-27 12:20             ` Steven Fosdick
  2020-02-27 12:35               ` Steven Fosdick
  2020-02-29  6:31               ` Chris Murphy
  1 sibling, 2 replies; 24+ messages in thread
From: Steven Fosdick @ 2020-02-27 12:20 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Jonathan H, Qu Wenruo, Btrfs BTRFS

Chris,

Thanks for getting back to me.  I have read your subsequent e-mails too.

On Thu, 27 Feb 2020 at 00:39, Chris Murphy <lists@colorremedies.com> wrote:

> Both read and write errors reported by the hardware. These aren't the
> typical UNC error though. I'm not sure what DID_BAD_TARGET means. Some
> errors might be suppressed.

What are UNC errors?  My suspicion is that the drive failed in some
catastrophic way rather than just experiencing a large number of
errors.  Upon rebooting the drive made lots of clicking noises but did
not respond to whatever probe the BIOS does to find hard drives.

> Write errors are generally fatal. Read errors, if they include sector
> LBA, Btrfs can fix if there's an extra copy (dup, raid1, raid56, etc),
> otherwise, it may or may not be fatal depending on what's missing, and
> what's affected by it being missing.
>
> Btrfs might survive the write errors though with metadata raid1c3. But
> later you get more concerning messages...

So if each of these writes had succeeded, in that the drive reported
success, but the drive was subsequently unable to read the data back,
surely btrfs should be able to reconstruct the data in the failed
write, as it will either be mirrored, in the case of metadata, or
raid5, in the case of data.  So why should this be worse if the drive
reports that the write has failed?

> Ahhh yeah. So for what it's worth, in an md driver backed world, this
> drive would have been ejected (faulty) upon the first write error. md
> does retries for reads but writes it pretty much considers the drive
> written off, which means the array is degraded.

That sounds very sensible.

> As btrfs doesn't have such a concept of faulty drives, ejected drives,
> yet; you kinda have to keep an eye on this, and setup monitoring so
> you know when the array goes degraded like this.

So given that I probably don't have a spare drive lying around to
replace the failed one, and also given that this server is not
hot-plug, is there anything, other than just alerting me, that this
monitoring could do automatically to minimise or avoid damage to the
filesystem?  Can btrfs be instructed to mark a drive missing and go
into degraded mode on-line?

> You need to replace the bad drive, and do a scrub to fix things up.

Qu previously asked Jonathan for the output of btrfs check.
I ran 'btrfs check --force --check-data-csum /dev/sda'
and it found no errors.  Does this check all accessible devices in the
filesystem or just the one specified on the command line?

I'll resume the scrub I started earlier.

> And double check with 'btrfs fi us /mountpoint/' that all block groups
> have one profile set, and that it's the correct one.

Here is the output:

# btrfs fi us /data
WARNING: RAID56 detected, not implemented
Overall:
    Device size:   21.83TiB
    Device allocated:   30.06GiB
    Device unallocated:   21.80TiB
    Device missing:    5.46TiB
    Used:   25.44GiB
    Free (estimated):      0.00B (min: 8.00EiB)
    Data ratio:       0.00
    Metadata ratio:       2.00
    Global reserve: 512.00MiB (used: 0.00B)

Data,RAID5: Size:8.93TiB, Used:8.93TiB (99.98%)
   /dev/sda    4.47TiB
   /dev/sdc    4.47TiB
   missing   65.00GiB
   /dev/sdb    4.40TiB

Metadata,RAID1: Size:15.00GiB, Used:12.72GiB (84.78%)
   /dev/sda    9.00GiB
   /dev/sdc   11.00GiB
   /dev/sdb   10.00GiB

System,RAID1: Size:32.00MiB, Used:816.00KiB (2.49%)
   /dev/sda   32.00MiB
   /dev/sdb   32.00MiB

Unallocated:
   /dev/sda 1006.00GiB
   /dev/sdc 1004.03GiB
   missing    5.39TiB
   /dev/sdb    1.04TiB

The only thing that looks odd to me is the overall summary in that
both "Device Allocated" and "Free (estimated)" seem wrong.

> That's not a good idea in my opinion... you really need to replace the
> drive. Otherwise you're doing a really expensive full rebalance while
> degraded, effectively. That means nothing else can go wrong or you're
> in much bigger trouble.

As you noted in your other e-mail I did add another device first.  I
would have used btrfs device replace but the documentation I found
said that the superblock of the device being removed needs to be
readable.  After the reboot the failed device was not found by the
BIOS or Linux so the superblock is not readable.

Is the documentation correct, or can device replace now cope with
completely unreadable devices?

> In particular it's really common for there to
> be a mismatch between physical drive SCT ERC timeouts, and the
> kernel's command timer. Mismatches can cause a of confusion because
> upon kernel timer being reached, it resets the block device that
> contains the "late" command, which then blows away that drive's entire
> command queue.

These are NAS-specific hard discs and the SCT ERC timeout is set to 70:


>
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>
>
> > Feb 10 19:38:36 meije kernel: BTRFS info (device sda): disk added /dev/sdb
> > Feb 10 19:39:18 meije kernel: BTRFS info (device sda): relocating
> > block group 10045992468480 flags data|raid5
> > Feb 10 19:39:27 meije kernel: BTRFS info (device sda): found 19 extents
> > Feb 10 19:39:34 meije kernel: BTRFS info (device sda): found 19 extents
> > Feb 10 19:39:39 meije kernel: BTRFS info (device sda): clearing
> > incompat feature flag for RAID56 (0x80)
> > Feb 10 19:39:39 meije kernel: BTRFS info (device sda): relocating
> > block group 10043844984832 flags data|raid5
>
> I'm not sure what's going on here. This is a raid6 volume and raid56
> flag is being cleared? That's unexpected and I dn't know why you have
> raid5 block groups on a raid6 array.
>
>
> > and at that point the device remove aborted with an I/O error.
>
> OK well you didn't include that so we have no idea if this I/O error
> is about the same failed device or another device. If it's another
> device it's more complicated what can happen to the array. Hence why
> timeout mismatches are important. And why it's important to have
> monitoring so you aren't running a degraded array for three days.
>
> >
> > I did discover I could use balance with a filter to balance much of
> > the onto the three working discs, away from the missing one but I also
> > discovered that whenever the checksum error appears the space cache
> > seems to get corrupted.  Any further balance attempt results in
> > getting stuck in a loop.  Mounting with clear_cache resolves that.
>
> This sounds like a bug. The default space cache is stored in the data
> block group which for you should be raid6, with a missing device it's
> effectively raid5. But there's some kind of conversion happening
> during the balance/missing device removal, hence the clearing of the
> raid56 flag per block group, and maybe this corruption is happening
> related to that removal.
>
> --
> Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-27 12:20             ` Steven Fosdick
@ 2020-02-27 12:35               ` Steven Fosdick
  2020-02-29  6:31               ` Chris Murphy
  1 sibling, 0 replies; 24+ messages in thread
From: Steven Fosdick @ 2020-02-27 12:35 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Jonathan H, Qu Wenruo, Btrfs BTRFS

Chris,

Apologies, I was halfway through replying and managed to send the
e-mail by mistake, so here is the second half.

These are NAS-specific hard discs and the SCT ERC timeout is set to 70:
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

On Thu, 27 Feb 2020 at 00:39, Chris Murphy <lists@colorremedies.com> wrote:
> > and at that point the device remove aborted with an I/O error.
>
> OK well you didn't include that so we have no idea if this I/O error
> is about the same failed device or another device. If it's another
> device it's more complicated what can happen to the array. Hence why
> timeout mismatches are important. And why it's important to have
> monitoring so you aren't running a degraded array for three days.

When I first tried this there was nothing in the log except the
checksum errors.  Nothing from btrfs and nothing from the block device
driver either to indicate that there had been any hardware errors.

I did work out what code was being run within the kernel, added some
extra messages, and got as far as working out that the error is being
detected here, in relocate_file_extent_cluster, relocation.c, starting
around line 3336:

        if (!PageUptodate(page)) {
            btrfs_readpage(NULL, page);
            lock_page(page);
            if (!PageUptodate(page)) {
                unlock_page(page);
                put_page(page);
                btrfs_delalloc_release_metadata(BTRFS_I(inode),
                            PAGE_SIZE, true);
                btrfs_delalloc_release_extents(BTRFS_I(inode),
                                   PAGE_SIZE);
                btrfs_err(fs_info,
                    "relocate_file_extent_cluster: err#%d from btrfs_readpage/PageUptodate",
                    ret);
                ret = -EIO;
                goto out;
            }
        }

> This sounds like a bug. The default space cache is stored in the data
> block group which for you should be raid6, with a missing device it's
> effectively raid5. But there's some kind of conversion happening
> during the balance/missing device removal, hence the clearing of the
> raid56 flag per block group, and maybe this corruption is happening
> related to that removal.

Presumably it is still a bug with RAID5 - given that there are now no
hardware errors being logged, btrfs should not be corrupting the
space cache.  I can work around it, of course, by clearing the cache,
but there is still about 65GiB of data I cannot balance away from the
failed device so that it is properly resilient on the working devices.

Is there anything I can do here to narrow down this bug?  I probably
can't send you 14TB of data but I could run tools on this filesystem
or apply patches and post the output.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-27 12:20             ` Steven Fosdick
  2020-02-27 12:35               ` Steven Fosdick
@ 2020-02-29  6:31               ` Chris Murphy
  2020-03-03 23:40                 ` Steven Fosdick
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2020-02-29  6:31 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Chris Murphy, Jonathan H, Qu Wenruo, Btrfs BTRFS

On Thu, Feb 27, 2020 at 5:20 AM Steven Fosdick <stevenfosdick@gmail.com> wrote:
>
> Chris,
>
> Thanks for getting back to me.  I have read your subsequent e-mails too.
>
> On Thu, 27 Feb 2020 at 00:39, Chris Murphy <lists@colorremedies.com> wrote:
>
> > Both read and write errors reported by the hardware. These aren't the
> > typical UNC error though. I'm not sure what DID_BAD_TARGET means. Some
> > errors might be suppressed.
>
> What are UNC errors?

Uncorrectable (read or write). An error associated with bad sectors.

> > Btrfs might survive the write errors though with metadata raid1c3. But
> > later you get more concerning messages...
>
> So if each of these writes had succeeded, in that the drive reported
> success, but then the drive was subsequently unable to read the data
> back surely btrfs should be able to reconstruct the data in the failed
> write as it will either be mirrored, in the case of metadata, or
> raid5, in the case of data.  So why should this be worse if the drive
> reports that the write has failed?

s/might/should


>
> > As btrfs doesn't have such a concept of faulty drives, ejected drives,
> > yet; you kinda have to keep an eye on this, and setup monitoring so
> > you know when the array goes degraded like this.
>
> So given that I probably don't have a spare drive lying around to
> replace the failed one and also given that this server is not
> hot-plug, is there anything, other than just alerting me, that this
> monitoring could do automatically to minimise or avoid damage to the
> filesystem?  Can be btrfs be instructed to mark a drive missing and go
> into degraded mode on-line?

No. I'm not sure how to quiet the kernel noise that'll happen for
every write failure. You could stop all writes, unmount, remove the
offending drive, and mount degraded, and that'll be quiet. I haven't
tried yanking the bad drive, then using -o remount,degraded, so I'm
not sure if that will silence things.
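
Sketch, with placeholder device and mount point:

    # stop anything writing to the filesystem, then
    umount /data
    # pull or power down the bad drive, then
    mount -o degraded /dev/sda /data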


>
> > You need to replace the bad drive, and do a scrub to fix things up.
>
> Qu previously asked Jonathan for the output of btrfs check.
> I ran 'btrfs check --force --check-data-csum /dev/sda'
> and it found no errors.  Does this check all accessible devices in the
> filesystem or just the one specified on the command line?

I'm curious why you had to use force, but yes, that should check all
of them. If this is a mounted file system, there's 'btrfs scrub' for
this purpose too, and it can be set to run read-only on a read-only
mounted file system.

> > And double check with 'btrfs fi us /mountpoint/' that all block groups
> > have one profile set, and that it's the correct one.
>
> Here is the output:
>
> # btrfs fi us /data
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:   21.83TiB
>     Device allocated:   30.06GiB
>     Device unallocated:   21.80TiB
>     Device missing:    5.46TiB
>     Used:   25.44GiB
>     Free (estimated):      0.00B (min: 8.00EiB)
>     Data ratio:       0.00
>     Metadata ratio:       2.00
>     Global reserve: 512.00MiB (used: 0.00B)
>
> Data,RAID5: Size:8.93TiB, Used:8.93TiB (99.98%)
>    /dev/sda    4.47TiB
>    /dev/sdc    4.47TiB
>    missing   65.00GiB
>    /dev/sdb    4.40TiB
>
> Metadata,RAID1: Size:15.00GiB, Used:12.72GiB (84.78%)
>    /dev/sda    9.00GiB
>    /dev/sdc   11.00GiB
>    /dev/sdb   10.00GiB
>
> System,RAID1: Size:32.00MiB, Used:816.00KiB (2.49%)
>    /dev/sda   32.00MiB
>    /dev/sdb   32.00MiB
>
> Unallocated:
>    /dev/sda 1006.00GiB
>    /dev/sdc 1004.03GiB
>    missing    5.39TiB
>    /dev/sdb    1.04TiB
>
> The only thing that looks odd to me is the overall summary in that
> both "Device Allocated" and "Free (estimated)" seem wrong.

That looks like a bug. I'd try a newer btrfs-progs version. Kernel 5.1
is EOL but I don't think that's related to the usage info. Still, tons
of btrfs bugs fixed between 5.1 and 5.5...
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/?id=v5.5&id2=v5.1

Including raid56 specific fixes:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v5.5&id2=v5.1


> As you noted in your other e-mail I did add another device first.  I
> would have used btrfs device replace but the documentation I found
> said that the superblock of the device being removed needs to be
> readable.  After the reboot the failed device was not found by the
> BIOS or Linux so the superblock is not readable.
>
> Is the documentation correct, or can device replace now cope with
> completely unreadable devices?

I'm not sure what btrfs-progs version you've got or what man page
you're looking at. But replace will work whether the drive being
replaced is present or not.

man btrfs replace

           On a live filesystem, duplicate the data to the target
device which is currently stored on the source device. If the source
device is not available anymore, or if the -r option is set, the data
is built only using the RAID redundancy mechanisms.

There are two requirements that don't apply to add then remove: a) the
replacement drive needs to be the same size or bigger than the source;
b) fs resize is not automatically performed, so if the replacement
drive is bigger, and you want to use all of that space, you need to
run 'btrfs fi resize <devid>:max' or whatever size you want it set to.
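
As a sketch with placeholder names (when the source drive is gone you
refer to it by its devid, which 'btrfs filesystem show' lists as
missing):

    btrfs filesystem show /data
    btrfs replace start <missing-devid> /dev/sdX /data
    btrfs replace status /data
    btrfs filesystem resize <devid>:max /data   # only if the new drive is bigger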


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 16:45       ` Jonathan H
  2020-02-26 22:36         ` Steven Fosdick
@ 2020-03-03 15:42         ` Jonathan H
  2020-03-04 20:17           ` RooSoft Ltd @ bluehost
  2020-03-11  4:41         ` Zygo Blaxell
  2 siblings, 1 reply; 24+ messages in thread
From: Jonathan H @ 2020-03-03 15:42 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Update:

My most recent scrub just finished. It found a few hundred errors, but
many files that were not mentioned by the scrub at all are still
unreadable. I started another scrub and it is finding new errors and
correcting them, but I aborted it since I don't feel like constantly
scrubbing is making progress.

Much more interestingly, I ran `btrfs rescue chunk-recover` and it
reported that the majority (2847/3514) of my chunks were
unrecoverable. The output from the `chunk-recover` is too long to
include in a pastebin. Is there anything in particular that might be
of interest?

Also, I take the existence of unrecoverable chunks to mean that the
filesystem is not salvageable, is that true?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-29  6:31               ` Chris Murphy
@ 2020-03-03 23:40                 ` Steven Fosdick
  2020-03-04  6:32                   ` Chris Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Fosdick @ 2020-03-03 23:40 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS; +Cc: Jonathan H, Qu Wenruo

On Sat, 29 Feb 2020 at 06:31, Chris Murphy <lists@colorremedies.com> wrote:

> s/might/should

I do think it is worth looking at the possibility that the "write
hole", because it is well documented, is being blamed for all cases
where data proves to be unrecoverable, when some of these may be due
to a bug or bugs.  From what I've found, the write hole arises from
uncertainty over which of several discs actually got written to, so
when copies don't match there is no way to know which one is right.
In the case of a disc failure, though, surely the copy that is right
is the one that doesn't involve the failed disc?  Or is there
something else I don't understand?

> I'm curious why you had to use force, but yes that should check all of
> them. If this is a mounted file system, there's 'btrfs scrub' for this
> purpose though too and it can be set to run read-only on a read-only
> mounted file system.

In the case of 'btrfs check' the filesystem was mounted r/o but I had
things reading it so didn't want to unmount it completely.  It
requires --force to work on a mounted filesystem even if the mount is
r/o.

I did try running a scrub but had to abandon it as it wasn't proving
very useful.  It wasn't fixing the errors and wasn't providing any
messages that would help diagnose or fix them some other way - it only
seems to provide a count of the errors it didn't fix.  That seems to
be a general thing, in that there seem to be plenty of ways an overall
'failed' status can be returned to userspace, usually without anything
being logged.  That obviously makes sense if the request was to do
something stupid, but if instead the error return is because corruption
has been found, would it not be better to log an error?

> That looks like a bug. I'd try a newer btrfs-progs version. Kernel 5.1
> is EOL but I don't think that's related to the usage info. Still, tons
> of btrfs bugs fixed between 5.1 and 5.5...
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/?id=v5.5&id2=v5.1
>
> Including raid56 specific fixes:z
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v5.5&id2=v5.1

This was in response to posting dodgy output from btrfs fi usage.  My
output was from btrfs-progs v5.4 which, when I checked yesterday,
seemed to be the latest.  I am also running Linux 5.5.7.  It may have
been slightly older when the disk failed but would still have been
5.5.x

Since my previous e-mail I have managed to get a 'btrfs device remove
missing' to work by reading all the files from userspace, deleting
those that returned an I/O error, and restoring from backup.  Even
after that the summary information is still wacky:

WARNING: RAID56 detected, not implemented
Overall:
    Device size:   16.37TiB
    Device allocated:   30.06GiB
    Device unallocated:   16.34TiB
    Device missing:      0.00B
    Used:   25.40GiB
    Free (estimated):      0.00B (min: 8.00EiB)
    Data ratio:       0.00
    Metadata ratio:       2.00
    Global reserve: 512.00MiB (used: 0.00B)

Is the clue in the warning message?  It looks like it is failing to
count any of the RAID5 blocks.

Point taken about device replace.  What would device replace do if the
remove step failed in the same way that device remove has been failing
for me recently?

I'm a little disappointed we didn't get to the bottom of the bug that
was causing the free space cache to become corrupted when a balance
operation failed, but when I asked what I could do to help I got no
reply to that part of my message (not just from you, from anyone on
the list).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-03-03 23:40                 ` Steven Fosdick
@ 2020-03-04  6:32                   ` Chris Murphy
  2020-03-04 12:45                     ` Steven Fosdick
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2020-03-04  6:32 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Chris Murphy, Btrfs BTRFS, Jonathan H, Qu Wenruo

On Tue, Mar 3, 2020 at 4:40 PM Steven Fosdick <stevenfosdick@gmail.com> wrote:
>
> On Sat, 29 Feb 2020 at 06:31, Chris Murphy <lists@colorremedies.com> wrote:
>
> > s/might/should
>
> I do think it is worth looking at the possibility that the "write
> hole", because it well documented, is being blamed for all cases that
> data proves to be unrecoverable when some of these may be due to a bug
> or bugs.  From what I've found about the write hole this is because of
> uncertainty over which of several discs actually got written to so
> when copies don't match there is no way to know which one is right.
> In the case of a disc failure, though, surely the copy that is right
> is the one that doesn't involve the failed disc?  Or is there
> something else I don't understand?

a. the write hole doesn't happen with raid1, and your metadata is
raid1, so any file system corruption is not related to the write hole
b. the write hole can apply to your raid5 data stripes, but this is a
rare case that happens with a crash or power failure during write and
causes a stripe to be incompletely rewritten while it's being modified.
That's rare with conventional raid5, more rare on btrfs, but it can
happen.
c. to actually be affected by the write hole problem, the stripe with
the mismatching parity strip must have a missing data strip, such as a
bad sector making up one of the strips, or a failed device. If neither
of those is the case, it's not the write hole, it's something else.
d. before there's a device or sector failure, a scrub following a
crash or power loss will correct the problem resulting from the write
hole



> I did try running a scrub but had to abandon it as it wasn't proving
> very useful.  It wasn't fixing the errors and wasn't providing any
> messages that would help diagnose or fix them some other way - it only
> seems to provide a count of the errors it didn't fix.

It can't fix them when the file system is mounted read only.


>  That seems to
> be general thing in that there seem plenty of ways an overall 'failed'
> status can be returned to userspace, usually without anything being
> logged.  That obviously makes sense if the request was to do something
> stupid but if instead the error return is because corruption has been
> found would it not be better to log an error?

The most obvious case of corruption is a checksum mismatch (the
on-the-fly checksum for a node/leaf/block compared to the recorded
checksum). Btrfs always reports this.

Parity strips are not checksummed. If parity is corrupt, it's only
corrected on a scrub (or balance). They're not used during normal read
operations. Upon degraded reads, parity is used to reconstruct data.
Since there's no checksum, the parity is trusted, and bad parity will
cause a corrupt reconstruction of data; that corruption fails the
checksum - and Btrfs will tell you about it, and also return EIO.

So that leaves the less obvious cases of corruption where some
metadata or data is corrupt in memory, and a valid checksum is
computed on already corrupt data/metadata, and then written to disk.
Now when Btrfs reads it, there's no checksum mismatch, and yet there
is corruption. For metadata reads, the tree checker has gotten quite a
bit better lately at sanity checking metadata. For data, you're out of
luck: the application will have to sanity check it, and if it doesn't,
the data is just corrupt - but that's no different from any other
file system. At least Btrfs gave you a chance. But that's the gotcha
of bad RAM or other sources of bit flips in the storage stack.


> > That looks like a bug. I'd try a newer btrfs-progs version. Kernel 5.1
> > is EOL but I don't think that's related to the usage info. Still, tons
> > of btrfs bugs fixed between 5.1 and 5.5...
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/?id=v5.5&id2=v5.1
> >
> > Including raid56 specific fixes:z
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v5.5&id2=v5.1
>
> This was in response to posting dodgy output from btrfs fi usage.  My
> output was from btrfs-progs v5.4 which, when I checked yesterday,
> seemed to be the latest.  I am also running Linux 5.5.7.  It may have
> been slightly older when the disk failed but would still have been
> 5.5.x

From six days ago, your dmesg:

Sep 27 15:16:08 meije kernel:       Not tainted 5.1.10-arch1-1-ARCH #1

Actually what I should have asked is whether you ever ran 5.2 - 5.2.14
kernels because that series had a known corruption bug in it, fixed in
5.2.15

> Since my previous e-mail I have managed to get a 'btrfs device remove
> missing' to work by reading all the files from userspace, deleting
> those that returned I/O error and restoring from backup.  Even after
> that the summary information is still wacky:
>
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:   16.37TiB
>     Device allocated:   30.06GiB
>     Device unallocated:   16.34TiB
>     Device missing:      0.00B
>     Used:   25.40GiB
>     Free (estimated):      0.00B (min: 8.00EiB)
>     Data ratio:       0.00
>     Metadata ratio:       2.00
>     Global reserve: 512.00MiB (used: 0.00B)
>
> is the clue in the warning message?  It looks like it is failing to
> count any of the RAID5 blocks.

I think 'btrfs filesystem usage' doesn't completely support raid56;
that's all it's saying.

'btrfs fi df' and 'btrfs fi show' should show things correctly

>
> Point taken about device replace.  What would device replace do if the
> remove step failed in the same way that device remove has been failing
> for me recently?

I don't understand the question. The device replace command includes
'device add' and 'device remove' in one step, it just lacks the
implied resize that happens with add and remove.



> I'm a little disappointed we didn't get to the bottom of the bug that
> was causing the free space cache to become corrupted when a balance
> operation failed but when I asked what I could do to help I got no
> reply to that part of my message (not just from you, from anyone on
> the list).

The free space cache isn't that important. It can be discarded and
reconstructed. It's an optimization. I don't think it's checksummed
anyway, instead corruption is determined by mismatching
generation/transid? So it may not literally be corrupt, it's just
ambiguous whether it can be relied upon, therefore it's marked for
reconstruction.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-03-04  6:32                   ` Chris Murphy
@ 2020-03-04 12:45                     ` Steven Fosdick
  2020-03-04 20:10                       ` Chris Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Fosdick @ 2020-03-04 12:45 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Jonathan H, Qu Wenruo

On Wed, 4 Mar 2020 at 06:32, Chris Murphy <lists@colorremedies.com> wrote:

> a. the write hole doesn't happen with raid1, and your metadata is
> raid1 so any file system corruption is not related to the write hole

That makes sense and I had no problem moving the metadata blocks from
the failed disk onto the working ones.

> c. to actually be affected by the write hole problem, the stripe with
> mismatching parity strip must have a missing data strip such as a bad
> sector making up one of the strips, or a failed device. If neither of
> those are the case, it's not the write hole, it's something else.

Normally there would not have been any power failures as the machine
is protected by a UPS and auto shuts down when the battery is getting
low.  Certainly there would have been none between the filesystem
being created and the disk failure.  The only exception is that after
the disk failure, and after the device remove had failed, I had to turn
the power off because the machine would not shut down.  As the device
remove had moved very little data before failing, I had used balance to
move data away from the failed disk, i.e. to restore redundancy and
mitigate the risk of a further disk failure.  That is how I came to
migrate the metadata ahead of the data and then used a range filter to
migrate the data in stages.  That is where I discovered that when
migrating a range of block groups failed, all subsequent attempts to
use balance resulted in an infinite loop, and that is what led to the
need to turn the power off.  Subsequently I was able to avoid a repeat
of that by applying a patch from this list to make the balance
cancellable in more places, and also discovered that clearing the
space cache avoided the loop anyway.

> d. before there's a device or sector failure, a scrub following a
> crash or power loss will correct the problem resulting from the write
> hole.

Worth knowing for the future.

> It can't fix them when the file system is mounted read only.

I had mounted it r/w by then.

> The most obvious case of corruption is a checksum mismatch (the
> onthefly checksum for a node/leaf/block compared to the recorded
> checksum). Btrfs always reports this.

And it did, but only for the relocation tree that was being built as
part of the balance.  I am sure you or Qu said in a previous e-mail
that this is a temporary structure only built during that operation, so
it should not have been corrupted by previous problems.  As no media
errors were logged either, that must surely mean that either there is
a bug in constructing the tree, or corrupted data was being copied from
elsewhere into the tree and only detected after that copy rather than
before.

> So that leaves the less obvious cases of corruption where some
> metadata or data is corrupt in memory, and a valid checksum is
> computed on already corrupt data/metadata, and then written to disk.

But if the relocation tree is constructed during the balance operation
rather than being a permanent structure then the chance of flipped
bits in memory corrupting it on successive attempts is surely very
small indeed.

> At least Btrfs gave you a chance. But that's the gotcha
> of bad RAM or other sources of bit flips in the storage stack.

I am not complaining about checksums - yes it is better to know if
your data has been corrupted, I just want btrfs to be as robust as
possible.

> From six days ago, your dmesg:
>
> Sep 27 15:16:08 meije kernel:       Not tainted 5.1.10-arch1-1-ARCH #1

Sorry for the confusion from having threadjacked.  My kernel history is:

Sep 04 13:53:53 meije kernel: Linux version 5.1.10-arch1-1-ARCH
Oct 14 14:31:56 meije kernel: Linux version 5.3.6-arch1-1-ARCH
Nov 29 12:25:21 meije kernel: Linux version 5.4.0-arch1-1
Dec 30 17:30:49 meije kernel: Linux version 5.4.6-arch3-1
Jan 04 15:54:22 meije kernel: Linux version 5.4.7-arch1-1
Jan 09 15:43:56 meije kernel: Linux version 5.4.8-arch1-1
Jan 22 10:15:30 meije kernel: Linux version 5.4.13-arch1-1
Jan 26 17:23:36 meije kernel: Linux version 5.4.15-arch1-1
Jan 29 22:56:42 meije kernel: Linux version 5.5.0-arch1-1
Feb 10 16:28:52 meije kernel: Linux version 5.5.2-arch2-2
Feb 14 20:23:08 meije kernel: Linux version 5.5.3-arch1-1
Feb 16 16:06:43 meije kernel: Linux version 5.5.3-arch1-1
Feb 18 08:36:49 meije kernel: Linux version 5.5.4-arch1-1
Feb 26 17:11:08 meije kernel: Linux version 5.5.6-arch1-1
Mar 03 20:45:28 meije kernel: Linux version 5.5.7-arch1-1

> Actually what I should have asked is whether you ever ran 5.2 - 5.2.14
> kernels because that series had a known corruption bug in it, fixed in
> 5.2.15

No, I skipped the whole of 5.2 because I saw messages about corruption
on this list.

> I think btrf filesystem usage doesn't completely support raid56 is all
> it's saying.
>
> 'btrfs fi df' and 'btrfs fi show' should show things correctly

They do, and the info for individual devices in the output of btrfs fi
usage also looks completely believable.  It's only the summary at the
top that's obviously incorrect.

> I don't understand the question. The device replace command includes
> 'device add' and 'device remove' in one step, it just lacks the
> implied resize that happens with add and remove.

When I did the add and remove separately, the add succeeded and the
remove failed (initially), having moved very little data.  If that were
to happen with those same steps within a replace, would it simply stop
where it found the problem, leaving the new device added and the old
one not yet removed, or would it try to back out the whole operation?

> The free space cache isn't that important. It can be discarded and
> reconstructed. It's an optimization.

Of course, but if it wasn't for Jonathan mentioning that btrfs check
had found his space cache to be corrupt I would never have
hypothesised that the same might be happening for me as there were no
messages in the log about the cache or anything that looked like an
error, only an infinite loop.  I think what was happening was that
moving the block group was failing with an "out of space" error and
the loop simply retries it in the hope that some space has become
available since, and the out of space error was in turn caused by the
corrupt space cache.

So in summary, I believe I was suffering from a situation in which,
for a very small number of data blocks, something went wrong in the
reconstruction process or some associated metadata was bad such that:

1. Building the relocation tree went wrong so that it had a checksum error.
2. The free space cache became corrupt.

Regards,
Steve.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-03-04 12:45                     ` Steven Fosdick
@ 2020-03-04 20:10                       ` Chris Murphy
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Murphy @ 2020-03-04 20:10 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Chris Murphy, Btrfs BTRFS, Jonathan H, Qu Wenruo

On Wed, Mar 4, 2020 at 5:45 AM Steven Fosdick <stevenfosdick@gmail.com> wrote:
>
> > The most obvious case of corruption is a checksum mismatch (the
> > onthefly checksum for a node/leaf/block compared to the recorded
> > checksum). Btrfs always reports this.
>
> And it did, but only for the relocation tree that was being built as
> part of the balance.  I am sure you or Qu said in a previous e-mail
> that this is a temporary structure only built during that operation so
> should not have been corrupted by previous problems.  As no media
> errors were logged either that must surely mean that either there is a
> bug in constructing the tree or corrupted data was being copied from
> elsewhere into the tree and only detected after that copy rather than
> before.

I'm not familiar enough with the data relocation tree; all I can do is
wild speculation: it could be that the reported corruption, which might
just be reporting noise, is a consequence of the stalled/failed device
removal, and that the actual problem remains obscured.


>
> > So that leaves the less obvious cases of corruption where some
> > metadata or data is corrupt in memory, and a valid checksum is
> > computed on already corrupt data/metadata, and then written to disk.
>
> But if the relocation tree is constructed during the balance operation
> rather than being a permanent structure then the chance of flipped
> bits in memory corrupting it on successive attempts is surely very
> small indeed.

Probably true.

> > I don't understand the question. The device replace command includes
> > 'device add' and 'device remove' in one step, it just lacks the
> > implied resize that happens with add and remove.
>
> When i did the add and remove separately, the add succeeded and the
> remove failed (initially) having moved very little data.  If that were
> to happen with those same steps within a replace would it simply stop
> where it found the problem, leaving the new device added and the old
> one not yet removed, or would it try to back out the whole operation?

Yeah the replace code has its own ioctl in the kernel. So it's not
entirely fair to refer to it as a mere shortcut of the add then remove
method.

First data is copied from source to new target, the copy reuses scrub
code, and the new target isn't actually "added" until the very end of
the process. During the copy, new blocks are written to both source
and destination devices. Only once replication is definitely
successful is the new device really added, and the old device removed.
Up to the point where the two are swapped out, the source device is
not in a "being removed" state like the add then remove method.

The device add then remove method takes a while, involves resize and
balance code, and is migrating chunks on the source to other devices.
In the case of raid5 it means restriping all devices. Every device is
reading and writing. It's a lot more expensive than just replacing.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-03-03 15:42         ` Jonathan H
@ 2020-03-04 20:17           ` RooSoft Ltd @ bluehost
  2020-03-09 21:10             ` Steven Fosdick
  0 siblings, 1 reply; 24+ messages in thread
From: RooSoft Ltd @ bluehost @ 2020-03-04 20:17 UTC (permalink / raw)
  To: linux-btrfs

On 03/03/2020 15:42, Jonathan H wrote:
> Update:
>
> My most recent scrub just finished. It found a few hundred errors, but
> many files that were not mentioned by the scrub at all are still
> unreadable. I started another scrub and it is finding new errors and
> correcting them, but I aborted it since I do feel like constantly
> scrubbing is making progress.
>
> Much more interestingly, I ran `btrfs rescue chunk-recover` and it
> reported that the majority (2847/3514) of my chunks were
> unrecoverable. The output from the `chunk-recover` is too long to
> include in a pastebin. Is there anything in particular that might be
> of interest?
>
> Also, I take the existence of unrecoverable chunks to mean that the
> filesystem is not salvageable, is that true?

I would suspect memory corruption at this stage then. Run memtest and
if it finds anything wrong, switch to ECC and don't use that memory
for anything critical.

-- 
==

Don Alexander


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-03-04 20:17           ` RooSoft Ltd @ bluehost
@ 2020-03-09 21:10             ` Steven Fosdick
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Fosdick @ 2020-03-09 21:10 UTC (permalink / raw)
  To: RooSoft Ltd @ bluehost, linux-btrfs

Thanks for the suggestion.  I ran memtest and it found no errors.
This is an HP Gen8 microserver and according to the spec sheet uses
ECC RAM.

On Wed, 4 Mar 2020 at 21:05, RooSoft Ltd @ bluehost
<roosoft@casa-di-locascio.net> wrote:
>
> On 03/03/2020 15:42, Jonathan H wrote:
> > Update:
> >
> > My most recent scrub just finished. It found a few hundred errors, but
> > many files that were not mentioned by the scrub at all are still
> > unreadable. I started another scrub and it is finding new errors and
> > correcting them, but I aborted it since I do feel like constantly
> > scrubbing is making progress.
> >
> > Much more interestingly, I ran `btrfs rescue chunk-recover` and it
> > reported that the majority (2847/3514) of my chunks were
> > unrecoverable. The output from the `chunk-recover` is too long to
> > include in a pastebin. Is there anything in particular that might be
> > of interest?
> >
> > Also, I take the existence of unrecoverable chunks to mean that the
> > filesystem is not salvageable, is that true?
>
> I would suspect memory corruption at this stage then. Run memtest and if
> it finds anything wrong switch to ECC and don't use that memory for
> critical anything.
>
> --
> ==
>
> Don Alexander
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 16:45       ` Jonathan H
  2020-02-26 22:36         ` Steven Fosdick
  2020-03-03 15:42         ` Jonathan H
@ 2020-03-11  4:41         ` Zygo Blaxell
  2 siblings, 0 replies; 24+ messages in thread
From: Zygo Blaxell @ 2020-03-11  4:41 UTC (permalink / raw)
  To: Jonathan H; +Cc: Qu Wenruo, linux-btrfs

On Wed, Feb 26, 2020 at 08:45:17AM -0800, Jonathan H wrote:
> On Tue, Feb 25, 2020 at 8:37 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > It's great that your metadata is safe.
> >
> > The biggest concern is no longer a concern now.
> 
> Glad to hear.
> 
> > More context would be welcomed.
> 
> Here's a string of uncorrectable errors detected by the scrub: http://ix.io/2cJM
> 
> Here is another attempt to read a file giving an I/O error: http://ix.io/2cJS
> The last two lines are produced when trying to read the file a second time.
> 
> Here's the state of the currently running scrub: http://ix.io/2cJU
> I had to cancel and resume the scrub to run `btrfs check` earlier, but
> otherwise it has been uninterrupted.
> 
> > Anyway, even with more context, it may still lack the needed info as
> > such csum failure message is rate limited.
> >
> > The mirror num 2 means it's the first rebuild try failed.
> >
> > Since only the first rebuild try failed, and there are some corrected
> > data read, it looks btrfs can still rebuild the data.
> >
> > Since you have already observed some EIO, it looks like write hole is
> > involved, screwing up the rebuild process.
> > But it's still very strange, as I'm expecting more mirror number other
> > than 2.
> > For your 6 disks with 1 bad disk, we still have 5 ways to rebuild data,
> > only showing mirror num 2 doesn't look correct to me.
> 
> I'm sort of curious why so many files have been affected. It seems
> like most of the file system has become unreadable, but I was under
> the impression that if the write hole occurred it would at least not
> damage too much data at once. Is that incorrect?

There are still unfixed bugs in btrfs parity RAID:

	https://www.spinics.net/lists/linux-btrfs/msg94594.html

If you have an array where some of the drives go offline for a while and
come back online, then you will see a lot of what looks like disk-level
corruption.  The unwritten blocks on drives that come back online are
treated as corrupted data (csums or transid fields don't match expected
values recorded on the other drives) and btrfs will attempt to repair
them.

If you have parent transid verify failures, you are very likely to also
have correctable data errors made uncorrectable due to the above raid5/6
bug.  The two visible error cases are two different possible consequences
of the same low-level write loss events.  This means that at some point,
you had two disks offline for a while, but the other disks in the array
were still getting updates (array failures are never simple--multiple
modes of failure at different times during a single event are the norm).

If you have corrupted data on raid5/6 on btrfs, some of it won't come
back due to the data recovery corruption bug linked above.  Until this
bug is fixed, the only alternative is to restore the lost data from
backups.  Replacing the missing drive before fixing the correction bug
in the kernel will damage some more data, so data that is theoretically
readable now may be lost in the future as you replace drives; however,
losses should be 1% or less, so raid5/6 recovery in-place can still be
quicker than a full mkfs+restore for raid0.

Note that the raid5/6 write hole is a separate issue.  It's possible
for both issues to occur at the same time in a failing array, but the
correction bug will affect several orders of magnitude more data than
the write hole.

raid1 and raid1c3 have no such problems.  The parent transid verify errors
come from btrfs metadata, which in your filesystem is raid1c3, so they
would have been easily and correctly repaired as they were encountered.

> > BTW, since your free space cache is already corrupted, it's recommended
> > to clear the space cache.
> 
> It's strange to me that the free space cache is giving an error, since
> I cleared it previously and the most recent unmount was clean.

Free space cache is stored in data block groups and subject to all
of the above btrfs parity raid data integrity problems.  Do not use
space_cache=v1 with raid5 or raid6.  Better not to use space_cache=v1
at all, but v1 + raid5/6 is bad in ways that go beyond merely being slow
and unreliable.

Free space tree (space_cache=v2) is stored in btrfs metadata, so it will
work properly with raid1 or raid1c3 metadata.  Probably faster too,
and nothing can break v2 that doesn't also destroy the filesystem.
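
Switching over is roughly (device name and mount point are
placeholders; the free space tree is created on the first read-write
mount with space_cache=v2 and persists afterwards):

    umount /data
    btrfs check --clear-space-cache v1 /dev/sda   # optional: drop the old v1 cache
    mount -o space_cache=v2 /dev/sda /data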

All that said, there are internal data integrity checks in free space
cache (v1), so it's possible that the only bad thing that happens here
is that you get a bunch of free space cache invalidation error messages.

> > For now, since it looks like write hole is involved, the only way to
> > solve the problem is to remove all offending files (including a super
> > large file in root 5).
> >
> > You can use `btrfs inspect logical-resolve <bytenr> <mnt>" to see all
> > the involved files.
> >
> > The full <bytenr> are the bytenr shown in btrfs check --check-data-csum
> > output.
> 
> The strange thing is if you use `btrfs inspect logical-resolve` on all
> of the bytenrs mentioned in the check output, I get that all of the
> corruption is in the same file (see http://ix.io/2cJP), but this does
> not seem consistent with the uncorrectable csum errors the scrub is
> detecting.

The uncorrectable csum errors, and changes in errors over time, are
probably the correction bug in action.  Scrub also produces highly
questionable error statistics on raid5/6.  That may be a distinct bug
from the correction/corruption bug--it's hard to tell without fixing
all the current bugs and testing again.

Note: even if scrub is fully debugged, it is limited by the btrfs
on-disk format.  Corruption in data blocks with csums can always be
corrected up to raid5/6 drive-loss limits.  nodatasum files cannot be
reliably corrected.  Free space cache will be corrupted, but btrfs
should detect this and invalidate/rebuild the cache (but don't use
space_cache=v1 anyway).  In the best case scrub will count some csum
errors against the wrong disks in some cases (though the best case
is certainly better than what scrub does now).

In a RAID5/6 stripe that is completely filled with data blocks
belonging to files that have csums, every data block in the stripe can
be individually tested against its csum.  If all the data blocks have
correct csums, but the parity block on disk does not match computed
parity of the data, then we know that the parity block is corrupted
because we eliminated every other possible corrupted block.

If one of the data blocks in a RAID5/6 stripe has an incorrect csum then
we can try to recover the data using the parity block.  If that recovered
data fails the csum check too, then we know both data and parity blocks
are corrupt, since all other blocks in the raid stripe have good csums.
RAID6 has another parity block and some more combinations to try,
but eventually ends up either recovering the entire stripe or knowing
exactly which blocks were corrupt.

If there is:

	- a data block in a RAID5/6 stripe which does not have a csum
	(either because it is unoccupied, part of free space cache,
	or part of a nodatasum file)

	- all the data blocks that do have csums in the RAID stripe
	are OK (otherwise we would know that those blocks were the
	corrupted ones)

	- a parity mismatch detected in the RAID stripe, i.e. by scrub

then btrfs cannot determine whether the parity block is corrupted or one
of the no-csum data blocks.  The parity mismatch can be detected, but any
of the drives without a csum on its data block could have contributed to
the mismatch, and there is no way to tell which no-csum data block(s) is
(are) correct.  This will cause scrub to place csum error counts on the
wrong disks, e.g. blaming the disk that happens to hold the parity block
for the raid stripe when one of the other disks is the one flipping bits.

None of this explains why scrub reports "read" errors on healthy drives
when there is data corruption on other drives.  That part is a bug, the
only question is whether it's the _same_ bug as the correction corruption
bug, and that won't be known until at least one of the bugs is fixed.

> I've been calculating the offsets of the files mentioned in the
> relocation csum errors (by adding the block group and offset),
> resolving the files with `btrfs inspect logical-resolve` and deleting
> them. But it seems like the set of files I'm deleting is also totally
> unrelated to the set of files the scrub is detecting errors in. Given
> the frequency of relocation errors, I fear I will need to delete
> almost everything on the file system for the deletion to complete. I
> can't tell if I should expect these errors to be fixable since the
> relocation isn't making any attempt to correct them as far as I can
> tell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-03-11  4:41 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-25 20:39 USB reset + raid6 = majority of files unreadable Jonathan H
2020-02-25 23:58 ` Qu Wenruo
2020-02-26  0:51   ` Steven Fosdick
2020-02-26  0:55     ` Qu Wenruo
2020-02-26  3:38   ` Jonathan H
2020-02-26  3:44     ` Jonathan H
2020-02-26  4:37     ` Qu Wenruo
2020-02-26 16:45       ` Jonathan H
2020-02-26 22:36         ` Steven Fosdick
2020-02-26 23:14           ` Steven Fosdick
2020-02-27  0:42             ` Chris Murphy
2020-02-27  0:39           ` Chris Murphy
2020-02-27  0:47             ` Chris Murphy
2020-02-27 12:20             ` Steven Fosdick
2020-02-27 12:35               ` Steven Fosdick
2020-02-29  6:31               ` Chris Murphy
2020-03-03 23:40                 ` Steven Fosdick
2020-03-04  6:32                   ` Chris Murphy
2020-03-04 12:45                     ` Steven Fosdick
2020-03-04 20:10                       ` Chris Murphy
2020-03-03 15:42         ` Jonathan H
2020-03-04 20:17           ` RooSoft Ltd @ bluehost
2020-03-09 21:10             ` Steven Fosdick
2020-03-11  4:41         ` Zygo Blaxell
