Is this error fixable or do I need to rebuild the drive?

* Is this error fixable or do I need to rebuild the drive?
@ 2022-03-04 23:33 Jan Kanis
  2022-03-04 23:39 ` Qu Wenruo
  2022-03-07  0:42 ` Damien Le Moal
  0 siblings, 2 replies; 13+ messages in thread
From: Jan Kanis @ 2022-03-04 23:33 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I have a btrfs filesystem with two disks in raid 1. Each btrfs device
sits on top of a LUKS encrypted volume, which consists of a raw drive
partition on a SMR hard disk, though I don't think that's relevant.

One of the drives failed, the sata link appears to have died, if I'm
interpreting the system logs right. As it's a raid 1 the system kept
running and I didn't notice the dead drive until some time later,
during which I kept using the filesystem.
Something wasn't behaving right, so I decided to reboot. After the
reboot the btrfs filesystem didn't come up and one of the drives was
dead. I was able to mount from the remaining device with
degraded/read-only, all data seemed to be there.
I took out the dead drive and put it into another system for
examination. After some fiddling the drive came up again, so it wasn't
permanently dead after all. I was able to mount it degraded/read-only.
It looked good except for missing the latest changes I made to some
files I was working with, so it was a bit out of date. A btrfs scrub
showed no corruptions.
I put the drive back in the original system, thinking that btrfs would
either refuse to mount it or fix it from the other copy. The
filesystem automatically mounted rw without a 'degraded' option, and
the filesystem could be used again. The logs showed some "parent
transid verify failed" errors, which I assumed would be corrected from
the other copy. Attempting to mount only the drive that had failed
with degraded/read-only now no longer worked.

It's now some days later, the filesystem is still working, but I'm
also still getting "parent transid verify failed" errors in the logs,
and "read error corrected". So by now I'm thinking that btrfs
apparently does not fix this error by itself. What's happening here,
and why isn't btrfs fixing it, it has two copies of everything?
What's the best way to fix it manually? Rebalance the data? scrub it?
delete, wipe and re-add the device that failed so the mirror can be
rebuilt?

best,
Jan

^ permalink raw reply	[flat|nested] 13+ messages in thread