[Resent because the message was too long for the list]

On Tue, 2020-08-11 at 13:17 -0600, Chris Murphy wrote:
> > > My advice is to mount ro, backup (or two copies for important
> > > info), and start with a new Btrfs file system and restore. It's
> > > not worth repairing.
> >
> > Sigh, I was expecting I'd have to do this. At least no data was
> > lost, and the system still functions even though it's read-only.
> > Do you think check --repair is not worth trying? Everything of
> > value is already backed up, but restoring it would take many hours
> > of work.
> >
> > Metadata, RAID10: total=9.00GiB, used=7.57GiB
>
> Ballpark 8 hours for --repair given metadata size and spinning
> drives. It'll add some time adding --init-extent-tree which... is
> decently likely to be needed here. So the gotcha is, see if --repair
> works, and it fixes some stuff but still needs the extent tree
> repaired anyway. Now you have to do that and it could be another 8
> hours. Or do you go with the heavy hammer right away to save time
> and do both at once? But the heavy hammer is riskier.
>
> Whether you repair or start over, you need to have the backup plus
> 2x for important stuff. To do the repair you need to be prepared for
> the possibility that things get worse. I'll argue strongly that it's
> a bug if things get worse (i.e. now you can't mount ro at all), but
> as a risk assessment it has to be considered.

So, I've finally managed to get someone to add a disk to this system
and ran a btrfs check --repair. It failed almost immediately with:

    Starting repair.
    Opening filesystem to check...
    Checking filesystem on /dev/disk/by-label/Susanita
    UUID: 4d3acf20-d408-49ab-b0a6-182396a9f27c
    [1/7] checking root items
    checksum verify failed on 10919566688256 found 0000006E wanted 00000066
    checksum verify failed on 10919566688256 found 0000006E wanted 00000066
    bad tree block 10919566688256, bytenr mismatch, want=10919566688256, have=17196831625821864417
    ERROR: failed to repair root items: Input/output error

so I ran btrfs check --init-extent-tree, and it's still running after
24 hours.
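For completeness, the commands were roughly the following, all run
against the unmounted filesystem (device path as in the log above):

    # Read-only check; reports problems without modifying anything.
    btrfs check --readonly /dev/disk/by-label/Susanita

    # The repair attempt that failed on the root items.
    btrfs check --repair /dev/disk/by-label/Susanita

    # The heavy hammer, still running: rebuild the extent tree.
    btrfs check --init-extent-tree /dev/disk/by-label/Susanita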
It seems to have processed 2 GiB of... something:

    [2/7] checking extents (0:04:22 elapsed, 434185 items checked)
    ref mismatch on [331916251136 4096] extent item 0, found 1
    data backref 331916251136 parent 10915911958528 owner 0 offset 0 num_refs 0 not found in extent tree
    incorrect local backref count on 331916251136 parent 10915911958528 owner 0 offset 0 found 1 wanted 0 back 0x557cdf7560f0
    backpointer mismatch on [331916251136 4096]
    adding new data backref on 331916251136 parent 10915911958528 owner 0 offset 0 found 1
    Repaired extent references for 331916251136

[24 hours later]

    [2/7] checking extents (23:47:26 elapsed, 434185 items checked)
    ref mismatch on [334605303808 188416] extent item 0, found 2
    data backref 334605303808 parent 10915986505728 owner 0 offset 0 num_refs 0 not found in extent tree
    incorrect local backref count on 334605303808 parent 10915986505728 owner 0 offset 0 found 1 wanted 0 back 0x557ce0ac16c0
    data backref 334605303808 root 10455 owner 219090 offset 921600 num_refs 0 not found in extent tree
    incorrect local backref count on 334605303808 root 10455 owner 219090 offset 921600 found 1 wanted 0 back 0x557d14faebc0
    backpointer mismatch on [334605303808 188416]
    adding new data backref on 334605303808 parent 10915986505728 owner 0 offset 0 found 1
    adding new data backref on 334605303808 root 10455 owner 219090 offset 921600 found 1
    Repaired extent references for 334605303808

But now I've got no idea whether it's doing something useful or
whether I'd better ^C it and give up on this filesystem. I've
attached the logs of the ongoing repair and of the read-only check I
ran immediately before.

Cheers.
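PS: If I do end up giving up, my fallback would be roughly Chris's
original advice. A sketch, assuming the damaged filesystem still
mounts read-only and the new disk shows up as /dev/sdX (placeholder):

    # Mount the damaged filesystem read-only and copy everything off.
    mkdir -p /mnt/old /mnt/new
    mount -o ro /dev/disk/by-label/Susanita /mnt/old

    # Fresh filesystem on the new disk, then copy the data over,
    # preserving hard links, ACLs and xattrs.
    mkfs.btrfs /dev/sdX
    mount /dev/sdX /mnt/new
    rsync -aHAX /mnt/old/ /mnt/new/

    # If the read-only mount stops working, btrfs restore can still
    # pull files off the unmounted device.
    btrfs restore /dev/disk/by-label/Susanita /mnt/new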