On 2020/7/1 6:16 PM, Illia Bobyr wrote:
> On 6/30/2020 6:36 PM, Qu Wenruo wrote:
>> On 2020/7/1 3:41 AM, Illia Bobyr wrote:
>>> Hi,
>>>
>>> I have a btrfs with bcache setup that failed during a boot yesterday.
>>> There is one SSD with bcache that is used as a cache for 3 btrfs HDDs.
>>>
>>> Having read through a number of discussions, I've decided to ask for
>>> advice here.  Should I be running "btrfs check --repair"?
>>>
>>> The last messages in the dmesg log are these:
>>>
>>> Btrfs loaded, crc32c=crc32c-intel
>>> BTRFS: device label root devid 3 transid 138434 /dev/bcache2 scanned
>>> by btrfs (341)
>>> BTRFS: device label root devid 2 transid 138434 /dev/bcache1 scanned
>>> by btrfs (341)
>>> BTRFS: device label root devid 1 transid 138434 /dev/bcache0 scanned
>>> by btrfs (341)
>>> BTRFS info (device bcache0): disk space caching is enabled
>>> BTRFS info (device bcache0): has skinny extents
>>> BTRFS error (device bcache0): parent transid verify failed on
>>> 16984159518720 wanted 138414 found 138207
>>> BTRFS error (device bcache0): parent transid verify failed on
>>> 16984159518720 wanted 138414 found 138207
>>> BTRFS error (device bcache0): open_ctree failed
>>
>> Looks like some tree blocks were not written back correctly.
>>
>> Considering we don't have any known write-back related bugs in 5.6, I
>> guess bcache may be involved again?
>
> A bit more detail: the system started to misbehave.
> The interactive session was saying that the main file system had
> become read-only.

Do you have any dmesg output from that RO event?
That would be the most valuable info to help us locate the bug and fix it.

I guess something went wrong before that, and it somehow corrupted the
extent tree, breaking the copy-on-write protection of metadata and
screwing up everything.

> Then the SSH session disconnected and never reconnected.
> The machine did not seem to reboot correctly after I pressed the
> reboot button, so I did a hard reboot.
> And now it cannot mount the root partition any more.
>
>>> Trying to mount it in the recovery mode does not seem to work:
>>>
>>> [...]
>>>
>>> I have tried booting a live ISO with a 5.8.0 kernel and btrfs-progs
>>> v5.6.1 from http://defender.exton.net/.
>>> After booting, I tried mounting the bcache using the same command as
>>> above.  The only message in the console was "Killed".
>>> /dev/kmsg, on the other hand, lists messages very similar to the ones
>>> I've seen in the initramfs environment: https://pastebin.com/Vhy072Mx
>>
>> It looks like there is a chance to recover, as there is a root backup
>> with a newer generation.
>>
>> Meanwhile, the tree-checker is rejecting the newer-generation one.
>>
>> The kernel panic is caused by some corner-case error handling in the
>> root backup cleanup.
>> We need to fix it anyway.
>>
>> In this case, I guess "btrfs ins dump-super -fFa" output would help to
>> show whether it's possible to recover.
>
> Here is the output: https://pastebin.com/raw/DtJd813y

OK, the backup roots are fine.

So this means the metadata COW is corrupted, which caused the transid
mismatch.

>> Anyway, something looks strange.
>>
>> The backup roots having a newer generation while the super block is
>> still old doesn't look correct at all.
>
> Just in case, here is the output of "btrfs check", as suggested by
> "A L".  It does not seem to contain any new information.
>
> Opening filesystem to check...
> parent transid verify failed on 16984014372864 wanted 138350 found 131117
> parent transid verify failed on 16984014405632 wanted 138350 found 131127
> parent transid verify failed on 16984013406208 wanted 138350 found 131112
> parent transid verify failed on 16984075436032 wanted 138384 found 131136
> parent transid verify failed on 16984075436032 wanted 138384 found 131136
> parent transid verify failed on 16984075436032 wanted 138384 found 131136
> Ignoring transid failure
> ERROR: child eb corrupted: parent bytenr=16984175853568 item=8 parent
> level=2 child level=0
> ERROR: failed to read block groups: Input/output error
> ERROR: cannot open file system

The extent tree is completely screwed up; no wonder the transid errors
happen.

I don't believe it's reasonably possible to restore the fs to RW status.

The only remaining method left is btrfs restore, then.
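For reference, a minimal btrfs restore sequence might look like the
sketch below.  The target directory /mnt/recovery is an assumption, as
is the choice of /dev/bcache0 (any of the three devices should do), and
<bytenr> stands for a tree root bytenr taken from the dump-super output:

    # Dry run first: -D only lists what would be restored, writing nothing.
    btrfs restore -D -v /dev/bcache0 /mnt/recovery

    # Real run: -m restores owner/mode/timestamps, -S symlinks, -x xattrs.
    btrfs restore -v -m -S -x /dev/bcache0 /mnt/recovery

    # If the default tree root is unreadable, -t can point at an
    # alternative tree root, e.g. one of the backup roots from dump-super.
    btrfs restore -t <bytenr> -v /dev/bcache0 /mnt/recovery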
> As I was running the commands, I accidentally ran the following:
>
>     btrfs inspect-internal dump-super -fFa >/dev/bcache0 2>&1
>
> effectively overwriting the first 10 KiB of the partition :(

That's not a problem at all.
Btrfs reserves the first 1 MiB of the device, so as long as you don't
overwrite the super block at [64K, 68K) you're completely fine.

Thanks,
Qu

>
> It seems the superblock starts at 64 KiB.  So, I hope, this would not
> cause any more damage.
>
> P.S. Thanks a lot for your reply, Qu Wenruo!
>
> Thank you,
> Illia
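As a follow-up to the superblock question above, one quick sanity check
(a sketch; only the device path is carried over from the thread, while
the offsets come from the btrfs on-disk format) is to read the
superblock magic directly, or to let btrfs-progs validate the whole
superblock:

    # The primary superblock starts at 64 KiB; the "_BHRfS_M" magic
    # string sits 64 bytes into it, i.e. at absolute byte offset 65600.
    dd if=/dev/bcache0 bs=1 skip=65600 count=8 2>/dev/null; echo

    # Or have btrfs-progs read it and verify its checksum:
    btrfs inspect-internal dump-super /dev/bcache0

If either reports garbage, the mirror copies at 64 MiB and 256 GiB (on
devices large enough to have them) can still be inspected with
dump-super's -s option.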