On 2020/7/2 5:36 AM, Illia Bobyr wrote:
> On 7/1/2020 3:48 AM, Qu Wenruo wrote:
>> On 2020/7/1 6:16 PM, Illia Bobyr wrote:
>>> On 6/30/2020 6:36 PM, Qu Wenruo wrote:
>>>> On 2020/7/1 3:41 AM, Illia Bobyr wrote:
>>>>> [...]
>>>> Looks like some tree blocks were not written back correctly.
>>>>
>>>> Considering we don't have known write-back related bugs with 5.6, I
>>>> guess bcache may be involved again?
>>> A bit more detail: the system started to misbehave.
>>> The interactive session was saying that the main file system had become read-only.
>> Any dmesg of that RO event?
>> That would be the most valuable info to help us locate the bug and
>> fix it.
>>
>> I guess something went wrong before that, and somehow it
>> corrupted the extent tree, breaking the life-keeping COW of metadata and
>> screwing up everything.
>
> After I restore the data, I will check the kernel log to see if
> there are any messages in there.
> Will post here if I find anything.
>
>>> [...]
>>>> In this case, I guess "btrfs ins dump-super -fFa" output would help to
>>>> show if it's possible to recover.
>>> Here is the output: https://pastebin.com/raw/DtJd813y
>> OK, the backup root is fine.
>>
>> So this means metadata COW is corrupted, which caused the transid mismatch.
>>
>>>> Anyway, something looks strange.
>>>>
>>>> The backup roots having a newer generation while the super block is still
>>>> old doesn't look correct at all.
>>> Just in case, here is the output of "btrfs check", as suggested by "A L
>>> ".  It does not seem to contain any new information.
>>>
>>> parent transid verify failed on 16984014372864 wanted 138350 found 131117
>>> parent transid verify failed on 16984014405632 wanted 138350 found 131127
>>> parent transid verify failed on 16984013406208 wanted 138350 found 131112
>>> parent transid verify failed on 16984075436032 wanted 138384 found 131136
>>> parent transid verify failed on 16984075436032 wanted 138384 found 131136
>>> parent transid verify failed on 16984075436032 wanted 138384 found 131136
>>> Ignoring transid failure
>>> ERROR: child eb corrupted: parent bytenr=16984175853568 item=8 parent
>>> level=2 child level=0
>>> ERROR: failed to read block groups: Input/output error
>> The extent tree is completely screwed up, no wonder the transid errors happen.
>>
>> I don't believe it's reasonably possible to restore the fs to RW status.
>> The only remaining method left is btrfs-restore then.
>
> There are no more available SATA connections in the system and there is
> a lot of data in that FS (~7TB).
> I do not immediately have another disk that would be able to hold this much.
>
> At the same time this FS is RAID0.
> I wonder if there is a way to first check whether restore will work if I
> disconnect half of the disks, as each half contains all the data.
> And then, if it does, I would be able to restore by reusing the space on
> one of the mirrors.

Yes, there is.

We have the out-of-tree rescue mount options patchset.
It allows you to mount the fs RO even with the extent tree completely corrupted.

It's in David's misc-next branch already:
https://github.com/kdave/btrfs-devel/tree/misc-next

Then you can try to mount the fs with "-o ro,rescue=skipbg,rescue=nologreplay"
and do your tests on what can be salvaged and what cannot, as if your fs
were still alive.

This should provide a more flexible solution compared to btrfs-restore,
but it requires compiling the kernel.
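For reference, the rough sequence would look something like the following
(the device path /dev/sdX and the mount point /mnt are just placeholders,
and the exact kernel build/install steps depend on your distro):

    # fetch the branch that carries the rescue mount options
    git clone -b misc-next https://github.com/kdave/btrfs-devel.git
    cd btrfs-devel
    # build and install the kernel the usual way for your distro,
    # boot into it, then (as root) try the read-only rescue mount:
    mount -o ro,rescue=skipbg,rescue=nologreplay /dev/sdX /mnt

If that mount succeeds, you can read out whatever is still intact before
deciding how to reuse the disks.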
>
> I see "-D: Dry run" that can be passed to "btrfs restore", but, I guess,
> it would not really do a full check of the data, making sure that the
> restore would really succeed, would it?

It would only check the metadata, but that should cover most of the risks.

Thanks,
Qu

>
> Is there a way to perform this kind of check?
> Or is "btrfs restore" the only option at the moment?
>
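P.S. In case it helps while the rebuilt kernel is not ready yet: a dry-run
restore at least exercises the metadata without writing anything. Something
like the following (device and target paths are placeholders; with -D no
files are actually written to the target):

    mkdir -p /mnt/scratch
    btrfs restore -D /dev/sdX /mnt/scratch

If that walks the trees without erroring out, a real restore onto the
freed-up disks has a reasonable chance of succeeding.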