[-- Attachment #1: Type: text/plain, Size: 379 bytes --]

Trying to recover a filesystem that was corrupted by losing writes due to a failing caching device, I get the following error:

> ERROR: child eb corrupted: parent bytenr=2529690976256 item=0 parent level=2 child level=0

Trying to zero the journal or reinitialising the extent tree yields the same error. Is there any way to recover the filesystem? Relevant logs attached.

[-- Attachment #2: btrfs --]
[-- Type: application/octet-stream, Size: 1007 bytes --]

[liveuser@localhost-live btrfs-progs-5.3.1]$ ./btrfs --version
btrfs-progs v5.3.1
[liveuser@localhost-live btrfs-progs-5.3.1]$ sudo ./btrfs check /dev/bcache0
Opening filesystem to check...
parent transid verify failed on 2529691090944 wanted 319147 found 314912
parent transid verify failed on 2529691090944 wanted 319147 found 310171
parent transid verify failed on 2529691090944 wanted 319147 found 314912
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=2529690976256 item=0 parent level=2 child level=0
ERROR: cannot open file system
[liveuser@localhost-live btrfs-progs-5.3.1]$ sudo ./btrfs rescue zero-log /dev/bcache0
parent transid verify failed on 2529691090944 wanted 319147 found 314912
parent transid verify failed on 2529691090944 wanted 319147 found 310171
parent transid verify failed on 2529691090944 wanted 319147 found 314912
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=2529690976256 item=0 parent level=2 child level=0
ERROR: could not open ctree

[-- Attachment #3: dmesg --]
[-- Type: application/octet-stream, Size: 514 bytes --]

[ 207.230521] BTRFS info (device bcache1): disk space caching is enabled
[ 207.230526] BTRFS info (device bcache1): has skinny extents
[ 207.478890] BTRFS error (device bcache1): parent transid verify failed on 2529691090944 wanted 319147 found 310171
[ 207.491729] BTRFS error (device bcache1): parent transid verify failed on 2529691090944 wanted 319147 found 314912
[ 207.491741] BTRFS error (device bcache1): failed to read block groups: -5
[ 207.503087] BTRFS error (device bcache1): open_ctree failed

[-- Attachment #4: Type: text/plain, Size: 12 bytes --]

--
Gard
On 1.12.19 г. 19:27 ч., Gard Vaaler wrote:
> Trying to recover a filesystem that was corrupted by losing writes due to a failing caching device, I get the following error:
>> ERROR: child eb corrupted: parent bytenr=2529690976256 item=0 parent level=2 child level=0
>
> Trying to zero the journal or reinitialising the extent tree yields the same error. Is there any way to recover the filesystem? Relevant logs attached.

Provide more information about your storage stack.
> 1. des. 2019 kl. 19:51 skrev Nikolay Borisov <nborisov@suse.com>:
> On 1.12.19 г. 19:27 ч., Gard Vaaler wrote:
>> Trying to recover a filesystem that was corrupted by losing writes due to a failing caching device, I get the following error:
>>> ERROR: child eb corrupted: parent bytenr=2529690976256 item=0 parent level=2 child level=0
>>
>> Trying to zero the journal or reinitialising the extent tree yields the same error. Is there any way to recover the filesystem? Relevant logs attached.
>
> Provide more information about your storage stack.
Nothing special: SATA disks with (now-detached) SATA SSDs.
--
Gard
> 1. des. 2019 kl. 18:27 skrev Gard Vaaler <gardv@megacandy.net>:
>
> Trying to recover a filesystem that was corrupted by losing writes due to a failing caching device, I get the following error:
>> ERROR: child eb corrupted: parent bytenr=2529690976256 item=0 parent level=2 child level=0
>
> Trying to zero the journal or reinitialising the extent tree yields the same error. Is there any way to recover the filesystem?

Update: using 5.4, btrfs claims to have zeroed the journal:

> [liveuser@localhost-live btrfs-progs-5.4]$ sudo ./btrfs rescue zero-log /dev/bcache0
> Clearing log on /dev/bcache0, previous log_root 2529694416896, level 0

... but still complains about the journal on mount:

> [ 703.964344] BTRFS info (device bcache1): disk space caching is enabled
> [ 703.964347] BTRFS info (device bcache1): has skinny extents
> [ 704.215748] BTRFS error (device bcache1): parent transid verify failed on 2529691090944 wanted 319147 found 310171
> [ 704.216131] BTRFS error (device bcache1): parent transid verify failed on 2529691090944 wanted 319147 found 314912
> [ 704.216137] BTRFS error (device bcache1): failed to read block groups: -5
> [ 704.227110] BTRFS error (device bcache1): open_ctree failed

--
Gard
On Wed, Dec 4, 2019 at 8:50 AM Gard Vaaler <gardv@megacandy.net> wrote:
>
> > 1. des. 2019 kl. 18:27 skrev Gard Vaaler <gardv@megacandy.net>:
> >
> > Trying to recover a filesystem that was corrupted by losing writes due to a failing caching device, I get the following error:
> >> ERROR: child eb corrupted: parent bytenr=2529690976256 item=0 parent level=2 child level=0
> >
> > Trying to zero the journal or reinitialising the extent tree yields the same error. Is there any way to recover the filesystem?
>
> Update: using 5.4, btrfs claims to have zeroed the journal:
>
> > [liveuser@localhost-live btrfs-progs-5.4]$ sudo ./btrfs rescue zero-log /dev/bcache0
> > Clearing log on /dev/bcache0, previous log_root 2529694416896, level 0
>
> ... but still complains about the journal on mount:
>
> > [ 703.964344] BTRFS info (device bcache1): disk space caching is enabled
> > [ 703.964347] BTRFS info (device bcache1): has skinny extents
> > [ 704.215748] BTRFS error (device bcache1): parent transid verify failed on 2529691090944 wanted 319147 found 310171
> > [ 704.216131] BTRFS error (device bcache1): parent transid verify failed on 2529691090944 wanted 319147 found 314912
> > [ 704.216137] BTRFS error (device bcache1): failed to read block groups: -5
> > [ 704.227110] BTRFS error (device bcache1): open_ctree failed
>
Why do you think it's complaining about the journal? I'm not seeing
tree log related messages here. Is the output provided complete or are
there additional messages? What do you get for:
btrfs insp dump-s /dev/X
What kernel version was being used at the time of the first problem instance?
The transid messages above suggest some kind of failure to actually
commit what should have ended up on stable media. Also please provide:
btrfs-find-root /dev/
btrfs check --mode=lowmem /dev/
The latter will take a while, and since it is an offline check it will
need to be done from an initramfs, or better from live media, which
makes it easier to capture the output. I recommend btrfs-progs no
older than 5.1.1 if possible. Since this is only a check, not --repair,
the version matters somewhat less as long as it's not too old.
--
Chris Murphy
[-- Attachment #1: Type: text/plain, Size: 482 bytes --]

> 4. des. 2019 kl. 20:08 skrev Chris Murphy <lists@colorremedies.com>:
>
> Why do you think it's complaining about the journal? I'm not seeing
> tree log related messages here.

Thanks for the reply! That must be a misunderstanding on my part (it's called "transid", which suggested something in the journal to me).

> Is the output provided complete or are
> there additional messages?

No, that's it.

> What do you get for:
>
> btrfs insp dump-s /dev/X

Attached.

[-- Attachment #2: btrfs-insp-dump --]
[-- Type: application/octet-stream, Size: 2682 bytes --]

[liveuser@localhost-live btrfs-progs-5.4]$ sudo ./btrfs insp dump-s /dev/bcache0
superblock: bytenr=65536, device=/dev/bcache0
---------------------------------------------------------
csum_type               0 (crc32c)
csum_size               4
csum                    0xdaa8bba5 [match]
bytenr                  65536
flags                   0x1 ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    8c4a9e0d-bfe9-4b8f-be8f-1899c58b00b3
metadata_uuid           8c4a9e0d-bfe9-4b8f-be8f-1899c58b00b3
label
generation              319148
root                    2529691058176
sys_array_size          129
chunk_root_generation   298799
root_level              1
chunk_root              2534052790272
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             6000110088192
bytes_used              3739095216128
sectorsize              4096
nodesize                16384
leafsize (deprecated)   16384
stripesize              4096
root_dir                6
num_devices             2
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x161 ( MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA )
cache_generation        319148
uuid_tree_generation    13
dev_item.uuid           7215ede5-5997-47c2-96e3-4b43f67f1eb6
dev_item.fsid           8c4a9e0d-bfe9-4b8f-be8f-1899c58b00b3 [match]
dev_item.type           0
dev_item.total_bytes    2000398925824
dev_item.bytes_used     1515066556416
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          2
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0
[liveuser@localhost-live btrfs-progs-5.4]$ sudo ./btrfs insp dump-s /dev/bcache1
superblock: bytenr=65536, device=/dev/bcache1
---------------------------------------------------------
csum_type               0 (crc32c)
csum_size               4
csum                    0xf1f043cd [match]
bytenr                  65536
flags                   0x1 ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    8c4a9e0d-bfe9-4b8f-be8f-1899c58b00b3
metadata_uuid           8c4a9e0d-bfe9-4b8f-be8f-1899c58b00b3
label
generation              319148
root                    2529691058176
sys_array_size          129
chunk_root_generation   298799
root_level              1
chunk_root              2534052790272
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             6000110088192
bytes_used              3739095216128
sectorsize              4096
nodesize                16384
leafsize (deprecated)   16384
stripesize              4096
root_dir                6
num_devices             2
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x161 ( MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA )
cache_generation        319148
uuid_tree_generation    13
dev_item.uuid           6f60e735-3829-4223-aa13-dbb377fa28ff
dev_item.fsid           8c4a9e0d-bfe9-4b8f-be8f-1899c58b00b3 [match]
dev_item.type           0
dev_item.total_bytes    3999711162368
dev_item.bytes_used     3407004565504
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          1
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0

[-- Attachment #3: Type: text/plain, Size: 317 bytes --]

> What kernel version was being used at the time of the first problem instance?

Fedora's 5.2.8-300 kernel.

> The transid messages above suggest some kind of failure to actually
> commit what should have ended up on stable media. Also please provide:
>
> btrfs-find-root /dev/

Attached (compressed).

[-- Attachment #4: btrfs-find-root.xz --]
[-- Type: application/octet-stream, Size: 8096 bytes --]

[-- Attachment #5: Type: text/plain, Size: 47 bytes --]

> btrfs check --mode=lowmem /dev/

Attached.

[-- Attachment #6: btrfs-check --]
[-- Type: application/octet-stream, Size: 1086 bytes --]

[liveuser@localhost-live btrfs-progs-5.4]$ sudo ./btrfs check --mode=lowmem /dev/bcache0
Opening filesystem to check...
parent transid verify failed on 2529691090944 wanted 319147 found 314912
parent transid verify failed on 2529691090944 wanted 319147 found 310171
parent transid verify failed on 2529691090944 wanted 319147 found 314912
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=2529690976256 item=0 parent level=2 child level=0
ERROR: failed to read block groups: Input/output error
ERROR: cannot open file system
[liveuser@localhost-live btrfs-progs-5.4]$ sudo ./btrfs check --mode=lowmem /dev/bcache1
Opening filesystem to check...
parent transid verify failed on 2529691090944 wanted 319147 found 314912
parent transid verify failed on 2529691090944 wanted 319147 found 310171
parent transid verify failed on 2529691090944 wanted 319147 found 314912
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=2529690976256 item=0 parent level=2 child level=0
ERROR: failed to read block groups: Input/output error
ERROR: cannot open file system

[-- Attachment #7: Type: text/plain, Size: 568 bytes --]

> The latter will take a while and since it is an offline check will
> need to be done in initramfs, or better from Live media which will
> make it easier to capture the output. I recommend btrfs-progs not
> older than 5.1.1 if possible. It is only for check, not with --repair,
> so the version matters somewhat less if it's not too old.

As you can see, it terminates almost immediately with an IO error. However, there's no error in the dmesg on the underlying device, which makes me think there's a bad bounds check or something similar.

--
Gard
On Wed, Dec 4, 2019 at 1:17 PM Gard Vaaler <gardv@megacandy.net> wrote:
>
> > 4. des. 2019 kl. 20:08 skrev Chris Murphy <lists@colorremedies.com>:
> > Why do you think it's complaining about the journal? I'm not seeing
> > tree log related messages here.
>
> Thanks for the reply! That must be a misunderstanding on my part (it's called "transid", which suggested something in the journal to me).

Gotcha, yeah, transid is just the way Btrfs keeps track of separate commits over time. In effect the file system itself is the journal; there is no separate dedicated journal on Btrfs like you see on ext4 or XFS. For fsync performance enhancement there is a log tree, which is somewhat like a journal, and that is what zero-log wipes away. Pretty much on all file systems, it's best to allow log replay to happen before zeroing it, and to zero it only if there's a problem reported about it, rather than as an early troubleshooting step.

> > Is the output provided complete or are
> > there additional messages?
>
> No, that's it.
>
> > What do you get for:
> >
> > btrfs insp dump-s /dev/X

OK, so no log tree, therefore not related.

> Attached.
>
> > What kernel version was being used at the time of the first problem instance?
>
> Fedora's 5.2.8-300 kernel.

There's a decent chance this is the cause of the problem. That kernel does not have the fix for this bug:

https://www.spinics.net/lists/stable-commits/msg129532.html
https://bugzilla.redhat.com/show_bug.cgi?id=1751901

As far as I'm aware the corruption isn't fixable. You might still be able to mount the file system ro to get data out; if not, there's a decent chance you can extract data with btrfs restore, which is an offline scraping tool, but it is a bit tedious to use.
https://btrfs.wiki.kernel.org/index.php/Restore

The real fix is to make a new Btrfs file system, and don't use kernels 5.2.0-5.2.14. Fedora 29, 30, and 31 are all long since on 5.3.x series kernels, which do have this fix incorporated. But the fix found its way into 5.2.15 pretty soon after discovery, so I'm gonna guess you've got updates disabled and just got unlucky enough to get hit by this bug.

> > The transid messages above suggest some kind of failure to actually
> > commit what should have ended up on stable media. Also please provide:
> >
> > btrfs-find-root /dev/
>
> Attached (compressed).
>
> > btrfs check --mode=lowmem /dev/
>
> Attached.
>
> > The latter will take a while and since it is an offline check will
> > need to be done in initramfs, or better from Live media which will
> > make it easier to capture the output. I recommend btrfs-progs not
> > older than 5.1.1 if possible. It is only for check, not with --repair,
> > so the version matters somewhat less if it's not too old.
>
> As you can see, it terminates almost immediately with an IO error. However, there's no error in the dmesg on the underlying device, which makes me think there's a bad bounds check or something similar.

--
Chris Murphy
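[Editorial note: the affected-kernel range mentioned above (5.2.0-5.2.14, fixed in 5.2.15) can be checked mechanically. The sketch below is illustrative only; the `affected` helper is not a btrfs-progs tool, and the version strings fed in are samples (the Fedora one is the kernel reported in this thread). On a live system you would pass in `uname -r`.]

```shell
#!/bin/sh
# Sketch: flag kernel versions in the 5.2.0-5.2.14 range, which lack the
# fix referenced above. Helper and sample inputs are hypothetical.
affected() {
    case "$1" in
        5.2.1[0-4]|5.2.1[0-4].*|5.2.1[0-4]-*) echo affected ;;
        5.2.[0-9]|5.2.[0-9].*|5.2.[0-9]-*)    echo affected ;;
        *)                                    echo ok ;;
    esac
}

affected 5.2.8-300.fc30.x86_64   # the kernel reported in this thread
affected 5.2.15                  # first 5.2.x release with the fix
affected 5.3.7
```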
> 4. des. 2019 kl. 22:09 skrev Chris Murphy <lists@colorremedies.com>:
>
> There's a decent chance this is the cause of the problem. That kernel
> does not have the fix for this bug:
> https://www.spinics.net/lists/stable-commits/msg129532.html
> https://bugzilla.redhat.com/show_bug.cgi?id=1751901
>
> As far as I'm aware the corruption isn't fixable. You might still be
> able to mount the file system ro to get data out; if not then decent
> chance you can extract data with btrfs restore, which is an offline
> scraping tool, but it is a bit tedious to use.
> https://btrfs.wiki.kernel.org/index.php/Restore

That was my first thought too, but it seems too coincidental that I should happen across this bug at the same instant as my cache device failing. btrfs-restore doesn't like my filesystem either:

> [liveuser@localhost-live btrfs-progs-5.4]$ sudo ./btrfs restore -Divvv /dev/bcache0 /mnt
> This is a dry-run, no files are going to be restored
> parent transid verify failed on 3719816445952 wanted 317513 found 313040
> parent transid verify failed on 3719816445952 wanted 317513 found 308297
> parent transid verify failed on 3719816445952 wanted 317513 found 313040
> Ignoring transid failure
> leaf parent key incorrect 3719816445952
> Error searching -1

--
Gard
On Wed, Dec 4, 2019 at 5:34 PM Gard Vaaler <gardv@megacandy.net> wrote:
>
> > 4. des. 2019 kl. 22:09 skrev Chris Murphy <lists@colorremedies.com>:
> > There's a decent chance this is the cause of the problem. That kernel
> > does not have the fix for this bug:
> > https://www.spinics.net/lists/stable-commits/msg129532.html
> > https://bugzilla.redhat.com/show_bug.cgi?id=1751901
> >
> > As far as I'm aware the corruption isn't fixable. You might still be
> > able to mount the file system ro to get data out; if not then decent
> > chance you can extract data with btrfs restore, which is an offline
> > scraping tool, but it is a bit tedious to use.
> > https://btrfs.wiki.kernel.org/index.php/Restore
>
> That was my first thought too, but it seems too coincidental that I should happen across this bug at the same instant as my cache device failing. btrfs-restore doesn't like my filesystem either:

You know, I totally glossed over the cache device failing part of the very first message 8-\ But yeah, it would seem like the cache device dropped a bunch of metadata. Really a lot more than I'd expect from the aforementioned kernel bug. So chances are your suspicion is spot on.

> > [liveuser@localhost-live btrfs-progs-5.4]$ sudo ./btrfs restore -Divvv /dev/bcache0 /mnt
> > This is a dry-run, no files are going to be restored
> > parent transid verify failed on 3719816445952 wanted 317513 found 313040
> > parent transid verify failed on 3719816445952 wanted 317513 found 308297
> > parent transid verify failed on 3719816445952 wanted 317513 found 313040
> > Ignoring transid failure
> > leaf parent key incorrect 3719816445952
> > Error searching -1

You might have to try a lot of the btrfs-find-root block addresses (start with the highest transid, working down) with the btrfs restore -t option to force it to use older roots. Maybe one of them will be intact. It's also possible to isolate the restore to a subvolume, if you have home on a subvolume, for example. Unfortunately btrfs restore isn't a simple scraper; it doesn't iterate. You have to do that part. It is tedious.

--
Chris Murphy
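[Editorial note: the highest-transid-first iteration described above can be sketched in shell. The `Well block ...` lines below are a hypothetical sample in the format btrfs-find-root prints; the restore command itself is left commented out, since it has to run against the real device.]

```shell
#!/bin/sh
# Sketch: rank candidate tree roots from btrfs-find-root output by
# generation (transid), newest first, so each can be tried in turn with
# `btrfs restore -t <bytenr>`. Sample lines are hypothetical.
cat > /tmp/find-root.txt <<'EOF'
Well block 2529690976256(gen: 319147 level: 2) seems good, but generation/level doesn't match, want gen: 319148 level: 1
Well block 3719816445952(gen: 317513 level: 1) seems good, but generation/level doesn't match, want gen: 319148 level: 1
Well block 2529691090944(gen: 314912 level: 0) seems good, but generation/level doesn't match, want gen: 319148 level: 1
EOF

# Extract "generation bytenr" pairs and sort by generation, descending.
sed -n 's/.*block \([0-9][0-9]*\)(gen: \([0-9][0-9]*\).*/\2 \1/p' /tmp/find-root.txt \
    | sort -rn -k1,1 > /tmp/roots.txt
cat /tmp/roots.txt

# On the real system, each candidate would then be dry-run tested until
# one yields a usable tree:
#   while read -r gen addr; do
#       sudo btrfs restore -t "$addr" -Dv /dev/bcache0 /mnt && break
#   done < /tmp/roots.txt
```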
[-- Attachment #1: Type: text/plain, Size: 1481 bytes --]

On Mon, Dec 02, 2019 at 10:27:49PM +0100, Gard Vaaler wrote:
> > 1. des. 2019 kl. 19:51 skrev Nikolay Borisov <nborisov@suse.com>:
> > On 1.12.19 г. 19:27 ч., Gard Vaaler wrote:
> >> Trying to recover a filesystem that was corrupted by losing writes due to a failing caching device, I get the following error:
> >>> ERROR: child eb corrupted: parent bytenr=2529690976256 item=0 parent level=2 child level=0
> >>
> >> Trying to zero the journal or reinitialising the extent tree yields the same error. Is there any way to recover the filesystem? Relevant logs attached.
> >
> > Provide more information about your storage stack.
>
> Nothing special: SATA disks with (now-detached) SATA SSDs.

Is it a pair of 2x (bcache-on-disk) in raid1? Did both cache devices fail? Were they configured as writeback caches? Does the drive firmware have bugs that affect either btrfs or bcache?

If the caches are independent (no shared caches or disks), you had only one cache device failure, and the filesystem is btrfs raid1, then the non-failing cache should be OK and can be used to recover the contents of the failed device. You'll need at least one pair of cache and disk to be up and running.

If any of those conditions are false, then it's probably toast. btrfs will reject a filesystem missing just one write--a filesystem missing thousands or millions of writes due to a writeback cache failure is going to be data soup.

> --
> Gard

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
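[Editorial note: the writeback question above can be answered from userspace via bcache's sysfs interface, which prints all cache modes with the active one in brackets (e.g. "writethrough [writeback] writearound none" in /sys/block/bcacheN/bcache/cache_mode). The parsing helper below is our own sketch, not a bcache tool.]

```shell
#!/bin/sh
# Sketch: report the active bcache cache mode by extracting the
# bracketed word from the sysfs cache_mode attribute's output format.
current_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# On a live system:
#   current_mode < /sys/block/bcache0/bcache/cache_mode
# Here, a hypothetical sample of that attribute's contents:
echo 'writethrough [writeback] writearound none' | current_mode
```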