Re: BTRFS corruption: open_ctree failed

From: b11g <b11g@protonmail.com>
To: Chris Murphy <lists@colorremedies.com>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: BTRFS corruption: open_ctree failed
Date: Thu, 03 Jan 2019 13:55:54 +0000
Message-ID: <OAUyvYQ65eFoxQY0GmEgh96YvtyWfbnvB6ebVJkADANt8NfUcRVct_IdyQduy58Dnuv2ExOhmP4MpOX1Q7Ln-GK7kUQDz6Gt2T7jYSMv3yM=@protonmail.com>
In-Reply-To: <CAJCQCtR8qEyw51Ox2jU1E4nWBVfV4j8_nEcMgz2gTJa=yhv=Tw@mail.gmail.com>

Responded in-line.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, 3 January 2019 05:52, Chris Murphy <lists@colorremedies.com> wrote:

> On Wed, Jan 2, 2019 at 5:26 PM b11g <b11g@protonmail.com> wrote:
>
> > Hi all,
> > I have several BTRFS success stories, and I've been a happy user for quite a long time now. I was therefore surprised to face a BTRFS corruption on a system I'd just installed.
> > I use NixOS, unstable branch (Linux kernel 4.19.12). The system runs on an SSD with an ext4 boot partition, a simple btrfs root with some subvolumes, and some swap space only used for hibernation. I was working on my server as normal when I noticed all of my BTRFS subvolumes had been remounted ro. After a short time, I started getting various I/O errors ("bus error" reported by journalctl, "I/O error" by ls, etc.). I halted the system (hard reboot); on the next boot the BTRFS partition would not mount. I suspected the corruption to be disk-related, but smartctl does not show any warning for the disk, and the ext4 partition seems healthy.
> > These are the kernel messages logged when I attempt to mount the partition:
> > Jan 02 23:39:38 nixos kernel: BTRFS warning (device sdd2): sdd2 checksum verify failed on <L> wanted <A> found <B> level 0
> > Jan 02 23:39:38 nixos kernel: BTRFS error (device sdd2): failed to read block groups: -5
> > Jan 02 23:39:38 nixos systemd[1]: Started Cleanup of Temporary Directories.
> > Jan 02 23:39:38 nixos kernel: BTRFS error (device sdd2): open_ctree failed
>
> Do you have the entire kernel message from the previous boot when the
> problem started, including I/O errors? We kinda need to see what was
> going on leading up to the read only mount, and the bus and I/O
> errors. journalctl -b-1 -k should do it, or using journalctl
> --list-boots to find it. You can redirect to a file with > and then
> attach to the reply if it's small enough, or put it up somewhere like
> Dropbox or Google Drive if it's too big.

Sadly, I cannot find the journal for the boot in which the system failed: /var/log only has older entries, with no I/O errors. If you have any idea where else to look for logs, I can check.
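
For reference, a minimal sketch of the log collection described above, written as a small Python wrapper around journalctl. The helper names (journal_cmd, save_journal) and the restored path in the usage comment are hypothetical; the journalctl options used (--boot, -k, --no-pager, --directory) are the standard ones. Persistent journals normally live under /var/log/journal, so a copy of that directory pulled off the broken volume could be read with --directory.

```python
import subprocess


def journal_cmd(boot=-1, kernel_only=True, directory=None):
    """Build the journalctl argument list for one boot.

    boot=-1 selects the previous boot, as in `journalctl -b-1 -k`.
    directory optionally points at a copied journal directory.
    """
    cmd = ["journalctl", "--no-pager", f"--boot={boot}"]
    if kernel_only:
        cmd.append("-k")  # kernel messages only
    if directory is not None:
        cmd.append(f"--directory={directory}")
    return cmd


def save_journal(outfile, **kwargs):
    """Run journalctl and redirect its output to outfile."""
    with open(outfile, "w") as fh:
        subprocess.run(journal_cmd(**kwargs), stdout=fh, check=True)


# Hypothetical usage, reading journals restored from the broken volume:
# save_journal("prev-boot.log", directory="/mnt/restore/var/log/journal")
```

This only helps if the journal for the failing boot was actually flushed to disk before the filesystem went read-only.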

>
> btrfs rescue super -v /dev/sdd2
All Devices:
        Device: id = 1, name = /dev/sdd2

Before Recovering:
        [All good supers]:
                device name = /dev/sdd2
                superblock bytenr = 65536

                device name = /dev/sdd2
                superblock bytenr = <big N>

        [All bad supers]:

All supers are valid, no need to recover

> btrfs insp dump-s -f /dev/sdd2
superblock: bytenr=65536, device=/dev/sdd2
---------------------------------------------------------
csum_type               0 (crc32c)
csum_size               4
csum                    0x<C> [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    <ID>
label                   main
generation              6337
root                    <~10^10>
sys_array_size          97
chunk_root_generation   5976
root_level              1
chunk_root              <~10^7>
chunk_root_level        0
log_root                <~10^9>
log_root_transid        0
log_root_level          0
total_bytes             <X:~10^12>
bytes_used              <~10^12>
sectorsize              4096
nodesize                16384
leafsize (deprecated)           16384
stripesize              4096
root_dir                6
num_devices             1
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x169
                        ( MIXED_BACKREF |
                          COMPRESS_LZO |
                          BIG_METADATA |
                          EXTENDED_IREF |
                          SKINNY_METADATA )
cache_generation        6337
uuid_tree_generation    6337
dev_item.uuid           <ID2>
dev_item.fsid           <ID> [match]
dev_item.type           0
dev_item.total_bytes    <X:~10^12>
dev_item.bytes_used     <~10^12>
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          1
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0
sys_chunk_array[2048]:
        item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM <Y>)
                length <L> owner 2 stripe_len 65536 type SYSTEM
                io_align 4096 io_width 4096 sector_size 4096
                num_stripes 1 sub_stripes 0
                        stripe 0 devid 1 offset <Y>
                        dev_uuid <ID2>
backup_roots[4]:
        backup 0:
<...>
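
As a sanity check on the dump above, the incompat_flags value can be decoded by hand. The sketch below is illustrative only: the bit table follows the BTRFS_FEATURE_INCOMPAT_* constants as I understand them from the kernel headers, and decode_incompat is a hypothetical helper, not part of btrfs-progs.

```python
# Bit positions of the BTRFS_FEATURE_INCOMPAT_* flags (low ten bits only;
# assumed from the kernel headers, check against your kernel version).
INCOMPAT_FLAGS = {
    0: "MIXED_BACKREF",
    1: "DEFAULT_SUBVOL",
    2: "MIXED_GROUPS",
    3: "COMPRESS_LZO",
    4: "COMPRESS_ZSTD",
    5: "BIG_METADATA",
    6: "EXTENDED_IREF",
    7: "RAID56",
    8: "SKINNY_METADATA",
    9: "NO_HOLES",
}


def decode_incompat(value):
    """Return the names of the flag bits set in value, lowest bit first."""
    names = []
    bit = 0
    while value >> bit:
        if value & (1 << bit):
            names.append(INCOMPAT_FLAGS.get(bit, f"UNKNOWN_BIT_{bit}"))
        bit += 1
    return names


print(decode_incompat(0x169))
# ['MIXED_BACKREF', 'COMPRESS_LZO', 'BIG_METADATA', 'EXTENDED_IREF', 'SKINNY_METADATA']
```

0x169 is 0x100 + 0x40 + 0x20 + 0x8 + 0x1, which matches the five flags printed by dump-super, so the superblock's feature field at least looks sane.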

>
> Those are read-only. And also try to mount with -o usebackuproot and
> if that fails -o ro,usebackuproot is often more tolerant. But that's
> for getting data off the volume, it's more useful to know why the file
> system broke. And also why btrfs check is failing, given that it's a
> current version.

I got the data back using btrfs restore. Mounting with -o ro,usebackuproot fails with the same errors (open_ctree failed).

>
> If you get a chance you can take an image, maybe a Btrfs developer
> will find it useful to understand why the Btrfs check is failing.
>
>  btrfs-image -c9 -t4 -ss <dev> /path/to/fileoutput.image
>
> That is usually around 1/2 the size of file system metadata. It
> contains no data and filenames will be hashed.
>
>
> ------------------------------------------------------------------------------------------------------------------
>
> Chris Murphy

I tried to take an image but even that fails:
"btrfs-image -c9 -t4 -ss /dev/sdd2 /mnt/metadata.image"
checksum verify failed on <N> found <A> wanted <B>
checksum verify failed on <N> found <A> wanted <B>
Csum didn't match
ERROR: open ctree failed
ERROR: create failed: Success

-b11g