Re: How to Fix 'Error: could not find extent items for root 257'?

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Chiung-Ming Huang <photon3108@gmail.com>
Cc: Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: How to Fix 'Error: could not find extent items for root 257'?
Date: Sat, 15 Feb 2020 12:29:51 +0800
Message-ID: <b52cfa2d-ba1b-f013-c0f0-5c0cd0008210@gmx.com>
In-Reply-To: <CAEOGEKHrN_i2fSb=iWY3yCLRjCuU1Hn08trCyg0Br9fJasjK6A@mail.gmail.com>

On 2020/2/15 11:47 AM, Chiung-Ming Huang wrote:
> Hi Qu
> 
> Thanks for your reply. That's really helpful. BTW, I just read this URL and
> the mail thread linked from it: https://unix.stackexchange.com/a/345972
> It seems to say that if a raid1 is degraded, even if it is mounted rw, no
> operations other than btrfs-replace or btrfs-balance should be run on it.

That would be the best case.

> 
> Does it mean the degraded raid1 should not be used for both
> btrfs-replace/balance and the original server rw services at the same time?

No, as long as the fs stays mounted, a degraded RAID1 can in fact be
pretty safe.
At least in my experience, all the problems happen when we try to mount the
fs again using a mix of up-to-date disks and out-of-date disks.

For a running degraded fs, btrfs knows which device is missing, so it just
submits reads/writes to the existing devices, and replace/balance can both
handle that case.

> 
> For example, I put a PostgreSQL DB on btrfs raid1 and I thought one of the
> two raid1 copies was my backup. Even if I lost one copy, the service could
> keep running on the other one immediately. Okay, maybe not immediately. I
> need to reboot.

You'd better not reboot, or at least not reboot directly into normal
operation with the bad disk still attached.

> But waiting 24 hours or longer, depending on the size of the data, for
> btrfs-replace/balance to complete does not seem to be a good idea.

Btrfs-replace works just like scrub in that it only copies/verifies data
on a certain disk. It's not rewriting/verifying the whole fs, but I
understand that it can still be very slow.

For btrfs-replace, you can just run the replace in the background.
Replace has extra protection to keep data from going out of sync.

In short, for your case, it looks like the problem happened between some of
your degraded mounts, which screwed up some metadata blocks because the
metadata went out of sync.

To avoid such problems, it may be a good idea to let btrfs use the
superblock generation to find out which device is out-of-date, and then
self-resilver, or at least avoid reading data/metadata from the old device.
But that feature will need extra consideration before we even try to
implement it.
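
A minimal sketch of the generation comparison described above, not an
existing btrfs feature. It assumes the per-device superblock generations have
already been collected somehow (for example from the `generation` field that
`btrfs inspect-internal dump-super` prints). The function name, the device
paths and the generation numbers are all made up for illustration.

```python
def find_stale_devices(generations):
    """Return the devices whose superblock generation is behind the newest one.

    `generations` maps a device path to its superblock generation.
    """
    if not generations:
        return []
    newest = max(generations.values())
    # Any device below the newest generation missed at least one transaction.
    return sorted(dev for dev, gen in generations.items() if gen < newest)


# Hypothetical values, not taken from this thread.
example = {"/dev/sda": 1200, "/dev/sdb": 1200, "/dev/sdc": 1174}
print(find_stale_devices(example))
```

A device reported here would be the one to keep detached, or to avoid
reading from, until it has been replaced.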

So currently my only practical recommendation is: if you find one
disk failing, please remove it completely and make sure it never shows
up again before you remount the fs.
Then you can safely replace/remount.

Thanks,
Qu
> 
> Regards,
> Chiung-Ming Huang
> 
> 
> On Mon, Feb 10, 2020 at 3:03 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/2/10 2:50 PM, Chiung-Ming Huang wrote:
>>> On Fri, Feb 7, 2020 at 3:16 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/2/7 2:16 PM, Chiung-Ming Huang wrote:
>>>>> On Fri, Feb 7, 2020 at 12:00 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>>
>>>>>> All these subvolumes had a missing root dir. That's not good either.
>>>>>> I guess btrfs-restore is your last chance, or RO mount with my
>>>>>> rescue=skipbg patchset:
>>>>>> https://patchwork.kernel.org/project/linux-btrfs/list/?series=170715
>>>>>>
>>>>>
>>>>> Is it possible to use the original disks to keep the restored data
>>>>> safely? I would like to restore the data of /dev/bcache3 to the new
>>>>> btrfs RAID0 first and then add it to the new btrfs RAID0. Does
>>>>> `btrfs restore` need metadata or anything else on /dev/bcache3 to
>>>>> restore /dev/bcache2 and /dev/bcache4?
>>>>
>>>> Devid 1 (bcache2) seems OK to be missing, as all its data and metadata
>>>> are in RAID1.
>>>>
>>>> But it's strongly recommended to test without wiping bcache2 first, to
>>>> make sure btrfs-restore can salvage enough data, and only then wipe
>>>> bcache2.
>>>>
>>>> Thanks,
>>>> Qu
>>>
>>> Is it possible to shrink the size of the bcache2 btrfs without making
>>> everything worse? I need more disk space but I still need bcache2 itself.
>>
>> That is kind of possible, but please keep in mind that, even in the best
>> case, it still needs to write some (very small amount of) metadata into
>> the fs, thus I can't guarantee that it won't make things worse, or that
>> it is even possible without the fs falling back to RO.
>>
>> You need to dump the device extent tree to determine where the last
>> dev extent ends for each device, then shrink to that size.
>>
>> Some example here:
>>
>> # btrfs ins dump-tree -t dev /dev/nvme/btrfs
>> ...
>>
>>         item 6 key (1 DEV_EXTENT 2169503744) itemoff 15955 itemsize 48
>>                 dev extent chunk_tree 3
>>                 chunk_objectid 256 chunk_offset 2169503744 length 1073741824
>>                 chunk_tree_uuid 00000000-0000-0000-0000-000000000000
>>
>> Here in the key, 1 is the devid and 2169503744 is the offset where the
>> device extent starts. 1073741824 is the length of the device extent.
>>
>> In the above case, the device with devid 1 can be resized to 2169503744 +
>> 1073741824 = 3243245568 bytes without relocating any data/metadata.
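
The arithmetic above can be sketched as a small script. This is only an
illustration: it assumes the dump-tree text has already been captured into a
string and that it has the same layout as the excerpt quoted above. The names
`min_device_sizes`, `KEY_RE`, `LENGTH_RE` and `SAMPLE` are hypothetical and
not part of btrfs-progs.

```python
import re

# Matches the key of a dev extent item: (devid DEV_EXTENT start_offset)
KEY_RE = re.compile(r"key \((\d+) DEV_EXTENT (\d+)\)")
# Matches the length field on the chunk_objectid line of the same item
LENGTH_RE = re.compile(r"chunk_offset \d+ length (\d+)")


def min_device_sizes(dump_text):
    """Return {devid: smallest size in bytes that still covers every dev extent}."""
    sizes = {}
    current = None  # (devid, start) of the dev extent item being parsed
    for line in dump_text.splitlines():
        key = KEY_RE.search(line)
        if key:
            current = (int(key.group(1)), int(key.group(2)))
            continue
        length = LENGTH_RE.search(line)
        if length and current is not None:
            devid, start = current
            end = start + int(length.group(1))
            # Keep the highest end offset seen for this device.
            sizes[devid] = max(sizes.get(devid, 0), end)
            current = None
    return sizes


# The dev extent item quoted above, copied as sample input.
SAMPLE = """\
        item 6 key (1 DEV_EXTENT 2169503744) itemoff 15955 itemsize 48
                dev extent chunk_tree 3
                chunk_objectid 256 chunk_offset 2169503744 length 1073741824
                chunk_tree_uuid 00000000-0000-0000-0000-000000000000
"""

for devid, size in sorted(min_device_sizes(SAMPLE).items()):
    print(f"{devid}:{size}")  # same devid:size form that the resize command takes
```

For the sample item this gives 3243245568 bytes for devid 1, the same value
used in the resize command that follows.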
>>
>> # time btrfs fi resize 1:3243245568 /mnt/btrfs/
>> Resize '/mnt/btrfs/' of '1:3243245568'
>>
>> real    0m0.013s
>> user    0m0.006s
>> sys     0m0.004s
>>
>> And the dump-tree shows the same last device extent:
>> ...
>>         item 6 key (1 DEV_EXTENT 2169503744) itemoff 15955 itemsize 48
>>                 dev extent chunk_tree 3
>>                 chunk_objectid 256 chunk_offset 2169503744 length 1073741824
>>                 chunk_tree_uuid 00000000-0000-0000-0000-000000000000
>>
>> (Maybe it's a good time to implement something like a fast shrink for btrfs-progs)
>>
>> Of course, after resizing btrfs, you still need to resize bcache, but
>> that's not related to btrfs (and I am not familiar with bcache either).
>>
>> Thanks,
>> Qu
>>
>>>
>>> Regards,
>>> Chiung-Ming Huang
>>>
>>>
>>>>>
>>>>> /dev/bcache2, ID: 1
>>>>>    Device size:             9.09TiB
>>>>>    Device slack:              0.00B
>>>>>    Data,RAID1:              3.93TiB
>>>>>    Metadata,RAID1:          2.00GiB
>>>>>    System,RAID1:           32.00MiB
>>>>>    Unallocated:             5.16TiB
>>>>>
>>>>> /dev/bcache3, ID: 3
>>>>>    Device size:             2.73TiB
>>>>>    Device slack:              0.00B
>>>>>    Data,single:           378.00GiB
>>>>>    Data,RAID1:            355.00GiB
>>>>>    Metadata,single:         2.00GiB
>>>>>    Metadata,RAID1:         11.00GiB
>>>>>    Unallocated:             2.00TiB
>>>>>
>>>>> /dev/bcache4, ID: 5
>>>>>    Device size:             9.09TiB
>>>>>    Device slack:              0.00B
>>>>>    Data,single:             2.93TiB
>>>>>    Data,RAID1:              4.15TiB
>>>>>    Metadata,single:         6.00GiB
>>>>>    Metadata,RAID1:         11.00GiB
>>>>>    System,RAID1:           32.00MiB
>>>>>    Unallocated:             2.00TiB
>>>>>
>>>>> Regards,
>>>>> Chiung-Ming Huang
>>>>>
>>>>
>>
