Re: [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Hans van Kranenburg <Hans.van.Kranenburg@mendix.com>,
	"dsterba@suse.cz" <dsterba@suse.cz>,
	Nikolay Borisov <nborisov@suse.com>, Qu Wenruo <wqu@suse.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time
Date: Thu, 3 Jan 2019 09:10:14 +0800	[thread overview]
Message-ID: <ee37d670-b5c1-2bb3-7533-06190cc2fa79@gmx.com> (raw)
In-Reply-To: <59536453-b5f5-9283-bd87-d61e3abf211f@mendix.com>

On 2019/1/3 上午8:57, Hans van Kranenburg wrote:
> Hi,
> 
> On 1/3/19 1:14 AM, Qu Wenruo wrote:
>>
>>
>> On 2019/1/3 上午12:21, David Sterba wrote:
>>> On Fri, Dec 28, 2018 at 05:28:13PM +0800, Qu Wenruo wrote:
>>>> On 2018/12/28 下午5:15, Nikolay Borisov wrote:
>>>>> On 28.12.18 г. 10:37 ч., Qu Wenruo wrote:
>>>>>> This patchset can be fetched from:
>>>>>> https://github.com/adam900710/linux/tree/bg_tree
>>>>>> Which is based on v4.20-rc1 tag.
>>>>>>
>>>>>> This patchset will hugely reduce mount time of large fs by putting all
>>>>>> block group items into its own tree.
> 
> Thanks a lot for writing a proof of concept for this! This is great.

Not PoC anymore. :)
It passes all tests without regression.

> 
>>>>>> The old behavior will try to read out all block group items at mount
>>>>>> time, however due to the key of block group items are scattered across
>>>>>> tons of extent items, we must call btrfs_search_slot() for each block
>>>>>> group.
>>>>>>
>>>>>> It works fine for small fs, but when number of block groups goes beyond
>>>>>> 200, such tree search will become a random read, causing obvious slow
>>>>>> down.
>>>>>>
>>>>>> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
>>>>>> put CHUNK_ITEMS into their own tree and package them next to each other.
>>>>>>
>>>>>>
>>>>>> Following this idea, we could do the same thing for block group items,
>>>>>> so instead of triggering btrfs_search_slot() for each block group, we
>>>>>> just call btrfs_next_item() and under most case we could finish in
>>>>>> memory, and hugely speed up mount (see BENCHMARK below).
> 
> +many, this is a usability "bug" that comes up regularly on mailing list
> and IRC, people asking why their filesystem takes long to mount.
> 
> I also have some filesystems that I have to set noauto in fstab, and
> then after booting mount manually, and then do some other manual tasks,
> because having them mount automatically during boot causes timeouts and
> stuff.
> 
>>>>>> The only disadvantage is, this method introduce an incompatible feature,
>>>>>> so existing fs can't use this feature directly.
>>>>>> Either specify it at mkfs time, or use btrfs-progs offline convert tool
>>>>>> (*).
>>>>>
>>>>> What if we start recording block group items in the chunk tree?
>>>>
>>>> Then chunk tree will be too hot.
>>>>
>>>> Currently chunk tree is pretty stable, only get modified at bg
>>>> creation/deletion time.
>>>>
>>>> Considering how important chunk tree is, I prefer to make chunk root as
>>>> cold as possible.
> 
> This makes sense.
> 
>>>> On the other hand, block group items are pretty hot (although less hot
>>>> compared to old extent tree), so it still makes sense to put them into
>>>> one tree, allow chunk tree to be as cold as ice, while keep block group
>>>> items relatively safe compared to old extent tree.
>>>
>>> A feature like this should come with an analysis of both approaches in
>>> advance. Both have pros and cons that we need to weigh. Eg. I'm not more
>>> for storing the items in an existing tree, possibly creating a new tree
>>> item that would pack the bg items together at the beginning of the tree.
>>>
>>> The update frequency of the tree is an aspect that I haven't considered
>>> before but I think it's a good point.
>>>
>>> The tree holding the bg items can be considered fundamental and requires
>>> a backup pointer in the superblock. So this would need more work.
>>
>> Right, for backup root it indeed makes sense.
> 
> I don't really understand why this backup roots mechanism keeps being a
> thing, because technically btrfs cannot guarantee at all that there will
> be anything usable left right after the old metadata extents are
> unpinned...?

It's a fail safe feature for btrfs-store or similar recovery method.

But on the other hand, for RO usage, we don't even need block group
items at all, just like my skip_bg mount option.

And for RW mount, if all other trees are OK, btrfs-check should be able
to rebuild it.

> 
>> However for another key type method, I don't really think there is any pro.
>>
>> To pack bg items together, new key type is needed anyway.
> 
> Not if the block group items are the only thing in the tree...?

That's just the method the patchset is using.

> 
>> With new key type, no matter where the new bg items are, older kernel
>> won't be compatible, thus still INCOMPAT feature.
>>
>> And for whatever the tree holding block group items, it will be as hot
>> as extent tree used to be, bring up the corruption possibility to the
>> whatever the existing is. Or slow down the tree.
>>
>> So at least from my respect of view, storing (new) bg items in existing
>> tree doesn't make sense.
>>
>> However I think we should put more discussion on the possible new block
>> group item structure design.
>> E.g. Remove chunk_objectid member, or even each block group has its own
>> tree.
> 
> Just thinking out loud...
> 
> It seems to me that keeping the same key type and btrfs_block_group_item
> struct and same key values as they have in extent tree would be
> desirable if both old and new code has to co-exist in the kernel.

Yes, that the main point of current implementation.

> 
> This is easy to review...
> 
> -	struct btrfs_root *root = fs_info->extent_root;
> +	struct btrfs_root *root;
> 
> [...]
> 
> +	if (btrfs_fs_incompat(fs_info, BG_TREE))
> +		root = fs_info->bg_root;
> +	else
> +		root = fs_info->extent_root;
> 
> ...but creating a new different struct and key type would cause much
> more invasive code changes and duplication (and bugs) all over the
> place, or wrappers to handle either scenario.
> 
> I mean, who cares about some unused chunk_objectid field on a multi-TiB
> filesystem...

Well, we have similar feature already, skinny_metadata reduces metadata
backref size, and it's now already mainlined.

> 
> I'd vote for doing things, and more "design for today".

Makes sense.

> Otherwise the
> same might happen that also happens with some other topics every time...
> it ends up with the idea to rewrite half btrfs and then in the end
> nothing happens at all, and the users are still unhappy. ;-)

Well, in that case I prefer to create another fs, with "better" design
from the very beginning.

Thanks,
Qu

> 
> Even when splitting the extent tree into multiple trees ever, it would
> still be a good idea to have this BG_TREE.
>