All of lore.kernel.org
 help / color / mirror / Atom feed
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Hans van Kranenburg <Hans.van.Kranenburg@mendix.com>,
	"dsterba@suse.cz" <dsterba@suse.cz>,
	Nikolay Borisov <nborisov@suse.com>, Qu Wenruo <wqu@suse.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time
Date: Thu, 3 Jan 2019 09:10:14 +0800	[thread overview]
Message-ID: <ee37d670-b5c1-2bb3-7533-06190cc2fa79@gmx.com> (raw)
In-Reply-To: <59536453-b5f5-9283-bd87-d61e3abf211f@mendix.com>



On 2019/1/3 上午8:57, Hans van Kranenburg wrote:
> Hi,
> 
> On 1/3/19 1:14 AM, Qu Wenruo wrote:
>>
>>
>> On 2019/1/3 上午12:21, David Sterba wrote:
>>> On Fri, Dec 28, 2018 at 05:28:13PM +0800, Qu Wenruo wrote:
>>>> On 2018/12/28 下午5:15, Nikolay Borisov wrote:
>>>>> On 28.12.18 г. 10:37 ч., Qu Wenruo wrote:
>>>>>> This patchset can be fetched from:
>>>>>> https://github.com/adam900710/linux/tree/bg_tree
>>>>>> Which is based on v4.20-rc1 tag.
>>>>>>
>>>>>> This patchset will hugely reduce mount time of large fs by putting all
>>>>>> block group items into its own tree.
> 
> Thanks a lot for writing a proof of concept for this! This is great.

Not PoC anymore. :)
It passes all tests without regression.

> 
>>>>>> The old behavior will try to read out all block group items at mount
>>>>>> time, however due to the key of block group items are scattered across
>>>>>> tons of extent items, we must call btrfs_search_slot() for each block
>>>>>> group.
>>>>>>
>>>>>> It works fine for small fs, but when number of block groups goes beyond
>>>>>> 200, such tree search will become a random read, causing obvious slow
>>>>>> down.
>>>>>>
>>>>>> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
>>>>>> put CHUNK_ITEMS into their own tree and package them next to each other.
>>>>>>
>>>>>>
>>>>>> Following this idea, we could do the same thing for block group items,
>>>>>> so instead of triggering btrfs_search_slot() for each block group, we
>>>>>> just call btrfs_next_item() and under most case we could finish in
>>>>>> memory, and hugely speed up mount (see BENCHMARK below).
> 
> +many, this is a usability "bug" that comes up regularly on mailing list
> and IRC, people asking why their filesystem takes long to mount.
> 
> I also have some filesystems that I have to set noauto in fstab, and
> then after booting mount manually, and then do some other manual tasks,
> because having them mount automatically during boot causes timeouts and
> stuff.
> 
>>>>>> The only disadvantage is, this method introduce an incompatible feature,
>>>>>> so existing fs can't use this feature directly.
>>>>>> Either specify it at mkfs time, or use btrfs-progs offline convert tool
>>>>>> (*).
>>>>>
>>>>> What if we start recording block group items in the chunk tree?
>>>>
>>>> Then chunk tree will be too hot.
>>>>
>>>> Currently chunk tree is pretty stable, only get modified at bg
>>>> creation/deletion time.
>>>>
>>>> Considering how important chunk tree is, I prefer to make chunk root as
>>>> cold as possible.
> 
> This makes sense.
> 
>>>> On the other hand, block group items are pretty hot (although less hot
>>>> compared to old extent tree), so it still makes sense to put them into
>>>> one tree, allow chunk tree to be as cold as ice, while keep block group
>>>> items relatively safe compared to old extent tree.
>>>
>>> A feature like this should come with an analysis of both approaches in
>>> advance. Both have pros and cons that we need to weigh. Eg. I'm not more
>>> for storing the items in an existing tree, possibly creating a new tree
>>> item that would pack the bg items together at the beginning of the tree.
>>>
>>> The update frequency of the tree is an aspect that I haven't considered
>>> before but I think it's a good point.
>>>
>>> The tree holding the bg items can be considered fundamental and requires
>>> a backup pointer in the superblock. So this would need more work.
>>
>> Right, for backup root it indeed makes sense.
> 
> I don't really understand why this backup roots mechanism keeps being a
> thing, because technically btrfs cannot guarantee at all that there will
> be anything usable left right after the old metadata extents are
> unpinned...?

It's a fail safe feature for btrfs-store or similar recovery method.

But on the other hand, for RO usage, we don't even need block group
items at all, just like my skip_bg mount option.

And for RW mount, if all other trees are OK, btrfs-check should be able
to rebuild it.

> 
>> However for another key type method, I don't really think there is any pro.
>>
>> To pack bg items together, new key type is needed anyway.
> 
> Not if the block group items are the only thing in the tree...?

That's just the method the patchset is using.

> 
>> With new key type, no matter where the new bg items are, older kernel
>> won't be compatible, thus still INCOMPAT feature.
>>
>> And for whatever the tree holding block group items, it will be as hot
>> as extent tree used to be, bring up the corruption possibility to the
>> whatever the existing is. Or slow down the tree.
>>
>> So at least from my respect of view, storing (new) bg items in existing
>> tree doesn't make sense.
>>
>> However I think we should put more discussion on the possible new block
>> group item structure design.
>> E.g. Remove chunk_objectid member, or even each block group has its own
>> tree.
> 
> Just thinking out loud...
> 
> It seems to me that keeping the same key type and btrfs_block_group_item
> struct and same key values as they have in extent tree would be
> desirable if both old and new code has to co-exist in the kernel.

Yes, that the main point of current implementation.

> 
> This is easy to review...
> 
> -	struct btrfs_root *root = fs_info->extent_root;
> +	struct btrfs_root *root;
> 
> [...]
> 
> +	if (btrfs_fs_incompat(fs_info, BG_TREE))
> +		root = fs_info->bg_root;
> +	else
> +		root = fs_info->extent_root;
> 
> ...but creating a new different struct and key type would cause much
> more invasive code changes and duplication (and bugs) all over the
> place, or wrappers to handle either scenario.
> 
> I mean, who cares about some unused chunk_objectid field on a multi-TiB
> filesystem...

Well, we have similar feature already, skinny_metadata reduces metadata
backref size, and it's now already mainlined.

> 
> I'd vote for doing things, and more "design for today".

Makes sense.

> Otherwise the
> same might happen that also happens with some other topics every time...
> it ends up with the idea to rewrite half btrfs and then in the end
> nothing happens at all, and the users are still unhappy. ;-)

Well, in that case I prefer to create another fs, with "better" design
from the very beginning.

Thanks,
Qu

> 
> Even when splitting the extent tree into multiple trees ever, it would
> still be a good idea to have this BG_TREE.
> 

      reply	other threads:[~2019-01-03  1:10 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-28  8:37 [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
2018-12-28  8:37 ` [PATCH RFC 1/2] btrfs: Refactor btrfs_read_block_groups() Qu Wenruo
2018-12-28  8:37 ` [PATCH RFC 2/2] btrfs: Introduce new incompat feature, BG_TREE Qu Wenruo
2018-12-28  9:15 ` [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time Nikolay Borisov
2018-12-28  9:28   ` Qu Wenruo
2019-01-02 16:21     ` David Sterba
2019-01-03  0:14       ` Qu Wenruo
2019-01-03  0:57         ` Hans van Kranenburg
2019-01-03  1:10           ` Qu Wenruo [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ee37d670-b5c1-2bb3-7533-06190cc2fa79@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=Hans.van.Kranenburg@mendix.com \
    --cc=dsterba@suse.cz \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=nborisov@suse.com \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.