From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Qu Wenruo <wqu@suse.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 0/2] Use new incompat feature BG_TREE to hugely reduce mount time
Date: Tue, 8 Jan 2019 15:34:21 +0800 [thread overview]
Message-ID: <33defd6f-8fff-17cb-aae1-bec3b875b8b5@gmx.com> (raw)
In-Reply-To: <20190102052945.16325-1-wqu@suse.com>
[-- Attachment #1.1: Type: text/plain, Size: 4573 bytes --]
Future proof benchmark for bg-tree.
The benchmark in the cover letter is in fact a pretty bad case for
original mode, with a lot of EXTENT_ITEMs bumping up extent tree height,
along with small node size to make searching extent tree super slow.
(Although that doesn't make any difference for bg-tree).
Now here is the best case scenario for original mode.
Just using plain fallocate to fill 12T, so every EXTENT_DATA will be at
its maximum size, causing minimal noise for block group items iteration.
For short conclusion:
Bg-tree still faster than best case original extent tree, by around 30%.
Bg-tree should be fast enough to mount the 12T filled fs less than *ONE*
*SECOND*.
And due to the fact that bg-tree doesn't care how crowded the extent
tree is, its block group iteration time is just O(N).
So for 12T filled fs, bg-tree should always mount around the same speed,
no matter how many extents are.
For full details, please check the sheet:
https://docs.google.com/spreadsheets/d/1YfZXiGoL9EoPcWGh0x1gIldS1ymsPgqZFQdngM-Iep0/edit?usp=sharing
Thanks,
Qu
On 2019/1/2 下午1:29, Qu Wenruo wrote:
> This patchset can be fetched from:
> https://github.com/adam900710/linux/tree/bg_tree
> Which is based on v4.20-rc1 tag.
>
> This patchset will hugely reduce mount time of large fs by putting all
> block group items into its own tree.
>
> The old behavior will try to read out all block group items at mount
> time, however due to the key of block group items are scattered across
> tons of extent items, we must call btrfs_search_slot() for each block
> group.
>
> It works fine for small fs, but when number of block groups goes beyond
> 200, such tree search will become a random read, causing obvious slow
> down.
>
> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
> put CHUNK_ITEMS into their own tree and package them next to each other.
>
>
> Following this idea, we could do the same thing for block group items,
> so instead of triggering btrfs_search_slot() for each block group, we
> just call btrfs_next_item() and under most case we could finish in
> memory, and hugely speed up mount (see BENCHMARK below).
>
> The only disadvantage is, this method introduce an incompatible feature,
> so existing fs can't use this feature directly.
> Either specify it at mkfs time, or use btrfs-progs offline convert tool
> (*).
>
> *: Mkfs and convert tool are doing the same work, however I haven't
> decide if I should put this feature to btrfstune.
>
> [[Benchmark]]
> Physical device: HDD (7200RPM)
> Nodesize: 4K (to bump up tree height)
> Used size: 250G
> Total size: 500G
> Extent data size: 1M
>
> All file extents on disk is in 1M size, ensured by using fallocate.
>
> Without patchset:
> Use ftrace function graph:
>
> 3) | open_ctree [btrfs]() {
> 3) | btrfs_read_chunk_tree [btrfs]() {
> 3) * 69033.31 us | }
> 3) | btrfs_verify_dev_extents [btrfs]() {
> 3) * 90376.15 us | }
> 3) | btrfs_read_block_groups [btrfs]() {
> 2) $ 2733853 us | } /* btrfs_read_block_groups [btrfs] */
> 2) $ 3168384 us | } /* open_ctree [btrfs] */
>
> btrfs_read_block_groups() takes 87% of the total mount time,
>
> With patchset, and use -O bg-tree mkfs option:
> 7) | open_ctree [btrfs]() {
> 7) | btrfs_read_chunk_tree [btrfs]() {
> 7) # 2448.562 us | }
> 7) | btrfs_verify_dev_extents [btrfs]() {
> 7) * 19802.02 us | }
> 7) | btrfs_read_block_groups [btrfs]() {
> 7) # 8610.397 us | }
> 7) @ 113498.6 us | }
>
> open_ctree() time is only 3% of original mount time.
> And btrfs_read_block_groups() only takes 7.6% of total open_ctree()
> execution time.
>
> Changelog:
> RFC->v1:
> - Fix memory leak for fs_info->bg_root at module unload time.
> - Add sysfs features interface.
> - Testing shows no regression, so no RFC tag now.
>
> Qu Wenruo (2):
> btrfs: Refactor btrfs_read_block_groups()
> btrfs: Introduce new incompat feature, BG_TREE
>
> fs/btrfs/ctree.h | 5 +-
> fs/btrfs/disk-io.c | 13 ++
> fs/btrfs/extent-tree.c | 300 ++++++++++++++++++++------------
> fs/btrfs/sysfs.c | 2 +
> include/uapi/linux/btrfs.h | 1 +
> include/uapi/linux/btrfs_tree.h | 3 +
> 6 files changed, 208 insertions(+), 116 deletions(-)
>
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
next prev parent reply other threads:[~2019-01-08 7:34 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-01-02 5:29 [PATCH 0/2] Use new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
2019-01-02 5:29 ` [PATCH 1/2] btrfs: Refactor btrfs_read_block_groups() Qu Wenruo
2019-01-02 5:29 ` [PATCH 2/2] btrfs: Introduce new incompat feature, BG_TREE Qu Wenruo
2019-01-08 7:34 ` Qu Wenruo [this message]
2019-02-25 13:02 ` [PATCH 0/2] Use new incompat feature BG_TREE to hugely reduce mount time Hans van Kranenburg
2019-02-25 13:10 ` Qu Wenruo
2019-02-25 13:16 ` Hans van Kranenburg
2019-02-25 14:46 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=33defd6f-8fff-17cb-aae1-bec3b875b8b5@gmx.com \
--to=quwenruo.btrfs@gmx.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).