Future-proof benchmark for bg-tree.

The benchmark in the cover letter is in fact a pretty bad case for the
original mode: a lot of EXTENT_ITEMs bump up the extent tree height,
and the small node size makes searching the extent tree super slow.
(Although that makes no difference for bg-tree.)

So here is the best-case scenario for the original mode: just use plain
fallocate to fill 12T, so every EXTENT_DATA item is at its maximum
size, causing minimal noise for block group item iteration.

The short conclusion:

- Bg-tree is still faster than the best-case original extent tree, by
  around 30%.
- Bg-tree should be fast enough to mount the 12T-filled fs in less than
  *ONE* *SECOND*.
- Since bg-tree doesn't care how crowded the extent tree is, its block
  group iteration time is just O(N) (see the sketch below). So a
  12T-filled fs should always mount at around the same speed, no matter
  how many extents there are.

For full details, please check the sheet:
https://docs.google.com/spreadsheets/d/1YfZXiGoL9EoPcWGh0x1gIldS1ymsPgqZFQdngM-Iep0/edit?usp=sharing
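To make the O(N) claim concrete, here is a minimal sketch of the
bg-tree read loop, assuming the dedicated fs_info->bg_root introduced
by this patchset. This is illustrative only, not the actual patch
code; error handling, edge cases and the in-memory cache setup are
omitted:

/*
 * Minimal sketch of the bg-tree read loop, assuming the dedicated
 * fs_info->bg_root from this patchset.  Not the actual patch code.
 */
static int read_bg_tree_sketch(struct btrfs_fs_info *fs_info)
{
	struct btrfs_root *root = fs_info->bg_root;
	struct btrfs_path *path;
	struct btrfs_key key = { 0 };	/* smallest possible key */
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	/* One descent to reach the first item of the tree... */
	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
	if (ret < 0)
		goto out;

	/*
	 * ...then just walk the packed items.  btrfs_next_item()
	 * normally stays inside the current leaf, so the whole pass
	 * is O(N) sequential work instead of N random tree descents.
	 */
	while (1) {
		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
		if (key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
			/* read the item, set up the block group cache */
		}
		ret = btrfs_next_item(root, path);
		if (ret)	/* > 0: no more items, < 0: error */
			break;
	}
	ret = ret > 0 ? 0 : ret;
out:
	btrfs_free_path(path);
	return ret;
}

Because all BLOCK_GROUP_ITEMs are packed next to each other, just like
CHUNK_ITEMs in the chunk tree, the walk touches the minimum number of
leaves no matter how large the extent tree grows.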
Thanks,
Qu

On 2019/1/2 1:29 PM, Qu Wenruo wrote:
> This patchset can be fetched from:
> https://github.com/adam900710/linux/tree/bg_tree
> which is based on the v4.20-rc1 tag.
>
> This patchset hugely reduces the mount time of a large fs by putting
> all block group items into their own tree.
>
> The old behavior tries to read out all block group items at mount
> time. However, since the keys of block group items are scattered
> across tons of extent items, we must call btrfs_search_slot() for
> each block group.
>
> That works fine for a small fs, but when the number of block groups
> goes beyond 200, such tree searches become random reads, causing an
> obvious slowdown.
>
> On the other hand, btrfs_read_chunk_tree() is still very fast, since
> we put CHUNK_ITEMs into their own tree and pack them next to each
> other.
>
> Following this idea, we can do the same thing for block group items:
> instead of triggering btrfs_search_slot() for each block group, we
> just call btrfs_next_item(), and in most cases we can finish in
> memory, which hugely speeds up mount (see BENCHMARK below).
>
> The only disadvantage is that this method introduces an incompatible
> feature, so an existing fs can't use this feature directly.
> Either specify it at mkfs time, or use the btrfs-progs offline
> convert tool (*).
>
> *: Mkfs and the convert tool do the same work; however, I haven't
> decided yet whether I should put this feature into btrfstune.
>
> [[Benchmark]]
> Physical device: HDD (7200 RPM)
> Nodesize: 4K (to bump up tree height)
> Used size: 250G
> Total size: 500G
> Extent data size: 1M
>
> All file extents on disk are 1M in size, ensured by using fallocate.
>
> Without the patchset, using the ftrace function graph:
>
>  3)               |  open_ctree [btrfs]() {
>  3)               |    btrfs_read_chunk_tree [btrfs]() {
>  3) * 69033.31 us |    }
>  3)               |    btrfs_verify_dev_extents [btrfs]() {
>  3) * 90376.15 us |    }
>  3)               |    btrfs_read_block_groups [btrfs]() {
>  2) $ 2733853 us  |    } /* btrfs_read_block_groups [btrfs] */
>  2) $ 3168384 us  |  } /* open_ctree [btrfs] */
>
> btrfs_read_block_groups() takes 87% of the total mount time.
>
> With the patchset, using the -O bg-tree mkfs option:
>
>  7)               |  open_ctree [btrfs]() {
>  7)               |    btrfs_read_chunk_tree [btrfs]() {
>  7) # 2448.562 us |    }
>  7)               |    btrfs_verify_dev_extents [btrfs]() {
>  7) * 19802.02 us |    }
>  7)               |    btrfs_read_block_groups [btrfs]() {
>  7) # 8610.397 us |    }
>  7) @ 113498.6 us |  }
>
> open_ctree() time is now only 3% of the original mount time, and
> btrfs_read_block_groups() only takes 7.6% of the total open_ctree()
> execution time.
>
> Changelog:
> RFC->v1:
> - Fix memory leak of fs_info->bg_root at module unload time.
> - Add sysfs features interface.
> - Testing shows no regression, so no RFC tag now.
>
> Qu Wenruo (2):
>   btrfs: Refactor btrfs_read_block_groups()
>   btrfs: Introduce new incompat feature, BG_TREE
>
>  fs/btrfs/ctree.h                |   5 +-
>  fs/btrfs/disk-io.c              |  13 ++
>  fs/btrfs/extent-tree.c          | 300 ++++++++++++++++++++------------
>  fs/btrfs/sysfs.c                |   2 +
>  include/uapi/linux/btrfs.h      |   1 +
>  include/uapi/linux/btrfs_tree.h |   3 +
>  6 files changed, 208 insertions(+), 116 deletions(-)
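For contrast with the explanation quoted above, the pre-patch read
pattern looks roughly like the sketch below: one full tree descent
against the extent root per block group. This is a simplified
illustration, not the real kernel code (which lives around
find_first_block_group() in fs/btrfs/extent-tree.c); item reading and
error details are omitted:

/*
 * Simplified sketch of the pre-patch pattern: every block group costs
 * a full btrfs_search_slot() against the extent root, because the
 * BLOCK_GROUP_ITEM keys are scattered between the EXTENT_ITEM keys.
 */
static int read_block_groups_old_sketch(struct btrfs_fs_info *fs_info)
{
	struct btrfs_root *root = fs_info->extent_root;
	struct btrfs_path *path;
	struct btrfs_key key = {
		.objectid = 0,
		.type = BTRFS_BLOCK_GROUP_ITEM_KEY,
		.offset = 0,
	};
	struct btrfs_key found;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	while (1) {
		/*
		 * Full descent from the extent tree root for each block
		 * group.  Once the fs has a few hundred block groups,
		 * each descent is effectively a random read.
		 */
		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
		if (ret < 0)
			goto out;

		/* scan forward to the next BLOCK_GROUP_ITEM */
		while (1) {
			struct extent_buffer *leaf = path->nodes[0];

			if (path->slots[0] >= btrfs_header_nritems(leaf)) {
				ret = btrfs_next_leaf(root, path);
				if (ret)	/* done, or error */
					goto out;
				continue;
			}
			btrfs_item_key_to_cpu(leaf, &found, path->slots[0]);
			if (found.type == BTRFS_BLOCK_GROUP_ITEM_KEY)
				break;
			path->slots[0]++;
		}

		/* read the item, set up the block group cache... */

		/* ...and restart the search right after this block group */
		key.objectid = found.objectid + found.offset;
		btrfs_release_path(path);
	}
out:
	ret = ret > 0 ? 0 : ret;
	btrfs_free_path(path);
	return ret;
}

Each btrfs_search_slot() here starts from the extent tree root, so on
a cold cache every block group costs at least one random read per tree
level, which is what makes btrfs_read_block_groups() dominate the
mount time in the benchmark above.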