* [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time
@ 2018-12-28  8:37 Qu Wenruo
  2018-12-28  8:37 ` [PATCH RFC 1/2] btrfs: Refactor btrfs_read_block_groups() Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Qu Wenruo @ 2018-12-28  8:37 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from:
https://github.com/adam900710/linux/tree/bg_tree
Which is based on v4.20-rc1 tag.

This patchset hugely reduces the mount time of a large fs by putting all
block group items into their own tree.

The old behavior tries to read out all block group items at mount
time; however, because the keys of the block group items are scattered
across tons of extent items, we must call btrfs_search_slot() for each
block group.

This works fine for a small fs, but when the number of block groups goes
beyond 200, each tree search becomes a random read, causing an obvious
slowdown.

On the other hand, btrfs_read_chunk_tree() is still very fast, since we
put CHUNK_ITEMs into their own tree, packed next to each other.


Following this idea, we can do the same thing for block group items:
instead of triggering a btrfs_search_slot() for each block group, we
just call btrfs_next_item(), which in most cases can be satisfied from
memory, hugely speeding up mount (see the benchmark below).

The only disadvantage is that this method introduces an incompat
feature, so an existing fs can't use it directly.
Either specify it at mkfs time, or use the btrfs-progs offline convert
tool (*).

*: Mkfs and the convert tool do the same work; however, I haven't
decided yet whether this feature should also go into btrfstune.

The RFC tag is there because comprehensive tests and the sysfs interface
are still missing.  At least during my fill test it definitely works fine.

[[Benchmark]]
Physical device:	HDD (7200RPM)
Nodesize:		4K  (to bump up tree height)
Used size:		250G
Total size:		500G
Extent data size:	1M

All file extents on disk are 1M in size, ensured by using fallocate.
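
The actual fill script isn't included here, so the following is only an
assumption about the method: a minimal user-space sketch that fallocates
a large number of 1MiB files, so that every data extent is exactly 1MiB
and the extent tree gets crowded with extent items.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define EXTENT_SIZE (1024 * 1024)	/* one 1MiB extent per file */

	int main(int argc, char **argv)
	{
		long i, nr_files = argc > 1 ? atol(argv[1]) : 1024;
		char name[64];

		for (i = 0; i < nr_files; i++) {
			snprintf(name, sizeof(name), "file.%ld", i);
			int fd = open(name, O_CREAT | O_WRONLY, 0644);

			if (fd < 0) {
				perror("open");
				return 1;
			}
			/* preallocate exactly one 1MiB extent for this file */
			if (fallocate(fd, 0, 0, EXTENT_SIZE)) {
				perror("fallocate");
				close(fd);
				return 1;
			}
			close(fd);
		}
		return 0;
	}

Filling 250G this way means roughly 256000 files (250G / 1M), which is
what scatters the block group items among so many extent items.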

Without patchset:
Use ftrace function graph:

  3)               |  open_ctree [btrfs]() {
  3)               |    btrfs_read_chunk_tree [btrfs]() {
  3) * 69033.31 us |    }
  3)               |    btrfs_verify_dev_extents [btrfs]() {
  3) * 90376.15 us |    }
  3)               |    btrfs_read_block_groups [btrfs]() {
  2) $ 2733853 us  |    } /* btrfs_read_block_groups [btrfs] */
  2) $ 3168384 us  |  } /* open_ctree [btrfs] */

 btrfs_read_block_groups() takes 87% of the total mount time.

With the patchset, using the -O bg-tree mkfs option:
  7)               |  open_ctree [btrfs]() {
  7)               |    btrfs_read_chunk_tree [btrfs]() {
  7) # 2448.562 us |    }
  7)               |    btrfs_verify_dev_extents [btrfs]() {
  7) * 19802.02 us |    }
  7)               |    btrfs_read_block_groups [btrfs]() {
  7) # 8610.397 us |    }
  7) @ 113498.6 us |  }

  open_ctree() time is now only 3% of the original mount time,
  and btrfs_read_block_groups() takes only 7.6% of the total open_ctree()
  execution time.



Qu Wenruo (2):
  btrfs: Refactor btrfs_read_block_groups()
  btrfs: Introduce new incompat feature, BG_TREE

 fs/btrfs/ctree.h                |   5 +-
 fs/btrfs/disk-io.c              |  13 ++
 fs/btrfs/extent-tree.c          | 300 ++++++++++++++++++++------------
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |   3 +
 5 files changed, 206 insertions(+), 116 deletions(-)

-- 
2.20.1



* [PATCH RFC 1/2] btrfs: Refactor btrfs_read_block_groups()
  2018-12-28  8:37 [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
@ 2018-12-28  8:37 ` Qu Wenruo
  2018-12-28  8:37 ` [PATCH RFC 2/2] btrfs: Introduce new incompat feature, BG_TREE Qu Wenruo
  2018-12-28  9:15 ` [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time Nikolay Borisov
  2 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2018-12-28  8:37 UTC (permalink / raw)
  To: linux-btrfs

Refactor the work inside the loop of btrfs_read_block_groups() into a
separate function, read_one_block_group().

This allows read_one_block_group() to be reused by the later BG_TREE feature.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-tree.c | 207 ++++++++++++++++++++---------------------
 1 file changed, 103 insertions(+), 104 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a1febf155747..367b1a6cce60 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9948,6 +9948,105 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
+static int read_one_block_group(struct btrfs_fs_info *info,
+				struct btrfs_path *path,
+				int need_clear)
+{
+	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_block_group_cache *cache;
+	struct btrfs_space_info *space_info;
+	struct btrfs_key key;
+	int mixed = btrfs_fs_incompat(info, MIXED_GROUPS);
+	int slot = path->slots[0];
+	int ret;
+
+	btrfs_item_key_to_cpu(leaf, &key, slot);
+	ASSERT(key.type == BTRFS_BLOCK_GROUP_ITEM_KEY);
+
+	cache = btrfs_create_block_group_cache(info, key.objectid,
+					       key.offset);
+	if (!cache)
+		return -ENOMEM;
+
+	if (need_clear) {
+		/*
+		 * When we mount with old space cache, we need to
+		 * set BTRFS_DC_CLEAR and set dirty flag.
+		 *
+		 * a) Setting 'BTRFS_DC_CLEAR' makes sure that we
+		 *    truncate the old free space cache inode and
+		 *    setup a new one.
+		 * b) Setting 'dirty flag' makes sure that we flush
+		 *    the new space cache info onto disk.
+		 */
+		if (btrfs_test_opt(info, SPACE_CACHE))
+			cache->disk_cache_state = BTRFS_DC_CLEAR;
+	}
+	read_extent_buffer(leaf, &cache->item,
+			   btrfs_item_ptr_offset(leaf, slot),
+			   sizeof(cache->item));
+	cache->flags = btrfs_block_group_flags(&cache->item);
+	if (!mixed && ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
+	    (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
+			btrfs_err(info,
+"bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
+				  cache->key.objectid);
+			ret = -EINVAL;
+			goto error;
+	}
+	ret = exclude_super_stripes(cache);
+	if (ret) {
+		/*
+		 * We may have excluded something, so call this just in
+		 * case.
+		 */
+		free_excluded_extents(cache);
+		goto error;
+	}
+
+	/*
+	 * check for two cases, either we are full, and therefore
+	 * don't need to bother with the caching work since we won't
+	 * find any space, or we are empty, and we can just add all
+	 * the space in and be done with it.  This saves us _alot_ of
+	 * time, particularly in the full case.
+	 */
+	if (key.offset == btrfs_block_group_used(&cache->item)) {
+		cache->last_byte_to_unpin = (u64)-1;
+		cache->cached = BTRFS_CACHE_FINISHED;
+		free_excluded_extents(cache);
+	} else if (btrfs_block_group_used(&cache->item) == 0) {
+		cache->last_byte_to_unpin = (u64)-1;
+		cache->cached = BTRFS_CACHE_FINISHED;
+		add_new_free_space(cache, key.objectid,
+				   key.objectid + key.offset);
+		free_excluded_extents(cache);
+	}
+	ret = btrfs_add_block_group_cache(info, cache);
+	if (ret) {
+		btrfs_remove_free_space_cache(cache);
+		goto error;
+	}
+	trace_btrfs_add_block_group(info, cache, 0);
+	update_space_info(info, cache->flags, key.offset,
+			  btrfs_block_group_used(&cache->item),
+			  cache->bytes_super, &space_info);
+
+	cache->space_info = space_info;
+	link_block_group(cache);
+	set_avail_alloc_bits(info, cache->flags);
+	if (btrfs_chunk_readonly(info, cache->key.objectid)) {
+		inc_block_group_ro(cache, 1);
+	} else if (btrfs_block_group_used(&cache->item) == 0) {
+		ASSERT(list_empty(&cache->bg_list));
+		btrfs_mark_bg_unused(cache);
+	}
+	return 0;
+error:
+	btrfs_put_block_group(cache);
+	return ret;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -9955,15 +10054,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	struct btrfs_block_group_cache *cache;
 	struct btrfs_space_info *space_info;
 	struct btrfs_key key;
-	struct btrfs_key found_key;
-	struct extent_buffer *leaf;
 	int need_clear = 0;
 	u64 cache_gen;
-	u64 feature;
-	int mixed;
-
-	feature = btrfs_super_incompat_flags(info->super_copy);
-	mixed = !!(feature & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS);
 
 	key.objectid = 0;
 	key.offset = 0;
@@ -9987,107 +10079,14 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		if (ret != 0)
 			goto error;
 
-		leaf = path->nodes[0];
-		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
 
-		cache = btrfs_create_block_group_cache(info, found_key.objectid,
-						       found_key.offset);
-		if (!cache) {
-			ret = -ENOMEM;
-			goto error;
-		}
-
-		if (need_clear) {
-			/*
-			 * When we mount with old space cache, we need to
-			 * set BTRFS_DC_CLEAR and set dirty flag.
-			 *
-			 * a) Setting 'BTRFS_DC_CLEAR' makes sure that we
-			 *    truncate the old free space cache inode and
-			 *    setup a new one.
-			 * b) Setting 'dirty flag' makes sure that we flush
-			 *    the new space cache info onto disk.
-			 */
-			if (btrfs_test_opt(info, SPACE_CACHE))
-				cache->disk_cache_state = BTRFS_DC_CLEAR;
-		}
-
-		read_extent_buffer(leaf, &cache->item,
-				   btrfs_item_ptr_offset(leaf, path->slots[0]),
-				   sizeof(cache->item));
-		cache->flags = btrfs_block_group_flags(&cache->item);
-		if (!mixed &&
-		    ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
-		    (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
-			btrfs_err(info,
-"bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
-				  cache->key.objectid);
-			ret = -EINVAL;
+		ret = read_one_block_group(info, path, need_clear);
+		if (ret < 0)
 			goto error;
-		}
 
-		key.objectid = found_key.objectid + found_key.offset;
+		key.objectid = key.objectid + key.offset;
 		btrfs_release_path(path);
-
-		/*
-		 * We need to exclude the super stripes now so that the space
-		 * info has super bytes accounted for, otherwise we'll think
-		 * we have more space than we actually do.
-		 */
-		ret = exclude_super_stripes(cache);
-		if (ret) {
-			/*
-			 * We may have excluded something, so call this just in
-			 * case.
-			 */
-			free_excluded_extents(cache);
-			btrfs_put_block_group(cache);
-			goto error;
-		}
-
-		/*
-		 * check for two cases, either we are full, and therefore
-		 * don't need to bother with the caching work since we won't
-		 * find any space, or we are empty, and we can just add all
-		 * the space in and be done with it.  This saves us _alot_ of
-		 * time, particularly in the full case.
-		 */
-		if (found_key.offset == btrfs_block_group_used(&cache->item)) {
-			cache->last_byte_to_unpin = (u64)-1;
-			cache->cached = BTRFS_CACHE_FINISHED;
-			free_excluded_extents(cache);
-		} else if (btrfs_block_group_used(&cache->item) == 0) {
-			cache->last_byte_to_unpin = (u64)-1;
-			cache->cached = BTRFS_CACHE_FINISHED;
-			add_new_free_space(cache, found_key.objectid,
-					   found_key.objectid +
-					   found_key.offset);
-			free_excluded_extents(cache);
-		}
-
-		ret = btrfs_add_block_group_cache(info, cache);
-		if (ret) {
-			btrfs_remove_free_space_cache(cache);
-			btrfs_put_block_group(cache);
-			goto error;
-		}
-
-		trace_btrfs_add_block_group(info, cache, 0);
-		update_space_info(info, cache->flags, found_key.offset,
-				  btrfs_block_group_used(&cache->item),
-				  cache->bytes_super, &space_info);
-
-		cache->space_info = space_info;
-
-		link_block_group(cache);
-
-		set_avail_alloc_bits(info, cache->flags);
-		if (btrfs_chunk_readonly(info, cache->key.objectid)) {
-			inc_block_group_ro(cache, 1);
-		} else if (btrfs_block_group_used(&cache->item) == 0) {
-			ASSERT(list_empty(&cache->bg_list));
-			btrfs_mark_bg_unused(cache);
-		}
 	}
 
 	list_for_each_entry_rcu(space_info, &info->space_info, list) {
-- 
2.20.1



* [PATCH RFC 2/2] btrfs: Introduce new incompat feature, BG_TREE
  2018-12-28  8:37 [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
  2018-12-28  8:37 ` [PATCH RFC 1/2] btrfs: Refactor btrfs_read_block_groups() Qu Wenruo
@ 2018-12-28  8:37 ` Qu Wenruo
  2018-12-28  9:15 ` [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time Nikolay Borisov
  2 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2018-12-28  8:37 UTC (permalink / raw)
  To: linux-btrfs

The overall idea of the new BG_TREE is pretty simple: put
BLOCK_GROUP_ITEMs into a separate tree.

This brings one obvious enhancement:
- Reduced mount time for large filesystems

However, this feature doesn't accept mixed block group item locations
(some in the extent tree, some in the bg tree); this is designed so the
kernel has less extra behavior to handle, and thus fewer bugs.

For an existing fs to take advantage of this feature, btrfs-progs will
provide an offline convert option.

[[Benchmark]]
Physical device:	HDD (7200RPM)
Nodesize:		4K  (to bump up tree height)
Used size:		250G
Total size:		500G
Extent data size:	1M

All file extents on disk are 1M in size, ensured by using fallocate.

Without patchset:
Use ftrace function graph:

  3)               |  open_ctree [btrfs]() {
  3)               |    btrfs_read_chunk_tree [btrfs]() {
  3) * 69033.31 us |    }
  3)               |    btrfs_verify_dev_extents [btrfs]() {
  3) * 90376.15 us |    }
  3)               |    btrfs_read_block_groups [btrfs]() {
  2) $ 2733853 us  |    } /* btrfs_read_block_groups [btrfs] */
  2) $ 3168384 us  |  } /* open_ctree [btrfs] */

 btrfs_read_block_groups() takes 87% of the total mount time.

With the patchset, using the -O bg-tree mkfs option:
  7)               |  open_ctree [btrfs]() {
  7)               |    btrfs_read_chunk_tree [btrfs]() {
  7) # 2448.562 us |    }
  7)               |    btrfs_verify_dev_extents [btrfs]() {
  7) * 19802.02 us |    }
  7)               |    btrfs_read_block_groups [btrfs]() {
  7) # 8610.397 us |    }
  7) @ 113498.6 us |  }

  open_ctree() time is now only 3% of the original mount time,
  and btrfs_read_block_groups() takes only 7.6% of the total open_ctree()
  execution time.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h                |   5 +-
 fs/btrfs/disk-io.c              |  13 ++++
 fs/btrfs/extent-tree.c          | 103 +++++++++++++++++++++++++++-----
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |   3 +
 5 files changed, 108 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 80953528572d..8d27ce10e319 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -265,7 +265,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_RAID56 |		\
 	 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
-	 BTRFS_FEATURE_INCOMPAT_NO_HOLES)
+	 BTRFS_FEATURE_INCOMPAT_NO_HOLES |		\
+	 BTRFS_FEATURE_INCOMPAT_BG_TREE)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
@@ -758,6 +759,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *quota_root;
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *free_space_root;
+	struct btrfs_root *bg_root;
 
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
@@ -2957,6 +2959,7 @@ static inline void free_fs_info(struct btrfs_fs_info *fs_info)
 	kfree(fs_info->quota_root);
 	kfree(fs_info->uuid_root);
 	kfree(fs_info->free_space_root);
+	kfree(fs_info->bg_root);
 	kfree(fs_info->super_copy);
 	kfree(fs_info->super_for_commit);
 	security_free_mnt_opts(&fs_info->security_opts);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0ab41da91d1..b9986f1fcd7a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1568,6 +1568,8 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
 		return fs_info->free_space_root ? fs_info->free_space_root :
 						  ERR_PTR(-ENOENT);
+	if (location->objectid == BTRFS_BLOCK_GROUP_TREE_OBJECTID)
+		return fs_info->bg_root ? fs_info->bg_root : ERR_PTR(-ENOENT);
 again:
 	root = btrfs_lookup_fs_root(fs_info, location->objectid);
 	if (root) {
@@ -2065,6 +2067,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root)
 	if (chunk_root)
 		free_root_extent_buffers(info->chunk_root);
 	free_root_extent_buffers(info->free_space_root);
+	free_root_extent_buffers(info->bg_root);
 }
 
 void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info)
@@ -2336,6 +2339,16 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
 	fs_info->extent_root = root;
 
+	if (btrfs_fs_incompat(fs_info, BG_TREE)) {
+		location.objectid = BTRFS_BLOCK_GROUP_TREE_OBJECTID;
+		root = btrfs_read_tree_root(tree_root, &location);
+		if (IS_ERR(root)) {
+			ret = PTR_ERR(root);
+			goto out;
+		}
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->bg_root = root;
+	}
 	location.objectid = BTRFS_DEV_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 367b1a6cce60..32f71305fa49 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3298,11 +3298,15 @@ static int write_one_cache_group(struct btrfs_trans_handle *trans,
 				 struct btrfs_block_group_cache *cache)
 {
 	int ret;
-	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_root *root;
 	unsigned long bi;
 	struct extent_buffer *leaf;
 
-	ret = btrfs_search_slot(trans, extent_root, &cache->key, path, 0, 1);
+	if (btrfs_fs_incompat(fs_info, BG_TREE))
+		root = fs_info->bg_root;
+	else
+		root = fs_info->extent_root;
+	ret = btrfs_search_slot(trans, root, &cache->key, path, 0, 1);
 	if (ret) {
 		if (ret > 0)
 			ret = -ENOENT;
@@ -10047,6 +10051,56 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	return ret;
 }
 
+static int read_block_group_tree(struct btrfs_fs_info *info, int need_clear)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	int ret;
+
+	key.objectid = 0;
+	key.offset = 0;
+	key.type = 0;
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(NULL, info->bg_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	if (ret == 0) {
+		btrfs_err(info,
+			  "found invalid block group bytenr %llu len %llu",
+			  key.objectid, key.offset);
+		ret = -EUCLEAN;
+		goto out;
+	}
+
+	while (1) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (key.type != BTRFS_BLOCK_GROUP_ITEM_KEY) {
+			btrfs_err(info,
+		"found invalid key (%llu, %u, %llu) in block group tree",
+				  key.objectid, key.type, key.offset);
+			ret = -EUCLEAN;
+			goto out;
+		}
+
+		ret = read_one_block_group(info, path, need_clear);
+		if (ret < 0)
+			goto out;
+		ret = btrfs_next_item(info->bg_root, path);
+		if (ret < 0)
+			goto out;
+		if (ret > 0) {
+			ret = 0;
+			goto out;
+		}
+	}
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -10072,21 +10126,28 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	if (btrfs_test_opt(info, CLEAR_CACHE))
 		need_clear = 1;
 
-	while (1) {
-		ret = find_first_block_group(info, path, &key);
-		if (ret > 0)
-			break;
-		if (ret != 0)
+	if (btrfs_fs_incompat(info, BG_TREE)) {
+		ret = read_block_group_tree(info, need_clear);
+		if (ret < 0)
 			goto error;
+	} else {
+		while (1) {
+			ret = find_first_block_group(info, path, &key);
+			if (ret > 0)
+				break;
+			if (ret != 0)
+				goto error;
 
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+			btrfs_item_key_to_cpu(path->nodes[0], &key,
+					      path->slots[0]);
 
-		ret = read_one_block_group(info, path, need_clear);
-		if (ret < 0)
-			goto error;
+			ret = read_one_block_group(info, path, need_clear);
+			if (ret < 0)
+				goto error;
 
-		key.objectid = key.objectid + key.offset;
-		btrfs_release_path(path);
+			key.objectid = key.objectid + key.offset;
+			btrfs_release_path(path);
+		}
 	}
 
 	list_for_each_entry_rcu(space_info, &info->space_info, list) {
@@ -10123,7 +10184,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_block_group_cache *block_group;
-	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_root *root;
 	struct btrfs_block_group_item item;
 	struct btrfs_key key;
 	int ret = 0;
@@ -10131,6 +10192,11 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 	if (!trans->can_flush_pending_bgs)
 		return;
 
+	if (btrfs_fs_incompat(fs_info, BG_TREE))
+		root = fs_info->bg_root;
+	else
+		root = fs_info->extent_root;
+
 	while (!list_empty(&trans->new_bgs)) {
 		block_group = list_first_entry(&trans->new_bgs,
 					       struct btrfs_block_group_cache,
@@ -10143,7 +10209,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 		memcpy(&key, &block_group->key, sizeof(key));
 		spin_unlock(&block_group->lock);
 
-		ret = btrfs_insert_item(trans, extent_root, &key, &item,
+		ret = btrfs_insert_item(trans, root, &key, &item,
 					sizeof(item));
 		if (ret)
 			btrfs_abort_transaction(trans, ret);
@@ -10254,7 +10320,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     u64 group_start, struct extent_map *em)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_root *root;
 	struct btrfs_path *path;
 	struct btrfs_block_group_cache *block_group;
 	struct btrfs_free_cluster *cluster;
@@ -10272,6 +10338,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	BUG_ON(!block_group);
 	BUG_ON(!block_group->ro);
 
+	if (btrfs_fs_incompat(fs_info, BG_TREE))
+		root = fs_info->bg_root;
+	else
+		root = fs_info->extent_root;
+
 	trace_btrfs_remove_block_group(block_group);
 	/*
 	 * Free the reserved super bytes from this block group before
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 5ca1d21fc4a7..6e85ef49d97f 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -269,6 +269,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_RAID56		(1ULL << 7)
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_BG_TREE		(1ULL << 11)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index aff1356c2bb8..6b21a6700a9e 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -48,6 +48,9 @@
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* store BLOCK_GROUP_ITEMs in a separate tree */
+#define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
-- 
2.20.1



* Re: [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time
  2018-12-28  8:37 [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
  2018-12-28  8:37 ` [PATCH RFC 1/2] btrfs: Refactor btrfs_read_block_groups() Qu Wenruo
  2018-12-28  8:37 ` [PATCH RFC 2/2] btrfs: Introduce new incompat feature, BG_TREE Qu Wenruo
@ 2018-12-28  9:15 ` Nikolay Borisov
  2018-12-28  9:28   ` Qu Wenruo
  2 siblings, 1 reply; 9+ messages in thread
From: Nikolay Borisov @ 2018-12-28  9:15 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 28.12.18 10:37, Qu Wenruo wrote:
> This patchset can be fetched from:
> https://github.com/adam900710/linux/tree/bg_tree
> Which is based on v4.20-rc1 tag.
> 
> This patchset will hugely reduce mount time of large fs by putting all
> block group items into its own tree.
> 
> The old behavior will try to read out all block group items at mount
> time, however due to the key of block group items are scattered across
> tons of extent items, we must call btrfs_search_slot() for each block
> group.
> 
> It works fine for small fs, but when number of block groups goes beyond
> 200, such tree search will become a random read, causing obvious slow
> down.
> 
> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
> put CHUNK_ITEMS into their own tree and package them next to each other.
> 
> 
> Following this idea, we could do the same thing for block group items,
> so instead of triggering btrfs_search_slot() for each block group, we
> just call btrfs_next_item() and under most case we could finish in
> memory, and hugely speed up mount (see BENCHMARK below).
> 
> The only disadvantage is, this method introduce an incompatible feature,
> so existing fs can't use this feature directly.
> Either specify it at mkfs time, or use btrfs-progs offline convert tool
> (*).

What if we start recording block group items in the chunk tree? Would
that bring any benefit without creating yet another tree, though likely
we'd still have to use an incompat flag? That way shouldn't BLOCK_GROUP
items be chunked together?

> 
> *: Mkfs and convert tool are doing the same work, however I haven't
> decide if I should put this feature to btrfstune.
> 
> The RFC tag is for the comprehensive test and sysfs interface.
> At least during my filling test it definitely works fine.
> 
> [[Benchmark]]
> Physical device:	HDD (7200RPM)
> Nodesize:		4K  (to bump up tree height)
> Used size:		250G
> Total size:		500G
> Extent data size:	1M
> 
> All file extents on disk is in 1M size, ensured by using fallocate.
> 
> Without patchset:
> Use ftrace function graph:
> 
>   3)               |  open_ctree [btrfs]() {
>   3)               |    btrfs_read_chunk_tree [btrfs]() {
>   3) * 69033.31 us |    }
>   3)               |    btrfs_verify_dev_extents [btrfs]() {
>   3) * 90376.15 us |    }
>   3)               |    btrfs_read_block_groups [btrfs]() {
>   2) $ 2733853 us  |    } /* btrfs_read_block_groups [btrfs] */
>   2) $ 3168384 us  |  } /* open_ctree [btrfs] */
> 
>  btrfs_read_block_groups() takes 87% of the total mount time,
> 
> With patchset, and use -O bg-tree mkfs option:
>   7)               |  open_ctree [btrfs]() {
>   7)               |    btrfs_read_chunk_tree [btrfs]() {
>   7) # 2448.562 us |    }
>   7)               |    btrfs_verify_dev_extents [btrfs]() {
>   7) * 19802.02 us |    }
>   7)               |    btrfs_read_block_groups [btrfs]() {
>   7) # 8610.397 us |    }
>   7) @ 113498.6 us |  }
> 
>   open_ctree() time is only 3% of original mount time.
>   And btrfs_read_block_groups() only takes 7.6% of total open_ctree()
>   execution time.
> 
> 
> 
> Qu Wenruo (2):
>   btrfs: Refactor btrfs_read_block_groups()
>   btrfs: Introduce new incompat feature, BG_TREE
> 
>  fs/btrfs/ctree.h                |   5 +-
>  fs/btrfs/disk-io.c              |  13 ++
>  fs/btrfs/extent-tree.c          | 300 ++++++++++++++++++++------------
>  include/uapi/linux/btrfs.h      |   1 +
>  include/uapi/linux/btrfs_tree.h |   3 +
>  5 files changed, 206 insertions(+), 116 deletions(-)
> 


* Re: [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time
  2018-12-28  9:15 ` [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time Nikolay Borisov
@ 2018-12-28  9:28   ` Qu Wenruo
  2019-01-02 16:21     ` David Sterba
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2018-12-28  9:28 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2018/12/28 5:15 PM, Nikolay Borisov wrote:
> 
> 
> On 28.12.18 г. 10:37 ч., Qu Wenruo wrote:
>> This patchset can be fetched from:
>> https://github.com/adam900710/linux/tree/bg_tree
>> Which is based on v4.20-rc1 tag.
>>
>> This patchset will hugely reduce mount time of large fs by putting all
>> block group items into its own tree.
>>
>> The old behavior will try to read out all block group items at mount
>> time, however due to the key of block group items are scattered across
>> tons of extent items, we must call btrfs_search_slot() for each block
>> group.
>>
>> It works fine for small fs, but when number of block groups goes beyond
>> 200, such tree search will become a random read, causing obvious slow
>> down.
>>
>> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
>> put CHUNK_ITEMS into their own tree and package them next to each other.
>>
>>
>> Following this idea, we could do the same thing for block group items,
>> so instead of triggering btrfs_search_slot() for each block group, we
>> just call btrfs_next_item() and under most case we could finish in
>> memory, and hugely speed up mount (see BENCHMARK below).
>>
>> The only disadvantage is, this method introduce an incompatible feature,
>> so existing fs can't use this feature directly.
>> Either specify it at mkfs time, or use btrfs-progs offline convert tool
>> (*).
> 
> What if we start recording block group items in the chunk tree?

Then the chunk tree would become too hot.

Currently the chunk tree is pretty stable; it only gets modified at bg
creation/deletion time.

Considering how important the chunk tree is, I prefer to keep the chunk
root as cold as possible.

On the other hand, block group items are pretty hot (although less hot
than the old extent tree), so it still makes sense to put them into one
tree, allowing the chunk tree to stay as cold as ice while keeping the
block group items relatively safe compared to the old extent tree.


Thanks,
Qu

> Would
> that bring any benefit without creating yet another tree, though likely
> we'd still have to use an incompat flag? That way shouldn't BLOCK_GROUP
> items be chunked together?
> 
>>
>> *: Mkfs and convert tool are doing the same work, however I haven't
>> decide if I should put this feature to btrfstune.
>>
>> The RFC tag is for the comprehensive test and sysfs interface.
>> At least during my filling test it definitely works fine.
>>
>> [[Benchmark]]
>> Physical device:	HDD (7200RPM)
>> Nodesize:		4K  (to bump up tree height)
>> Used size:		250G
>> Total size:		500G
>> Extent data size:	1M
>>
>> All file extents on disk is in 1M size, ensured by using fallocate.
>>
>> Without patchset:
>> Use ftrace function graph:
>>
>>   3)               |  open_ctree [btrfs]() {
>>   3)               |    btrfs_read_chunk_tree [btrfs]() {
>>   3) * 69033.31 us |    }
>>   3)               |    btrfs_verify_dev_extents [btrfs]() {
>>   3) * 90376.15 us |    }
>>   3)               |    btrfs_read_block_groups [btrfs]() {
>>   2) $ 2733853 us  |    } /* btrfs_read_block_groups [btrfs] */
>>   2) $ 3168384 us  |  } /* open_ctree [btrfs] */
>>
>>  btrfs_read_block_groups() takes 87% of the total mount time,
>>
>> With patchset, and use -O bg-tree mkfs option:
>>   7)               |  open_ctree [btrfs]() {
>>   7)               |    btrfs_read_chunk_tree [btrfs]() {
>>   7) # 2448.562 us |    }
>>   7)               |    btrfs_verify_dev_extents [btrfs]() {
>>   7) * 19802.02 us |    }
>>   7)               |    btrfs_read_block_groups [btrfs]() {
>>   7) # 8610.397 us |    }
>>   7) @ 113498.6 us |  }
>>
>>   open_ctree() time is only 3% of original mount time.
>>   And btrfs_read_block_groups() only takes 7.6% of total open_ctree()
>>   execution time.
>>
>>
>>
>> Qu Wenruo (2):
>>   btrfs: Refactor btrfs_read_block_groups()
>>   btrfs: Introduce new incompat feature, BG_TREE
>>
>>  fs/btrfs/ctree.h                |   5 +-
>>  fs/btrfs/disk-io.c              |  13 ++
>>  fs/btrfs/extent-tree.c          | 300 ++++++++++++++++++++------------
>>  include/uapi/linux/btrfs.h      |   1 +
>>  include/uapi/linux/btrfs_tree.h |   3 +
>>  5 files changed, 206 insertions(+), 116 deletions(-)
>>


* Re: [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time
  2018-12-28  9:28   ` Qu Wenruo
@ 2019-01-02 16:21     ` David Sterba
  2019-01-03  0:14       ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: David Sterba @ 2019-01-02 16:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Nikolay Borisov, Qu Wenruo, linux-btrfs

On Fri, Dec 28, 2018 at 05:28:13PM +0800, Qu Wenruo wrote:
> On 2018/12/28 下午5:15, Nikolay Borisov wrote:
> > On 28.12.18 г. 10:37 ч., Qu Wenruo wrote:
> >> This patchset can be fetched from:
> >> https://github.com/adam900710/linux/tree/bg_tree
> >> Which is based on v4.20-rc1 tag.
> >>
> >> This patchset will hugely reduce mount time of large fs by putting all
> >> block group items into its own tree.
> >>
> >> The old behavior will try to read out all block group items at mount
> >> time, however due to the key of block group items are scattered across
> >> tons of extent items, we must call btrfs_search_slot() for each block
> >> group.
> >>
> >> It works fine for small fs, but when number of block groups goes beyond
> >> 200, such tree search will become a random read, causing obvious slow
> >> down.
> >>
> >> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
> >> put CHUNK_ITEMS into their own tree and package them next to each other.
> >>
> >>
> >> Following this idea, we could do the same thing for block group items,
> >> so instead of triggering btrfs_search_slot() for each block group, we
> >> just call btrfs_next_item() and under most case we could finish in
> >> memory, and hugely speed up mount (see BENCHMARK below).
> >>
> >> The only disadvantage is, this method introduce an incompatible feature,
> >> so existing fs can't use this feature directly.
> >> Either specify it at mkfs time, or use btrfs-progs offline convert tool
> >> (*).
> > 
> > What if we start recording block group items in the chunk tree?
> 
> Then chunk tree will be too hot.
> 
> Currently chunk tree is pretty stable, only get modified at bg
> creation/deletion time.
> 
> Considering how important chunk tree is, I prefer to make chunk root as
> cold as possible.
> 
> On the other hand, block group items are pretty hot (although less hot
> compared to old extent tree), so it still makes sense to put them into
> one tree, allow chunk tree to be as cold as ice, while keep block group
> items relatively safe compared to old extent tree.

A feature like this should come with an analysis of both approaches in
advance. Both have pros and cons that we need to weigh. E.g. I'm now more
for storing the items in an existing tree, possibly creating a new tree
item that would pack the bg items together at the beginning of the tree.

The update frequency of the tree is an aspect that I haven't considered
before but I think it's a good point.

The tree holding the bg items can be considered fundamental and requires
a backup pointer in the superblock. So this would need more work.


* Re: [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time
  2019-01-02 16:21     ` David Sterba
@ 2019-01-03  0:14       ` Qu Wenruo
  2019-01-03  0:57         ` Hans van Kranenburg
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2019-01-03  0:14 UTC (permalink / raw)
  To: dsterba, Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2019/1/3 12:21 AM, David Sterba wrote:
> On Fri, Dec 28, 2018 at 05:28:13PM +0800, Qu Wenruo wrote:
>> On 2018/12/28 下午5:15, Nikolay Borisov wrote:
>>> On 28.12.18 г. 10:37 ч., Qu Wenruo wrote:
>>>> This patchset can be fetched from:
>>>> https://github.com/adam900710/linux/tree/bg_tree
>>>> Which is based on v4.20-rc1 tag.
>>>>
>>>> This patchset will hugely reduce mount time of large fs by putting all
>>>> block group items into its own tree.
>>>>
>>>> The old behavior will try to read out all block group items at mount
>>>> time, however due to the key of block group items are scattered across
>>>> tons of extent items, we must call btrfs_search_slot() for each block
>>>> group.
>>>>
>>>> It works fine for small fs, but when number of block groups goes beyond
>>>> 200, such tree search will become a random read, causing obvious slow
>>>> down.
>>>>
>>>> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
>>>> put CHUNK_ITEMS into their own tree and package them next to each other.
>>>>
>>>>
>>>> Following this idea, we could do the same thing for block group items,
>>>> so instead of triggering btrfs_search_slot() for each block group, we
>>>> just call btrfs_next_item() and under most case we could finish in
>>>> memory, and hugely speed up mount (see BENCHMARK below).
>>>>
>>>> The only disadvantage is, this method introduce an incompatible feature,
>>>> so existing fs can't use this feature directly.
>>>> Either specify it at mkfs time, or use btrfs-progs offline convert tool
>>>> (*).
>>>
>>> What if we start recording block group items in the chunk tree?
>>
>> Then chunk tree will be too hot.
>>
>> Currently chunk tree is pretty stable, only get modified at bg
>> creation/deletion time.
>>
>> Considering how important chunk tree is, I prefer to make chunk root as
>> cold as possible.
>>
>> On the other hand, block group items are pretty hot (although less hot
>> compared to old extent tree), so it still makes sense to put them into
>> one tree, allow chunk tree to be as cold as ice, while keep block group
>> items relatively safe compared to old extent tree.
> 
> A feature like this should come with an analysis of both approaches in
> advance. Both have pros and cons that we need to weigh. Eg. I'm not more
> for storing the items in an existing tree, possibly creating a new tree
> item that would pack the bg items together at the beginning of the tree.
> 
> The update frequency of the tree is an aspect that I haven't considered
> before but I think it's a good point.
> 
> The tree holding the bg items can be considered fundamental and requires
> a backup pointer in the superblock. So this would need more work.

Right, for backup root it indeed makes sense.

However, for the new key type approach, I don't really see any pros.

To pack bg items together, a new key type is needed anyway.
With a new key type, no matter where the new bg items are, older kernels
won't be compatible, so it is still an INCOMPAT feature.

And whatever tree ends up holding the block group items will be as hot
as the extent tree used to be, bringing that corruption possibility to
whatever existing tree it lands in, or slowing that tree down.

So at least from my point of view, storing the (new) bg items in an
existing tree doesn't make sense.

However, I think we should spend more discussion on the design of a
possible new block group item structure, e.g. removing the
chunk_objectid member, or even giving each block group its own tree.
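
For reference, the current on-disk item is just three __le64s; a
slimmed-down variant is purely hypothetical at this point (the _v2 name
is made up):

	/* current layout, from include/uapi/linux/btrfs_tree.h */
	struct btrfs_block_group_item {
		__le64 used;
		__le64 chunk_objectid;
		__le64 flags;
	} __attribute__ ((__packed__));

	/* hypothetical slimmed-down item with chunk_objectid removed,
	 * as mentioned above (not implemented) */
	struct btrfs_block_group_item_v2 {
		__le64 used;
		__le64 flags;
	} __attribute__ ((__packed__));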

Thanks,
Qu


* Re: [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time
  2019-01-03  0:14       ` Qu Wenruo
@ 2019-01-03  0:57         ` Hans van Kranenburg
  2019-01-03  1:10           ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: Hans van Kranenburg @ 2019-01-03  0:57 UTC (permalink / raw)
  To: Qu Wenruo, dsterba, Nikolay Borisov, Qu Wenruo, linux-btrfs

Hi,

On 1/3/19 1:14 AM, Qu Wenruo wrote:
> 
> 
> On 2019/1/3 上午12:21, David Sterba wrote:
>> On Fri, Dec 28, 2018 at 05:28:13PM +0800, Qu Wenruo wrote:
>>> On 2018/12/28 下午5:15, Nikolay Borisov wrote:
>>>> On 28.12.18 г. 10:37 ч., Qu Wenruo wrote:
>>>>> This patchset can be fetched from:
>>>>> https://github.com/adam900710/linux/tree/bg_tree
>>>>> Which is based on v4.20-rc1 tag.
>>>>>
>>>>> This patchset will hugely reduce mount time of large fs by putting all
>>>>> block group items into its own tree.

Thanks a lot for writing a proof of concept for this! This is great.

>>>>> The old behavior will try to read out all block group items at mount
>>>>> time, however due to the key of block group items are scattered across
>>>>> tons of extent items, we must call btrfs_search_slot() for each block
>>>>> group.
>>>>>
>>>>> It works fine for small fs, but when number of block groups goes beyond
>>>>> 200, such tree search will become a random read, causing obvious slow
>>>>> down.
>>>>>
>>>>> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
>>>>> put CHUNK_ITEMS into their own tree and package them next to each other.
>>>>>
>>>>>
>>>>> Following this idea, we could do the same thing for block group items,
>>>>> so instead of triggering btrfs_search_slot() for each block group, we
>>>>> just call btrfs_next_item() and under most case we could finish in
>>>>> memory, and hugely speed up mount (see BENCHMARK below).

+many, this is a usability "bug" that comes up regularly on the mailing
list and IRC, with people asking why their filesystem takes so long to mount.

I also have some filesystems for which I have to set noauto in fstab,
then mount them manually after booting and do some other manual tasks,
because having them mount automatically during boot causes timeouts and
stuff.

>>>>> The only disadvantage is, this method introduce an incompatible feature,
>>>>> so existing fs can't use this feature directly.
>>>>> Either specify it at mkfs time, or use btrfs-progs offline convert tool
>>>>> (*).
>>>>
>>>> What if we start recording block group items in the chunk tree?
>>>
>>> Then chunk tree will be too hot.
>>>
>>> Currently chunk tree is pretty stable, only get modified at bg
>>> creation/deletion time.
>>>
>>> Considering how important chunk tree is, I prefer to make chunk root as
>>> cold as possible.

This makes sense.

>>> On the other hand, block group items are pretty hot (although less hot
>>> compared to old extent tree), so it still makes sense to put them into
>>> one tree, allow chunk tree to be as cold as ice, while keep block group
>>> items relatively safe compared to old extent tree.
>>
>> A feature like this should come with an analysis of both approaches in
>> advance. Both have pros and cons that we need to weigh. Eg. I'm not more
>> for storing the items in an existing tree, possibly creating a new tree
>> item that would pack the bg items together at the beginning of the tree.
>>
>> The update frequency of the tree is an aspect that I haven't considered
>> before but I think it's a good point.
>>
>> The tree holding the bg items can be considered fundamental and requires
>> a backup pointer in the superblock. So this would need more work.
> 
> Right, for backup root it indeed makes sense.

I don't really understand why this backup roots mechanism keeps being a
thing, because technically btrfs cannot guarantee at all that there will
be anything usable left right after the old metadata extents are
unpinned...?

> However for another key type method, I don't really think there is any pro.
> 
> To pack bg items together, new key type is needed anyway.

Not if the block group items are the only thing in the tree...?

> With new key type, no matter where the new bg items are, older kernel
> won't be compatible, thus still INCOMPAT feature.
> 
> And for whatever the tree holding block group items, it will be as hot
> as extent tree used to be, bring up the corruption possibility to the
> whatever the existing is. Or slow down the tree.
> 
> So at least from my respect of view, storing (new) bg items in existing
> tree doesn't make sense.
> 
> However I think we should put more discussion on the possible new block
> group item structure design.
> E.g. Remove chunk_objectid member, or even each block group has its own
> tree.

Just thinking out loud...

It seems to me that keeping the same key type, the same
btrfs_block_group_item struct and the same key values as they have in
the extent tree would be desirable if both old and new code have to
co-exist in the kernel.
This is easy to review...

-	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_root *root;

[...]

+	if (btrfs_fs_incompat(fs_info, BG_TREE))
+		root = fs_info->bg_root;
+	else
+		root = fs_info->extent_root;

...but creating a new different struct and key type would cause much
more invasive code changes and duplication (and bugs) all over the
place, or wrappers to handle either scenario.

I mean, who cares about some unused chunk_objectid field on a multi-TiB
filesystem...

I'd vote for doing things, and more "design for today". Otherwise the
same thing might happen that happens with some other topics every time...
it ends up with the idea to rewrite half of btrfs, and then in the end
nothing happens at all, and the users are still unhappy. ;-)

Even if the extent tree is ever split into multiple trees, it would
still be a good idea to have this BG_TREE.

-- 
Hans van Kranenburg


* Re: [PATCH RFC 0/2] Use new incompat feature BG_TREE to hugely reduce mount time
  2019-01-03  0:57         ` Hans van Kranenburg
@ 2019-01-03  1:10           ` Qu Wenruo
  0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2019-01-03  1:10 UTC (permalink / raw)
  To: Hans van Kranenburg, dsterba, Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2019/1/3 8:57 AM, Hans van Kranenburg wrote:
> Hi,
> 
> On 1/3/19 1:14 AM, Qu Wenruo wrote:
>>
>>
>> On 2019/1/3 上午12:21, David Sterba wrote:
>>> On Fri, Dec 28, 2018 at 05:28:13PM +0800, Qu Wenruo wrote:
>>>> On 2018/12/28 下午5:15, Nikolay Borisov wrote:
>>>>> On 28.12.18 г. 10:37 ч., Qu Wenruo wrote:
>>>>>> This patchset can be fetched from:
>>>>>> https://github.com/adam900710/linux/tree/bg_tree
>>>>>> Which is based on v4.20-rc1 tag.
>>>>>>
>>>>>> This patchset will hugely reduce mount time of large fs by putting all
>>>>>> block group items into its own tree.
> 
> Thanks a lot for writing a proof of concept for this! This is great.

Not PoC anymore. :)
It passes all tests without regression.

> 
>>>>>> The old behavior will try to read out all block group items at mount
>>>>>> time, however due to the key of block group items are scattered across
>>>>>> tons of extent items, we must call btrfs_search_slot() for each block
>>>>>> group.
>>>>>>
>>>>>> It works fine for small fs, but when number of block groups goes beyond
>>>>>> 200, such tree search will become a random read, causing obvious slow
>>>>>> down.
>>>>>>
>>>>>> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
>>>>>> put CHUNK_ITEMS into their own tree and package them next to each other.
>>>>>>
>>>>>>
>>>>>> Following this idea, we could do the same thing for block group items,
>>>>>> so instead of triggering btrfs_search_slot() for each block group, we
>>>>>> just call btrfs_next_item() and under most case we could finish in
>>>>>> memory, and hugely speed up mount (see BENCHMARK below).
> 
> +many, this is a usability "bug" that comes up regularly on mailing list
> and IRC, people asking why their filesystem takes long to mount.
> 
> I also have some filesystems that I have to set noauto in fstab, and
> then after booting mount manually, and then do some other manual tasks,
> because having them mount automatically during boot causes timeouts and
> stuff.
> 
>>>>>> The only disadvantage is, this method introduce an incompatible feature,
>>>>>> so existing fs can't use this feature directly.
>>>>>> Either specify it at mkfs time, or use btrfs-progs offline convert tool
>>>>>> (*).
>>>>>
>>>>> What if we start recording block group items in the chunk tree?
>>>>
>>>> Then chunk tree will be too hot.
>>>>
>>>> Currently chunk tree is pretty stable, only get modified at bg
>>>> creation/deletion time.
>>>>
>>>> Considering how important chunk tree is, I prefer to make chunk root as
>>>> cold as possible.
> 
> This makes sense.
> 
>>>> On the other hand, block group items are pretty hot (although less hot
>>>> compared to old extent tree), so it still makes sense to put them into
>>>> one tree, allow chunk tree to be as cold as ice, while keep block group
>>>> items relatively safe compared to old extent tree.
>>>
>>> A feature like this should come with an analysis of both approaches in
>>> advance. Both have pros and cons that we need to weigh. Eg. I'm not more
>>> for storing the items in an existing tree, possibly creating a new tree
>>> item that would pack the bg items together at the beginning of the tree.
>>>
>>> The update frequency of the tree is an aspect that I haven't considered
>>> before but I think it's a good point.
>>>
>>> The tree holding the bg items can be considered fundamental and requires
>>> a backup pointer in the superblock. So this would need more work.
>>
>> Right, for backup root it indeed makes sense.
> 
> I don't really understand why this backup roots mechanism keeps being a
> thing, because technically btrfs cannot guarantee at all that there will
> be anything usable left right after the old metadata extents are
> unpinned...?

It's a fail-safe feature for btrfs-restore or similar recovery methods.

But on the other hand, for RO usage, we don't even need block group
items at all, just like my skip_bg mount option.

And for an RW mount, if all other trees are OK, btrfs check should be
able to rebuild it.

> 
>> However for another key type method, I don't really think there is any pro.
>>
>> To pack bg items together, new key type is needed anyway.
> 
> Not if the block group items are the only thing in the tree...?

That's just the method the patchset is using.

> 
>> With new key type, no matter where the new bg items are, older kernel
>> won't be compatible, thus still INCOMPAT feature.
>>
>> And for whatever the tree holding block group items, it will be as hot
>> as extent tree used to be, bring up the corruption possibility to the
>> whatever the existing is. Or slow down the tree.
>>
>> So at least from my respect of view, storing (new) bg items in existing
>> tree doesn't make sense.
>>
>> However I think we should put more discussion on the possible new block
>> group item structure design.
>> E.g. Remove chunk_objectid member, or even each block group has its own
>> tree.
> 
> Just thinking out loud...
> 
> It seems to me that keeping the same key type and btrfs_block_group_item
> struct and same key values as they have in extent tree would be
> desirable if both old and new code has to co-exist in the kernel.

Yes, that's the main point of the current implementation.

> 
> This is easy to review...
> 
> -	struct btrfs_root *root = fs_info->extent_root;
> +	struct btrfs_root *root;
> 
> [...]
> 
> +	if (btrfs_fs_incompat(fs_info, BG_TREE))
> +		root = fs_info->bg_root;
> +	else
> +		root = fs_info->extent_root;
> 
> ...but creating a new different struct and key type would cause much
> more invasive code changes and duplication (and bugs) all over the
> place, or wrappers to handle either scenario.
> 
> I mean, who cares about some unused chunk_objectid field on a multi-TiB
> filesystem...

Well, we already have a similar feature: skinny_metadata reduces the
metadata backref size, and it's already mainlined.

> 
> I'd vote for doing things, and more "design for today".

Makes sense.

> Otherwise the
> same might happen that also happens with some other topics every time...
> it ends up with the idea to rewrite half btrfs and then in the end
> nothing happens at all, and the users are still unhappy. ;-)

Well, in that case I'd prefer to create another fs, with a "better"
design from the very beginning.

Thanks,
Qu

> 
> Even when splitting the extent tree into multiple trees ever, it would
> still be a good idea to have this BG_TREE.
> 

