Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time
@ 2019-10-08  4:49 Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups() Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from:
https://github.com/adam900710/linux/tree/bg_tree
Which is based on v5.4-rc1 tag.

This patchset will hugely reduce mount time of large fs by putting all
block group items into its own tree.

The old behavior will try to read out all block group items at mount
time, however due to the key of block group items are scattered across
tons of extent items, we must call btrfs_search_slot() for each block
group.

It works fine for small fs, but when number of block groups goes beyond
200, such tree search will become a random read, causing obvious slow
down.

On the other hand, btrfs_read_chunk_tree() is still very fast, since we
put CHUNK_ITEMS into their own tree and package them next to each other.

Following this idea, we could do the same thing for block group items,
so instead of triggering btrfs_search_slot() for each block group, we
just call btrfs_next_item() and under most case we could finish in
memory, and hugely speed up mount (see BENCHMARK below).

The only disadvantage is, this method introduce an incompatible feature,
so existing fs can't use this feature directly.
Either specify it at mkfs time, or use btrfs-progs offline convert tool.

[[Benchmark]]
Since I have upgraded my rig to all NVME storage, there is no HDD
test result.

Physical device:	NVMe SSD
VM device:		VirtIO block device, backup by sparse file
Nodesize:		4K  (to bump up tree height)
Extent data size:	4M
Fs size used:		1T

All file extents on disk is in 4M size, preallocated to reduce space usage
(as the VM uses loopback block device backed by sparse file)

Without patchset:
Use ftrace function graph:

 7)               |  open_ctree [btrfs]() {
 7)               |    btrfs_read_block_groups [btrfs]() {
 7) @ 805851.8 us |    }
 7) @ 911890.2 us |  }

 btrfs_read_block_groups() takes 88% of the total mount time,

With patchset, and use -O bg-tree mkfs option:

 6)               |  open_ctree [btrfs]() {
 6)               |    btrfs_read_block_groups [btrfs]() {
 6) * 91204.69 us |    }
 6) @ 192039.5 us |  }

  open_ctree() time is only 21% of original mount time.
  And btrfs_read_block_groups() only takes 47% of total open_ctree()
  execution time.

The reason is pretty obvious when considering how many tree blocks needs
to be read from disk:
- Original extent tree:
  nodes:	55
  leaves:	1025
  total:	1080
- Block group tree:
  nodes:	1
  leaves:	13
  total:	14

Not to mention all the tree blocks readahead works pretty fine for bg
tree, as we will read every item.
While readahead for extent tree will just be a diaster, as all block
groups are scatter across the whole extent tree.

Changelog:
v2:
- Rebase to v5.4-rc1
  Minor conflicts due to code moved to block-group.c
- Fix a bug where some block groups will not be loaded at mount time
  It's a bug in that refactor patch, not exposed by previous round of
  tests.
- Add a new patch to remove a dead check
- Update benchmark to NVMe based result
  Hardware upgrade is not always a good thing for benchmark.

Qu Wenruo (3):
  btrfs: block-group: Refactor btrfs_read_block_groups()
  btrfs: disk-io: Remove unnecessary check before freeing chunk root
  btrfs: Introduce new incompat feature, BG_TREE, to speed up mount time

 fs/btrfs/block-group.c          | 306 ++++++++++++++++++++------------
 fs/btrfs/ctree.h                |   5 +-
 fs/btrfs/disk-io.c              |  16 +-
 fs/btrfs/sysfs.c                |   2 +
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |   3 +
 6 files changed, 213 insertions(+), 120 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups()
  2019-10-08  4:49 [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
@ 2019-10-08  4:49 ` Qu Wenruo
  2019-10-08  9:08   ` Johannes Thumshirn
                     ` (2 more replies)
  2019-10-08  4:49 ` [PATCH v2 2/3] btrfs: disk-io: Remove unnecessary check before freeing chunk root Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 19+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

Refactor the work inside the loop of btrfs_read_block_groups() into one
separate function, read_one_block_group().

This allows read_one_block_group to be reused for later BG_TREE feature.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 214 +++++++++++++++++++++--------------------
 1 file changed, 108 insertions(+), 106 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index bf7e3f23bba7..0c5eef0610fa 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1687,6 +1687,109 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
+static int read_one_block_group(struct btrfs_fs_info *info,
+				struct btrfs_path *path,
+				int need_clear)
+{
+	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_block_group_cache *cache;
+	struct btrfs_space_info *space_info;
+	struct btrfs_key key;
+	int mixed = btrfs_fs_incompat(info, MIXED_GROUPS);
+	int slot = path->slots[0];
+	int ret;
+
+	btrfs_item_key_to_cpu(leaf, &key, slot);
+	ASSERT(key.type == BTRFS_BLOCK_GROUP_ITEM_KEY);
+
+	cache = btrfs_create_block_group_cache(info, key.objectid,
+					       key.offset);
+	if (!cache)
+		return -ENOMEM;
+
+	if (need_clear) {
+		/*
+		 * When we mount with old space cache, we need to
+		 * set BTRFS_DC_CLEAR and set dirty flag.
+		 *
+		 * a) Setting 'BTRFS_DC_CLEAR' makes sure that we
+		 *    truncate the old free space cache inode and
+		 *    setup a new one.
+		 * b) Setting 'dirty flag' makes sure that we flush
+		 *    the new space cache info onto disk.
+		 */
+		if (btrfs_test_opt(info, SPACE_CACHE))
+			cache->disk_cache_state = BTRFS_DC_CLEAR;
+	}
+	read_extent_buffer(leaf, &cache->item,
+			   btrfs_item_ptr_offset(leaf, slot),
+			   sizeof(cache->item));
+	cache->flags = btrfs_block_group_flags(&cache->item);
+	if (!mixed && ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
+	    (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
+			btrfs_err(info,
+"bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
+				  cache->key.objectid);
+			ret = -EINVAL;
+			goto error;
+	}
+
+	/*
+	 * We need to exclude the super stripes now so that the space info has
+	 * super bytes accounted for, otherwise we'll think we have more space
+	 * than we actually do.
+	 */
+	ret = exclude_super_stripes(cache);
+	if (ret) {
+		/* We may have excluded something, so call this just in case. */
+		btrfs_free_excluded_extents(cache);
+		goto error;
+	}
+
+	/*
+	 * Check for two cases, either we are full, and therefore don't need
+	 * to bother with the caching work since we won't find any space, or we
+	 * are empty, and we can just add all the space in and be done with it.
+	 * This saves us _a_lot_ of time, particularly in the full case.
+	 */
+	if (key.offset == btrfs_block_group_used(&cache->item)) {
+		cache->last_byte_to_unpin = (u64)-1;
+		cache->cached = BTRFS_CACHE_FINISHED;
+		btrfs_free_excluded_extents(cache);
+	} else if (btrfs_block_group_used(&cache->item) == 0) {
+		cache->last_byte_to_unpin = (u64)-1;
+		cache->cached = BTRFS_CACHE_FINISHED;
+		add_new_free_space(cache, key.objectid,
+				   key.objectid + key.offset);
+		btrfs_free_excluded_extents(cache);
+	}
+	ret = btrfs_add_block_group_cache(info, cache);
+	if (ret) {
+		btrfs_remove_free_space_cache(cache);
+		goto error;
+	}
+	trace_btrfs_add_block_group(info, cache, 0);
+	btrfs_update_space_info(info, cache->flags, key.offset,
+				btrfs_block_group_used(&cache->item),
+				cache->bytes_super, &space_info);
+
+	cache->space_info = space_info;
+
+	link_block_group(cache);
+
+	set_avail_alloc_bits(info, cache->flags);
+	if (btrfs_chunk_readonly(info, cache->key.objectid)) {
+		inc_block_group_ro(cache, 1);
+	} else if (btrfs_block_group_used(&cache->item) == 0) {
+		ASSERT(list_empty(&cache->bg_list));
+		btrfs_mark_bg_unused(cache);
+	}
+	return 0;
+error:
+	btrfs_put_block_group(cache);
+	return ret;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -1694,15 +1797,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	struct btrfs_block_group_cache *cache;
 	struct btrfs_space_info *space_info;
 	struct btrfs_key key;
-	struct btrfs_key found_key;
-	struct extent_buffer *leaf;
 	int need_clear = 0;
 	u64 cache_gen;
-	u64 feature;
-	int mixed;
-
-	feature = btrfs_super_incompat_flags(info->super_copy);
-	mixed = !!(feature & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS);
 
 	key.objectid = 0;
 	key.offset = 0;
@@ -1726,107 +1822,13 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		if (ret != 0)
 			goto error;
 
-		leaf = path->nodes[0];
-		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
-
-		cache = btrfs_create_block_group_cache(info, found_key.objectid,
-						       found_key.offset);
-		if (!cache) {
-			ret = -ENOMEM;
-			goto error;
-		}
-
-		if (need_clear) {
-			/*
-			 * When we mount with old space cache, we need to
-			 * set BTRFS_DC_CLEAR and set dirty flag.
-			 *
-			 * a) Setting 'BTRFS_DC_CLEAR' makes sure that we
-			 *    truncate the old free space cache inode and
-			 *    setup a new one.
-			 * b) Setting 'dirty flag' makes sure that we flush
-			 *    the new space cache info onto disk.
-			 */
-			if (btrfs_test_opt(info, SPACE_CACHE))
-				cache->disk_cache_state = BTRFS_DC_CLEAR;
-		}
-
-		read_extent_buffer(leaf, &cache->item,
-				   btrfs_item_ptr_offset(leaf, path->slots[0]),
-				   sizeof(cache->item));
-		cache->flags = btrfs_block_group_flags(&cache->item);
-		if (!mixed &&
-		    ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
-		    (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
-			btrfs_err(info,
-"bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
-				  cache->key.objectid);
-			ret = -EINVAL;
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		ret = read_one_block_group(info, path, need_clear);
+		if (ret < 0)
 			goto error;
-		}
-
-		key.objectid = found_key.objectid + found_key.offset;
+		key.objectid += key.offset;
+		key.offset = 0;
 		btrfs_release_path(path);
-
-		/*
-		 * We need to exclude the super stripes now so that the space
-		 * info has super bytes accounted for, otherwise we'll think
-		 * we have more space than we actually do.
-		 */
-		ret = exclude_super_stripes(cache);
-		if (ret) {
-			/*
-			 * We may have excluded something, so call this just in
-			 * case.
-			 */
-			btrfs_free_excluded_extents(cache);
-			btrfs_put_block_group(cache);
-			goto error;
-		}
-
-		/*
-		 * Check for two cases, either we are full, and therefore
-		 * don't need to bother with the caching work since we won't
-		 * find any space, or we are empty, and we can just add all
-		 * the space in and be done with it.  This saves us _a_lot_ of
-		 * time, particularly in the full case.
-		 */
-		if (found_key.offset == btrfs_block_group_used(&cache->item)) {
-			cache->last_byte_to_unpin = (u64)-1;
-			cache->cached = BTRFS_CACHE_FINISHED;
-			btrfs_free_excluded_extents(cache);
-		} else if (btrfs_block_group_used(&cache->item) == 0) {
-			cache->last_byte_to_unpin = (u64)-1;
-			cache->cached = BTRFS_CACHE_FINISHED;
-			add_new_free_space(cache, found_key.objectid,
-					   found_key.objectid +
-					   found_key.offset);
-			btrfs_free_excluded_extents(cache);
-		}
-
-		ret = btrfs_add_block_group_cache(info, cache);
-		if (ret) {
-			btrfs_remove_free_space_cache(cache);
-			btrfs_put_block_group(cache);
-			goto error;
-		}
-
-		trace_btrfs_add_block_group(info, cache, 0);
-		btrfs_update_space_info(info, cache->flags, found_key.offset,
-					btrfs_block_group_used(&cache->item),
-					cache->bytes_super, &space_info);
-
-		cache->space_info = space_info;
-
-		link_block_group(cache);
-
-		set_avail_alloc_bits(info, cache->flags);
-		if (btrfs_chunk_readonly(info, cache->key.objectid)) {
-			inc_block_group_ro(cache, 1);
-		} else if (btrfs_block_group_used(&cache->item) == 0) {
-			ASSERT(list_empty(&cache->bg_list));
-			btrfs_mark_bg_unused(cache);
-		}
 	}
 
 	list_for_each_entry_rcu(space_info, &info->space_info, list) {
-- 
2.23.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 2/3] btrfs: disk-io: Remove unnecessary check before freeing chunk root
  2019-10-08  4:49 [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups() Qu Wenruo
@ 2019-10-08  4:49 ` Qu Wenruo
  2019-10-08  8:30   ` Johannes Thumshirn
  2019-10-10  2:00   ` Anand Jain
  2019-10-08  4:49 ` [PATCH v2 3/3] btrfs: Introduce new incompat feature, BG_TREE, to speed up mount time Qu Wenruo
  2019-10-08  9:14 ` [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce " Johannes Thumshirn
  3 siblings, 2 replies; 19+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

In free_root_pointers(), free_root_extent_buffers() has checked if the
@root parameter is NULL.
So there is no need checking chunk_root before freeing it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 044981cf6df9..bfeeac83b952 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2038,8 +2038,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root)
 	free_root_extent_buffers(info->csum_root);
 	free_root_extent_buffers(info->quota_root);
 	free_root_extent_buffers(info->uuid_root);
-	if (chunk_root)
-		free_root_extent_buffers(info->chunk_root);
+	free_root_extent_buffers(info->chunk_root);
 	free_root_extent_buffers(info->free_space_root);
 }
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 3/3] btrfs: Introduce new incompat feature, BG_TREE, to speed up mount time
  2019-10-08  4:49 [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups() Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 2/3] btrfs: disk-io: Remove unnecessary check before freeing chunk root Qu Wenruo
@ 2019-10-08  4:49 ` Qu Wenruo
  2019-10-08  9:09   ` Johannes Thumshirn
  2019-10-08  9:14 ` [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce " Johannes Thumshirn
  3 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

The overall idea of the new BG_TREE is pretty simple:
Put BLOCK_GROUP_ITEMS into a separate tree.

This brings one obvious enhancement:
- Reduce mount time of large fs

Although it could be possible to accept BLOCK_GROUP_ITEMS in either
trees (extent root or bg root), I'll leave that kernel convert as
alternatives to offline convert, as next step if there are a lot of
interests in that.

So for now, if an existing fs want to take advantage of BG_TREE feature,
btrfs-progs will provide offline convertion tool.

[[Benchmark]]
Physical device:	NVMe SSD
VM device:		VirtIO block device, backup by sparse file
Nodesize:		4K  (to bump up tree height)
Extent data size:	4M
Fs size used:		1T

All file extents on disk is in 4M size, preallocated to reduce space usage
(as the VM uses loopback block device backed by sparse file)

Without patchset:
Use ftrace function graph:

 7)               |  open_ctree [btrfs]() {
 7)               |    btrfs_read_block_groups [btrfs]() {
 7) @ 805851.8 us |    }
 7) @ 911890.2 us |  }

 btrfs_read_block_groups() takes 88% of the total mount time,

With patchset, and use -O bg-tree mkfs option:

 6)               |  open_ctree [btrfs]() {
 6)               |    btrfs_read_block_groups [btrfs]() {
 6) * 91204.69 us |    }
 6) @ 192039.5 us |  }

  open_ctree() time is only 21% of original mount time.
  And btrfs_read_block_groups() only takes 47% of total open_ctree()
  execution time.

The reason is pretty obvious when considering how many tree blocks needs
to be read from disk:
- Original extent tree:
  nodes:	55
  leaves:	1025
  total:	1080
- Block group tree:
  nodes:	1
  leaves:	13
  total:	14

Not to mention all the tree blocks readahead works pretty fine for bg
tree, as we will read every item.
While readahead for extent tree will just be a diaster, as all block
groups are scatter across the whole extent tree.

The reduction of mount time is already obvious even on super fast NVMe
disk with memory cache.
It would be even more obvious if the fs is on spinning rust.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c          | 104 ++++++++++++++++++++++++++------
 fs/btrfs/ctree.h                |   5 +-
 fs/btrfs/disk-io.c              |  13 ++++
 fs/btrfs/sysfs.c                |   2 +
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |   3 +
 6 files changed, 110 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 0c5eef0610fa..0101fd42b586 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -861,7 +861,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     u64 group_start, struct extent_map *em)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_root *root;
 	struct btrfs_path *path;
 	struct btrfs_block_group_cache *block_group;
 	struct btrfs_free_cluster *cluster;
@@ -880,6 +880,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	BUG_ON(!block_group);
 	BUG_ON(!block_group->ro);
 
+	if (btrfs_fs_incompat(fs_info, BG_TREE))
+		root = fs_info->bg_root;
+	else
+		root = fs_info->extent_root;
+
 	trace_btrfs_remove_block_group(block_group);
 	/*
 	 * Free the reserved super bytes from this block group before
@@ -1790,6 +1795,56 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	return ret;
 }
 
+static int read_block_group_tree(struct btrfs_fs_info *info, int need_clear)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	int ret;
+
+	key.objectid = 0;
+	key.offset = 0;
+	key.type = 0;
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(NULL, info->bg_root, &key, path, 0, 0);
+	if (ret < 0)
+		return ret;
+	if (ret == 0) {
+		btrfs_err(info,
+			  "found invalid block group bytenr %llu len %llu",
+			  key.objectid, key.offset);
+		ret = -EUCLEAN;
+		goto out;
+	}
+
+	while (1) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (key.type != BTRFS_BLOCK_GROUP_ITEM_KEY) {
+			btrfs_err(info,
+		"found invalid key (%llu, %u, %llu) in block group tree",
+				  key.objectid, key.type, key.offset);
+			ret = -EUCLEAN;
+			goto out;
+		}
+
+		ret = read_one_block_group(info, path, need_clear);
+		if (ret < 0)
+			goto out;
+		ret = btrfs_next_item(info->bg_root, path);
+		if (ret < 0)
+			goto out;
+		if (ret > 0) {
+			ret = 0;
+			goto out;
+		}
+	}
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -1815,20 +1870,26 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	if (btrfs_test_opt(info, CLEAR_CACHE))
 		need_clear = 1;
 
-	while (1) {
-		ret = find_first_block_group(info, path, &key);
-		if (ret > 0)
-			break;
-		if (ret != 0)
-			goto error;
-
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-		ret = read_one_block_group(info, path, need_clear);
+	if (btrfs_fs_incompat(info, BG_TREE)) {
+		ret = read_block_group_tree(info, need_clear);
 		if (ret < 0)
 			goto error;
-		key.objectid += key.offset;
-		key.offset = 0;
-		btrfs_release_path(path);
+	} else {
+		while (1) {
+			ret = find_first_block_group(info, path, &key);
+			if (ret > 0)
+				break;
+			if (ret != 0)
+				goto error;
+
+			btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+			ret = read_one_block_group(info, path, need_clear);
+			if (ret < 0)
+				goto error;
+			key.objectid += key.offset;
+			key.offset = 0;
+			btrfs_release_path(path);
+		}
 	}
 
 	list_for_each_entry_rcu(space_info, &info->space_info, list) {
@@ -1863,7 +1924,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_block_group_cache *block_group;
-	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_root *root;
 	struct btrfs_block_group_item item;
 	struct btrfs_key key;
 	int ret = 0;
@@ -1871,6 +1932,11 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 	if (!trans->can_flush_pending_bgs)
 		return;
 
+	if (btrfs_fs_incompat(fs_info, BG_TREE))
+		root = fs_info->bg_root;
+	else
+		root = fs_info->extent_root;
+
 	while (!list_empty(&trans->new_bgs)) {
 		block_group = list_first_entry(&trans->new_bgs,
 					       struct btrfs_block_group_cache,
@@ -1883,7 +1949,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 		memcpy(&key, &block_group->key, sizeof(key));
 		spin_unlock(&block_group->lock);
 
-		ret = btrfs_insert_item(trans, extent_root, &key, &item,
+		ret = btrfs_insert_item(trans, root, &key, &item,
 					sizeof(item));
 		if (ret)
 			btrfs_abort_transaction(trans, ret);
@@ -2119,11 +2185,15 @@ static int write_one_cache_group(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	int ret;
-	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_root *root;
 	unsigned long bi;
 	struct extent_buffer *leaf;
 
-	ret = btrfs_search_slot(trans, extent_root, &cache->key, path, 0, 1);
+	if (btrfs_fs_incompat(fs_info, BG_TREE))
+		root = fs_info->bg_root;
+	else
+		root = fs_info->extent_root;
+	ret = btrfs_search_slot(trans, root, &cache->key, path, 0, 1);
 	if (ret) {
 		if (ret > 0)
 			ret = -ENOENT;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 19d669d12ca1..c98d008fb42f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -291,7 +291,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
-	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
+	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
+	 BTRFS_FEATURE_INCOMPAT_BG_TREE)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
@@ -538,6 +539,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *quota_root;
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *free_space_root;
+	struct btrfs_root *bg_root;
 
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
@@ -2661,6 +2663,7 @@ static inline void free_fs_info(struct btrfs_fs_info *fs_info)
 	kfree(fs_info->quota_root);
 	kfree(fs_info->uuid_root);
 	kfree(fs_info->free_space_root);
+	kfree(fs_info->bg_root);
 	kfree(fs_info->super_copy);
 	kfree(fs_info->super_for_commit);
 	kvfree(fs_info);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bfeeac83b952..b9693d82bb6f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1570,6 +1570,8 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
 		return fs_info->free_space_root ? fs_info->free_space_root :
 						  ERR_PTR(-ENOENT);
+	if (location->objectid == BTRFS_BLOCK_GROUP_TREE_OBJECTID)
+		return fs_info->bg_root ? fs_info->bg_root: ERR_PTR(-ENOENT);
 again:
 	root = btrfs_lookup_fs_root(fs_info, location->objectid);
 	if (root) {
@@ -2040,6 +2042,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root)
 	free_root_extent_buffers(info->uuid_root);
 	free_root_extent_buffers(info->chunk_root);
 	free_root_extent_buffers(info->free_space_root);
+	free_root_extent_buffers(info->bg_root);
 }
 
 void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info)
@@ -2332,6 +2335,16 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
 	fs_info->extent_root = root;
 
+	if (btrfs_fs_incompat(fs_info, BG_TREE)) {
+		location.objectid = BTRFS_BLOCK_GROUP_TREE_OBJECTID;
+		root = btrfs_read_tree_root(tree_root, &location);
+		if (IS_ERR(root)) {
+			ret = PTR_ERR(root);
+			goto out;
+		}
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->bg_root = root;
+	}
 	location.objectid = BTRFS_DEV_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index f6d3c80f2e28..799f7e18d6ee 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -257,6 +257,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
 BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
+BTRFS_FEAT_ATTR_INCOMPAT(bg_tree, BG_TREE);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
@@ -272,6 +273,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(no_holes),
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
+	BTRFS_FEAT_ATTR_PTR(bg_tree),
 	NULL
 };
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 3ee0678c0a83..0f91ebc07064 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -270,6 +270,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
+#define BTRFS_FEATURE_INCOMPAT_BG_TREE		(1ULL << 11)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index b65c7ee75bc7..1d1e50c42d0e 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -48,6 +48,9 @@
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* sotre BLOCK_GROUP_ITEMs in a seprate tree */
+#define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: disk-io: Remove unnecessary check before freeing chunk root
  2019-10-08  4:49 ` [PATCH v2 2/3] btrfs: disk-io: Remove unnecessary check before freeing chunk root Qu Wenruo
@ 2019-10-08  8:30   ` Johannes Thumshirn
  2019-10-10  2:00   ` Anand Jain
  1 sibling, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2019-10-08  8:30 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups()
  2019-10-08  4:49 ` [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups() Qu Wenruo
@ 2019-10-08  9:08   ` Johannes Thumshirn
  2019-10-09 12:08   ` Anand Jain
  2019-10-09 13:07   ` [PATCH " Qu Wenruo
  2 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2019-10-08  9:08 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good, I especially like the consolidation of the
btrfs_put_block_group(cache) calls.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/3] btrfs: Introduce new incompat feature, BG_TREE, to speed up mount time
  2019-10-08  4:49 ` [PATCH v2 3/3] btrfs: Introduce new incompat feature, BG_TREE, to speed up mount time Qu Wenruo
@ 2019-10-08  9:09   ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2019-10-08  9:09 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time
  2019-10-08  4:49 [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
                   ` (2 preceding siblings ...)
  2019-10-08  4:49 ` [PATCH v2 3/3] btrfs: Introduce new incompat feature, BG_TREE, to speed up mount time Qu Wenruo
@ 2019-10-08  9:14 ` " Johannes Thumshirn
  2019-10-08  9:26   ` Qu Wenruo
  3 siblings, 1 reply; 19+ messages in thread
From: Johannes Thumshirn @ 2019-10-08  9:14 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: felix

> [[Benchmark]]
> Since I have upgraded my rig to all NVME storage, there is no HDD
> test result.
> 
> Physical device:	NVMe SSD
> VM device:		VirtIO block device, backup by sparse file
> Nodesize:		4K  (to bump up tree height)
> Extent data size:	4M
> Fs size used:		1T
> 
> All file extents on disk is in 4M size, preallocated to reduce space usage
> (as the VM uses loopback block device backed by sparse file)

Do you have a some additional details about the test setup? I tried to
do the same (testing) for a bug Felix (added to Cc) reported to my at
the ALPSS Conference and I couldn't reproduce the issue.

My testing was a 100TB sparse file passed into a VM and running this
script to touch all blockgroups:

#!/bin/sh

FILE=/mnt/test

add_dirty_bg() {
        off="$1"
        len="$2"
        touch $FILE
        xfs_io -c "falloc $off $len" $FILE
        rm $FILE
}

mkfs.btrfs /dev/vda
mount /dev/vda /mnt

for ((i = 1; i < 100000; i++)); do
        add_dirty_bg $i"G" "1G"
done

umount /mnt



-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time
  2019-10-08  9:14 ` [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce " Johannes Thumshirn
@ 2019-10-08  9:26   ` Qu Wenruo
  2019-10-08  9:47     ` Johannes Thumshirn
  0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2019-10-08  9:26 UTC (permalink / raw)
  To: Johannes Thumshirn, Qu Wenruo, linux-btrfs; +Cc: felix

[-- Attachment #1.1: Type: text/plain, Size: 2660 bytes --]



On 2019/10/8 下午5:14, Johannes Thumshirn wrote:
>> [[Benchmark]]
>> Since I have upgraded my rig to all NVME storage, there is no HDD
>> test result.
>>
>> Physical device:	NVMe SSD
>> VM device:		VirtIO block device, backup by sparse file
>> Nodesize:		4K  (to bump up tree height)
>> Extent data size:	4M
>> Fs size used:		1T
>>
>> All file extents on disk is in 4M size, preallocated to reduce space usage
>> (as the VM uses loopback block device backed by sparse file)
> 
> Do you have a some additional details about the test setup? I tried to
> do the same (testing) for a bug Felix (added to Cc) reported to my at
> the ALPSS Conference and I couldn't reproduce the issue.
> 
> My testing was a 100TB sparse file passed into a VM and running this
> script to touch all blockgroups:

Here is my test scripts:
---
#!/bin/bash

dev="/dev/vdb"
mnt="/mnt/btrfs"

nr_subv=16
nr_extents=16384
extent_size=$((4 * 1024 * 1024)) # 4M

_fail()
{
        echo "!!! FAILED: $@ !!!"
        exit 1
}

fill_one_subv()
{
        path=$1
        if [ -z $path ]; then
                _fail "wrong parameter for fill_one_subv"
        fi
        btrfs subv create $path || _fail "create subv"

        for i in $(seq 0 $((nr_extents - 1))); do
                fallocate -o $((i * $extent_size)) -l $extent_size
$path/file || _fail "fallocate"
        done
}

declare -a pids
umount $mnt &> /dev/null
umount $dev &> /dev/null

#~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
mkfs.btrfs -f -n 4k $dev
mount $dev $mnt -o nospace_cache

for i in $(seq 1 $nr_subv); do
        fill_one_subv $mnt/subv_${i} &
        pids[$i]=$!
done

for i in $(seq 1 $nr_subv); do
        wait ${pids[$i]}
done
sync
umount $dev

---

> 
> #!/bin/sh
> 
> FILE=/mnt/test
> 
> add_dirty_bg() {
>         off="$1"
>         len="$2"
>         touch $FILE
>         xfs_io -c "falloc $off $len" $FILE
>         rm $FILE
> }
> 
> mkfs.btrfs /dev/vda
> mount /dev/vda /mnt
> 
> for ((i = 1; i < 100000; i++)); do
>         add_dirty_bg $i"G" "1G"
> done

This wont really build a good enough extent tree layout.

1G fallocate will only cause 8 128M file extents, thus 8 EXTENT_ITEMs.

Thus a leaf (16K by default) can still contain a lot of BLOCK_GROUPS all
together.

To build a case to really show the problem, you'll need a lot of
EXTENT_ITEM/METADATA_ITEMS to fill the gaps between BLOCK_GROUPS.

My test scripts did that, but may still not represent the real world, as
real world can cause even smaller extents due to snapshots.

Thanks,
Qu

> 
> umount /mnt
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time
  2019-10-08  9:26   ` Qu Wenruo
@ 2019-10-08  9:47     ` Johannes Thumshirn
       [not found]       ` <b4821d86-eeb9-f21c-66aa-480df2d3a13d@feldspaten.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Johannes Thumshirn @ 2019-10-08  9:47 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs; +Cc: felix

[-- Attachment #1.1: Type: text/plain, Size: 3213 bytes --]

On 08/10/2019 11:26, Qu Wenruo wrote:
> 
> 
> On 2019/10/8 下午5:14, Johannes Thumshirn wrote:
>>> [[Benchmark]]
>>> Since I have upgraded my rig to all NVME storage, there is no HDD
>>> test result.
>>>
>>> Physical device:	NVMe SSD
>>> VM device:		VirtIO block device, backup by sparse file
>>> Nodesize:		4K  (to bump up tree height)
>>> Extent data size:	4M
>>> Fs size used:		1T
>>>
>>> All file extents on disk is in 4M size, preallocated to reduce space usage
>>> (as the VM uses loopback block device backed by sparse file)
>>
>> Do you have a some additional details about the test setup? I tried to
>> do the same (testing) for a bug Felix (added to Cc) reported to my at
>> the ALPSS Conference and I couldn't reproduce the issue.
>>
>> My testing was a 100TB sparse file passed into a VM and running this
>> script to touch all blockgroups:
> 
> Here is my test scripts:
> ---
> #!/bin/bash
> 
> dev="/dev/vdb"
> mnt="/mnt/btrfs"
> 
> nr_subv=16
> nr_extents=16384
> extent_size=$((4 * 1024 * 1024)) # 4M
> 
> _fail()
> {
>         echo "!!! FAILED: $@ !!!"
>         exit 1
> }
> 
> fill_one_subv()
> {
>         path=$1
>         if [ -z $path ]; then
>                 _fail "wrong parameter for fill_one_subv"
>         fi
>         btrfs subv create $path || _fail "create subv"
> 
>         for i in $(seq 0 $((nr_extents - 1))); do
>                 fallocate -o $((i * $extent_size)) -l $extent_size
> $path/file || _fail "fallocate"
>         done
> }
> 
> declare -a pids
> umount $mnt &> /dev/null
> umount $dev &> /dev/null
> 
> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
> mkfs.btrfs -f -n 4k $dev
> mount $dev $mnt -o nospace_cache
> 
> for i in $(seq 1 $nr_subv); do
>         fill_one_subv $mnt/subv_${i} &
>         pids[$i]=$!
> done
> 
> for i in $(seq 1 $nr_subv); do
>         wait ${pids[$i]}
> done
> sync
> umount $dev
> 
> ---
> 
>>
>> #!/bin/sh
>>
>> FILE=/mnt/test
>>
>> add_dirty_bg() {
>>         off="$1"
>>         len="$2"
>>         touch $FILE
>>         xfs_io -c "falloc $off $len" $FILE
>>         rm $FILE
>> }
>>
>> mkfs.btrfs /dev/vda
>> mount /dev/vda /mnt
>>
>> for ((i = 1; i < 100000; i++)); do
>>         add_dirty_bg $i"G" "1G"
>> done
> 
> This wont really build a good enough extent tree layout.
> 
> 1G fallocate will only cause 8 128M file extents, thus 8 EXTENT_ITEMs.
> 
> Thus a leaf (16K by default) can still contain a lot of BLOCK_GROUPS all
> together.
> 
> To build a case to really show the problem, you'll need a lot of
> EXTENT_ITEM/METADATA_ITEMS to fill the gaps between BLOCK_GROUPS.
> 
> My test scripts did that, but may still not represent the real world, as
> real world can cause even smaller extents due to snapshots.
> 

Ah thanks for the explanation. I'll give your testscript a try.


-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time
       [not found]       ` <b4821d86-eeb9-f21c-66aa-480df2d3a13d@feldspaten.org>
@ 2019-10-09  7:43         ` Qu Wenruo
  2019-10-09  8:08           ` Felix Niederwanger
  0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2019-10-09  7:43 UTC (permalink / raw)
  To: Felix Niederwanger, Johannes Thumshirn, Qu Wenruo, linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 6092 bytes --]



On 2019/10/9 下午3:07, Felix Niederwanger wrote:
> Hey Johannes,
> 
> glad to hear back from you :-)
> 
> As discussed I try to elaborate the setup where we experienced the
> issue, that btrfs mount takes more than 5 minutes. The initial bug
> report is at https://bugzilla.opensuse.org/show_bug.cgi?id=1143865
> 
> Physical device:             Hardware RAID controller ARECA-1883 PCIe 3.0 to SAS/SATA 12Gb RAID Controller
>                              Hardware RAID6+HotSpare, 8 TB Seagate IronWolf NAS HDD
> Installed System:            OPENSUSE LEAP 15.1
> Disks:                       / is on a separate DOM
>                              /dev/sda1 is the affected btrfs volume
> 
> Disk layout
> 
> sda      8:0    0 98.2T  0 disk 
> └─sda1   8:1    0 81.9T  0 part /ESO-RAID

How much space is used?

With enough space used (especially your tens of TB used), it's pretty
easy to have too many block groups items to overload the extent tree.

There is a better tool to explore the problem easier:
https://github.com/adam900710/btrfs-progs/tree/account_bgs

You can compile the btrfs-corrupt-block tool, and then:
# ./btrfs-corrupt-block -X /dev/sda1

It's recommended to call it with fs unmounted.

Then it should output something like:
extent_tree: total=1080 leaves=1025

Then please post that line to surprise us.

It shows how many unique tree blocks are needed to be read from disk,
just for iterating the block group items.

You could consider it as how many random IO needs to be done in nodesize
(normally 16K).

> sdb      8:16   0   59G  0 disk 
> ├─sdb1   8:17   0    1G  0 part /boot
> ├─sdb2   8:18   0  5.9G  0 part [SWAP]
> └─sdb3   8:19   0 52.1G  0 part /
> 
> System configuration : Opensuse LEAP 15.1 with "Server" configuration,
> installed NFS server.
> 
> I copied data from the old NAS (separate server, xfs volume) to the new
> btrfs volume using rsync.

If you are willing to/have enough spare space to test, you could try
that my latest bg-tree feature, to see if it would solve the problem.

My not-so-optimized guess that feature would reduce mount time to around
1min.
My average guess is, around 30s.

Thanks,
Qu

> Then I performed a system update with zypper, rebooted and run into the
> problems described in
> https://bugzilla.opensuse.org/show_bug.cgi?id=1143865. In short: Boot
> failed, because mounting /ESO-RAID run into a 5 minutes timeout. Manual
> mount worked fine (but took up to 6 minutes) and the filesystem was
> completely unresponsive. See the bug report for more details about what
> became unresponsive.
> 
> A movie of the failing boot process is still on my webserver:
> ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4
> 
> 
> I hope this contributes to reproduce the issue. Feel free to contact me
> if you need further details,
> 
> Greetings,
> Felix :-)
> 
> 
> On 10/8/19 11:47 AM, Johannes Thumshirn wrote:
>> On 08/10/2019 11:26, Qu Wenruo wrote:
>>> On 2019/10/8 下午5:14, Johannes Thumshirn wrote:
>>>>> [[Benchmark]]
>>>>> Since I have upgraded my rig to all NVME storage, there is no HDD
>>>>> test result.
>>>>>
>>>>> Physical device:	NVMe SSD
>>>>> VM device:		VirtIO block device, backup by sparse file
>>>>> Nodesize:		4K  (to bump up tree height)
>>>>> Extent data size:	4M
>>>>> Fs size used:		1T
>>>>>
>>>>> All file extents on disk is in 4M size, preallocated to reduce space usage
>>>>> (as the VM uses loopback block device backed by sparse file)
>>>> Do you have a some additional details about the test setup? I tried to
>>>> do the same (testing) for a bug Felix (added to Cc) reported to my at
>>>> the ALPSS Conference and I couldn't reproduce the issue.
>>>>
>>>> My testing was a 100TB sparse file passed into a VM and running this
>>>> script to touch all blockgroups:
>>> Here is my test scripts:
>>> ---
>>> #!/bin/bash
>>>
>>> dev="/dev/vdb"
>>> mnt="/mnt/btrfs"
>>>
>>> nr_subv=16
>>> nr_extents=16384
>>> extent_size=$((4 * 1024 * 1024)) # 4M
>>>
>>> _fail()
>>> {
>>>         echo "!!! FAILED: $@ !!!"
>>>         exit 1
>>> }
>>>
>>> fill_one_subv()
>>> {
>>>         path=$1
>>>         if [ -z $path ]; then
>>>                 _fail "wrong parameter for fill_one_subv"
>>>         fi
>>>         btrfs subv create $path || _fail "create subv"
>>>
>>>         for i in $(seq 0 $((nr_extents - 1))); do
>>>                 fallocate -o $((i * $extent_size)) -l $extent_size
>>> $path/file || _fail "fallocate"
>>>         done
>>> }
>>>
>>> declare -a pids
>>> umount $mnt &> /dev/null
>>> umount $dev &> /dev/null
>>>
>>> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
>>> mkfs.btrfs -f -n 4k $dev
>>> mount $dev $mnt -o nospace_cache
>>>
>>> for i in $(seq 1 $nr_subv); do
>>>         fill_one_subv $mnt/subv_${i} &
>>>         pids[$i]=$!
>>> done
>>>
>>> for i in $(seq 1 $nr_subv); do
>>>         wait ${pids[$i]}
>>> done
>>> sync
>>> umount $dev
>>>
>>> ---
>>>
>>>> #!/bin/sh
>>>>
>>>> FILE=/mnt/test
>>>>
>>>> add_dirty_bg() {
>>>>         off="$1"
>>>>         len="$2"
>>>>         touch $FILE
>>>>         xfs_io -c "falloc $off $len" $FILE
>>>>         rm $FILE
>>>> }
>>>>
>>>> mkfs.btrfs /dev/vda
>>>> mount /dev/vda /mnt
>>>>
>>>> for ((i = 1; i < 100000; i++)); do
>>>>         add_dirty_bg $i"G" "1G"
>>>> done
>>> This wont really build a good enough extent tree layout.
>>>
>>> 1G fallocate will only cause 8 128M file extents, thus 8 EXTENT_ITEMs.
>>>
>>> Thus a leaf (16K by default) can still contain a lot of BLOCK_GROUPS all
>>> together.
>>>
>>> To build a case to really show the problem, you'll need a lot of
>>> EXTENT_ITEM/METADATA_ITEMS to fill the gaps between BLOCK_GROUPS.
>>>
>>> My test scripts did that, but may still not represent the real world, as
>>> real world can cause even smaller extents due to snapshots.
>>>
>> Ah thanks for the explanation. I'll give your testscript a try.
>>
>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time
  2019-10-09  7:43         ` Qu Wenruo
@ 2019-10-09  8:08           ` Felix Niederwanger
  2019-10-09 11:00             ` Qu Wenruo
  0 siblings, 1 reply; 19+ messages in thread
From: Felix Niederwanger @ 2019-10-09  8:08 UTC (permalink / raw)
  To: Qu Wenruo, Johannes Thumshirn, Qu Wenruo, linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 7666 bytes --]

Hi Qu,

I'm afraid the system is now in a very different configuration and
already in production, which makes it impossible to run specific tests
on it.

At the time the problem occurred, we were using about 57/82 TB filled
with approx 20-30 million individual files with varying file sized. I
created a histogram of the most common unsed filesizes for a small subset:

# find . -type f -print0 | xargs -0 ls -l | awk
'{size[int(log($5)/log(2))]++}END{for (i in size) printf("%10d %3d\n",
2^i, size[i])}' | sort -n

         0 7992
         1 3072
         2   4
         8 1536
       128   2
       512  22
      1024 4600
      2048 3341
      4096 5671
      8192 940
     16384 6535
     32768 700
     65536  17
    131072 3843
    262144 3362
    524288 2143
   1048576 1169
   2097152 856
   4194304 168
   8388608 5579
  16777216 4052
  33554432 604
  67108864 890

26240 out of 57098 files (45%) are <=4k in size. This should be
representative for most of the files on the affected volume.

The affected system is now in production (xfs on the same RAID6) and as
I left university, it's unfortunately impossible to run any more tests
on that particular system.

Greetings,
Felix


On 10/9/19 9:43 AM, Qu Wenruo wrote:
>
> On 2019/10/9 下午3:07, Felix Niederwanger wrote:
>> Hey Johannes,
>>
>> glad to hear back from you :-)
>>
>> As discussed I try to elaborate the setup where we experienced the
>> issue, that btrfs mount takes more than 5 minutes. The initial bug
>> report is at https://bugzilla.opensuse.org/show_bug.cgi?id=1143865
>>
>> Physical device:             Hardware RAID controller ARECA-1883 PCIe 3.0 to SAS/SATA 12Gb RAID Controller
>>                              Hardware RAID6+HotSpare, 8 TB Seagate IronWolf NAS HDD
>> Installed System:            OPENSUSE LEAP 15.1
>> Disks:                       / is on a separate DOM
>>                              /dev/sda1 is the affected btrfs volume
>>
>> Disk layout
>>
>> sda      8:0    0 98.2T  0 disk 
>> └─sda1   8:1    0 81.9T  0 part /ESO-RAID
> How much space is used?
>
> With enough space used (especially your tens of TB used), it's pretty
> easy to have too many block groups items to overload the extent tree.
>
> There is a better tool to explore the problem easier:
> https://github.com/adam900710/btrfs-progs/tree/account_bgs
>
> You can compile the btrfs-corrupt-block tool, and then:
> # ./btrfs-corrupt-block -X /dev/sda1
>
> It's recommended to call it with fs unmounted.
>
> Then it should output something like:
> extent_tree: total=1080 leaves=1025
>
> Then please post that line to surprise us.
>
> It shows how many unique tree blocks are needed to be read from disk,
> just for iterating the block group items.
>
> You could consider it as how many random IO needs to be done in nodesize
> (normally 16K).
>
>> sdb      8:16   0   59G  0 disk 
>> ├─sdb1   8:17   0    1G  0 part /boot
>> ├─sdb2   8:18   0  5.9G  0 part [SWAP]
>> └─sdb3   8:19   0 52.1G  0 part /
>>
>> System configuration : Opensuse LEAP 15.1 with "Server" configuration,
>> installed NFS server.
>>
>> I copied data from the old NAS (separate server, xfs volume) to the new
>> btrfs volume using rsync.
> If you are willing to/have enough spare space to test, you could try
> that my latest bg-tree feature, to see if it would solve the problem.
>
> My not-so-optimized guess that feature would reduce mount time to around
> 1min.
> My average guess is, around 30s.
>
> Thanks,
> Qu
>
>> Then I performed a system update with zypper, rebooted and run into the
>> problems described in
>> https://bugzilla.opensuse.org/show_bug.cgi?id=1143865. In short: Boot
>> failed, because mounting /ESO-RAID run into a 5 minutes timeout. Manual
>> mount worked fine (but took up to 6 minutes) and the filesystem was
>> completely unresponsive. See the bug report for more details about what
>> became unresponsive.
>>
>> A movie of the failing boot process is still on my webserver:
>> ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4
>>
>>
>> I hope this contributes to reproduce the issue. Feel free to contact me
>> if you need further details,
>>
>> Greetings,
>> Felix :-)
>>
>>
>> On 10/8/19 11:47 AM, Johannes Thumshirn wrote:
>>> On 08/10/2019 11:26, Qu Wenruo wrote:
>>>> On 2019/10/8 下午5:14, Johannes Thumshirn wrote:
>>>>>> [[Benchmark]]
>>>>>> Since I have upgraded my rig to all NVME storage, there is no HDD
>>>>>> test result.
>>>>>>
>>>>>> Physical device:	NVMe SSD
>>>>>> VM device:		VirtIO block device, backup by sparse file
>>>>>> Nodesize:		4K  (to bump up tree height)
>>>>>> Extent data size:	4M
>>>>>> Fs size used:		1T
>>>>>>
>>>>>> All file extents on disk is in 4M size, preallocated to reduce space usage
>>>>>> (as the VM uses loopback block device backed by sparse file)
>>>>> Do you have a some additional details about the test setup? I tried to
>>>>> do the same (testing) for a bug Felix (added to Cc) reported to my at
>>>>> the ALPSS Conference and I couldn't reproduce the issue.
>>>>>
>>>>> My testing was a 100TB sparse file passed into a VM and running this
>>>>> script to touch all blockgroups:
>>>> Here is my test scripts:
>>>> ---
>>>> #!/bin/bash
>>>>
>>>> dev="/dev/vdb"
>>>> mnt="/mnt/btrfs"
>>>>
>>>> nr_subv=16
>>>> nr_extents=16384
>>>> extent_size=$((4 * 1024 * 1024)) # 4M
>>>>
>>>> _fail()
>>>> {
>>>>         echo "!!! FAILED: $@ !!!"
>>>>         exit 1
>>>> }
>>>>
>>>> fill_one_subv()
>>>> {
>>>>         path=$1
>>>>         if [ -z $path ]; then
>>>>                 _fail "wrong parameter for fill_one_subv"
>>>>         fi
>>>>         btrfs subv create $path || _fail "create subv"
>>>>
>>>>         for i in $(seq 0 $((nr_extents - 1))); do
>>>>                 fallocate -o $((i * $extent_size)) -l $extent_size
>>>> $path/file || _fail "fallocate"
>>>>         done
>>>> }
>>>>
>>>> declare -a pids
>>>> umount $mnt &> /dev/null
>>>> umount $dev &> /dev/null
>>>>
>>>> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
>>>> mkfs.btrfs -f -n 4k $dev
>>>> mount $dev $mnt -o nospace_cache
>>>>
>>>> for i in $(seq 1 $nr_subv); do
>>>>         fill_one_subv $mnt/subv_${i} &
>>>>         pids[$i]=$!
>>>> done
>>>>
>>>> for i in $(seq 1 $nr_subv); do
>>>>         wait ${pids[$i]}
>>>> done
>>>> sync
>>>> umount $dev
>>>>
>>>> ---
>>>>
>>>>> #!/bin/sh
>>>>>
>>>>> FILE=/mnt/test
>>>>>
>>>>> add_dirty_bg() {
>>>>>         off="$1"
>>>>>         len="$2"
>>>>>         touch $FILE
>>>>>         xfs_io -c "falloc $off $len" $FILE
>>>>>         rm $FILE
>>>>> }
>>>>>
>>>>> mkfs.btrfs /dev/vda
>>>>> mount /dev/vda /mnt
>>>>>
>>>>> for ((i = 1; i < 100000; i++)); do
>>>>>         add_dirty_bg $i"G" "1G"
>>>>> done
>>>> This wont really build a good enough extent tree layout.
>>>>
>>>> 1G fallocate will only cause 8 128M file extents, thus 8 EXTENT_ITEMs.
>>>>
>>>> Thus a leaf (16K by default) can still contain a lot of BLOCK_GROUPS all
>>>> together.
>>>>
>>>> To build a case to really show the problem, you'll need a lot of
>>>> EXTENT_ITEM/METADATA_ITEMS to fill the gaps between BLOCK_GROUPS.
>>>>
>>>> My test scripts did that, but may still not represent the real world, as
>>>> real world can cause even smaller extents due to snapshots.
>>>>
>>> Ah thanks for the explanation. I'll give your testscript a try.
>>>
>>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time
  2019-10-09  8:08           ` Felix Niederwanger
@ 2019-10-09 11:00             ` Qu Wenruo
  0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2019-10-09 11:00 UTC (permalink / raw)
  To: Felix Niederwanger, Johannes Thumshirn, Qu Wenruo, linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 8664 bytes --]



On 2019/10/9 下午4:08, Felix Niederwanger wrote:
> Hi Qu,
> 
> I'm afraid the system is now in a very different configuration and
> already in production, which makes it impossible to run specific tests
> on it.
> 
> At the time the problem occurred, we were using about 57/82 TB filled
> with approx 20-30 million individual files with varying file sized. I
> created a histogram of the most common unsed filesizes for a small subset:
> 
> # find . -type f -print0 | xargs -0 ls -l | awk
> '{size[int(log($5)/log(2))]++}END{for (i in size) printf("%10d %3d\n",
> 2^i, size[i])}' | sort -n
> 
>          0 7992
>          1 3072
>          2   4
>          8 1536
>        128   2
>        512  22
>       1024 4600
>       2048 3341
>       4096 5671
>       8192 940
>      16384 6535
>      32768 700
>      65536  17
>     131072 3843
>     262144 3362
>     524288 2143
>    1048576 1169
>    2097152 856
>    4194304 168
>    8388608 5579
>   16777216 4052
>   33554432 604
>   67108864 890
> 
> 26240 out of 57098 files (45%) are <=4k in size. This should be
> representative for most of the files on the affected volume.

That explains the problem.

There are 3 main factors contributing to the mount time:
- Number of block groups
  This directly affects how many tree search we need to do.

- Number of extents
  This affects how random the block group iteration will be.

- Disk IOPS performance
  Obviously, since bg iteration is mostly random IO, it's IOPS affecting
  the overall mount time.

In your case, your fs seems to have all the boxes checked.

Anyway, it still stands inside the assumption I have, although it's a
pity that we can't get a real world benchmark, it still contributes to
the motivation of bg-tree feature.

Thanks,
Qu
> 
> The affected system is now in production (xfs on the same RAID6) and as
> I left university, it's unfortunately impossible to run any more tests
> on that particular system.
> 
> Greetings,
> Felix
> 
> 
> On 10/9/19 9:43 AM, Qu Wenruo wrote:
>>
>> On 2019/10/9 下午3:07, Felix Niederwanger wrote:
>>> Hey Johannes,
>>>
>>> glad to hear back from you :-)
>>>
>>> As discussed I try to elaborate the setup where we experienced the
>>> issue, that btrfs mount takes more than 5 minutes. The initial bug
>>> report is at https://bugzilla.opensuse.org/show_bug.cgi?id=1143865
>>>
>>> Physical device:             Hardware RAID controller ARECA-1883 PCIe 3.0 to SAS/SATA 12Gb RAID Controller
>>>                              Hardware RAID6+HotSpare, 8 TB Seagate IronWolf NAS HDD
>>> Installed System:            OPENSUSE LEAP 15.1
>>> Disks:                       / is on a separate DOM
>>>                              /dev/sda1 is the affected btrfs volume
>>>
>>> Disk layout
>>>
>>> sda      8:0    0 98.2T  0 disk 
>>> └─sda1   8:1    0 81.9T  0 part /ESO-RAID
>> How much space is used?
>>
>> With enough space used (especially your tens of TB used), it's pretty
>> easy to have too many block groups items to overload the extent tree.
>>
>> There is a better tool to explore the problem easier:
>> https://github.com/adam900710/btrfs-progs/tree/account_bgs
>>
>> You can compile the btrfs-corrupt-block tool, and then:
>> # ./btrfs-corrupt-block -X /dev/sda1
>>
>> It's recommended to call it with fs unmounted.
>>
>> Then it should output something like:
>> extent_tree: total=1080 leaves=1025
>>
>> Then please post that line to surprise us.
>>
>> It shows how many unique tree blocks are needed to be read from disk,
>> just for iterating the block group items.
>>
>> You could consider it as how many random IO needs to be done in nodesize
>> (normally 16K).
>>
>>> sdb      8:16   0   59G  0 disk 
>>> ├─sdb1   8:17   0    1G  0 part /boot
>>> ├─sdb2   8:18   0  5.9G  0 part [SWAP]
>>> └─sdb3   8:19   0 52.1G  0 part /
>>>
>>> System configuration : Opensuse LEAP 15.1 with "Server" configuration,
>>> installed NFS server.
>>>
>>> I copied data from the old NAS (separate server, xfs volume) to the new
>>> btrfs volume using rsync.
>> If you are willing to/have enough spare space to test, you could try
>> that my latest bg-tree feature, to see if it would solve the problem.
>>
>> My not-so-optimized guess that feature would reduce mount time to around
>> 1min.
>> My average guess is, around 30s.
>>
>> Thanks,
>> Qu
>>
>>> Then I performed a system update with zypper, rebooted and run into the
>>> problems described in
>>> https://bugzilla.opensuse.org/show_bug.cgi?id=1143865. In short: Boot
>>> failed, because mounting /ESO-RAID run into a 5 minutes timeout. Manual
>>> mount worked fine (but took up to 6 minutes) and the filesystem was
>>> completely unresponsive. See the bug report for more details about what
>>> became unresponsive.
>>>
>>> A movie of the failing boot process is still on my webserver:
>>> ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4
>>>
>>>
>>> I hope this contributes to reproduce the issue. Feel free to contact me
>>> if you need further details,
>>>
>>> Greetings,
>>> Felix :-)
>>>
>>>
>>> On 10/8/19 11:47 AM, Johannes Thumshirn wrote:
>>>> On 08/10/2019 11:26, Qu Wenruo wrote:
>>>>> On 2019/10/8 下午5:14, Johannes Thumshirn wrote:
>>>>>>> [[Benchmark]]
>>>>>>> Since I have upgraded my rig to all NVME storage, there is no HDD
>>>>>>> test result.
>>>>>>>
>>>>>>> Physical device:	NVMe SSD
>>>>>>> VM device:		VirtIO block device, backup by sparse file
>>>>>>> Nodesize:		4K  (to bump up tree height)
>>>>>>> Extent data size:	4M
>>>>>>> Fs size used:		1T
>>>>>>>
>>>>>>> All file extents on disk is in 4M size, preallocated to reduce space usage
>>>>>>> (as the VM uses loopback block device backed by sparse file)
>>>>>> Do you have a some additional details about the test setup? I tried to
>>>>>> do the same (testing) for a bug Felix (added to Cc) reported to my at
>>>>>> the ALPSS Conference and I couldn't reproduce the issue.
>>>>>>
>>>>>> My testing was a 100TB sparse file passed into a VM and running this
>>>>>> script to touch all blockgroups:
>>>>> Here is my test scripts:
>>>>> ---
>>>>> #!/bin/bash
>>>>>
>>>>> dev="/dev/vdb"
>>>>> mnt="/mnt/btrfs"
>>>>>
>>>>> nr_subv=16
>>>>> nr_extents=16384
>>>>> extent_size=$((4 * 1024 * 1024)) # 4M
>>>>>
>>>>> _fail()
>>>>> {
>>>>>         echo "!!! FAILED: $@ !!!"
>>>>>         exit 1
>>>>> }
>>>>>
>>>>> fill_one_subv()
>>>>> {
>>>>>         path=$1
>>>>>         if [ -z $path ]; then
>>>>>                 _fail "wrong parameter for fill_one_subv"
>>>>>         fi
>>>>>         btrfs subv create $path || _fail "create subv"
>>>>>
>>>>>         for i in $(seq 0 $((nr_extents - 1))); do
>>>>>                 fallocate -o $((i * $extent_size)) -l $extent_size
>>>>> $path/file || _fail "fallocate"
>>>>>         done
>>>>> }
>>>>>
>>>>> declare -a pids
>>>>> umount $mnt &> /dev/null
>>>>> umount $dev &> /dev/null
>>>>>
>>>>> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
>>>>> mkfs.btrfs -f -n 4k $dev
>>>>> mount $dev $mnt -o nospace_cache
>>>>>
>>>>> for i in $(seq 1 $nr_subv); do
>>>>>         fill_one_subv $mnt/subv_${i} &
>>>>>         pids[$i]=$!
>>>>> done
>>>>>
>>>>> for i in $(seq 1 $nr_subv); do
>>>>>         wait ${pids[$i]}
>>>>> done
>>>>> sync
>>>>> umount $dev
>>>>>
>>>>> ---
>>>>>
>>>>>> #!/bin/sh
>>>>>>
>>>>>> FILE=/mnt/test
>>>>>>
>>>>>> add_dirty_bg() {
>>>>>>         off="$1"
>>>>>>         len="$2"
>>>>>>         touch $FILE
>>>>>>         xfs_io -c "falloc $off $len" $FILE
>>>>>>         rm $FILE
>>>>>> }
>>>>>>
>>>>>> mkfs.btrfs /dev/vda
>>>>>> mount /dev/vda /mnt
>>>>>>
>>>>>> for ((i = 1; i < 100000; i++)); do
>>>>>>         add_dirty_bg $i"G" "1G"
>>>>>> done
>>>>> This wont really build a good enough extent tree layout.
>>>>>
>>>>> 1G fallocate will only cause 8 128M file extents, thus 8 EXTENT_ITEMs.
>>>>>
>>>>> Thus a leaf (16K by default) can still contain a lot of BLOCK_GROUPS all
>>>>> together.
>>>>>
>>>>> To build a case to really show the problem, you'll need a lot of
>>>>> EXTENT_ITEM/METADATA_ITEMS to fill the gaps between BLOCK_GROUPS.
>>>>>
>>>>> My test scripts did that, but may still not represent the real world, as
>>>>> real world can cause even smaller extents due to snapshots.
>>>>>
>>>> Ah thanks for the explanation. I'll give your testscript a try.
>>>>
>>>>
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups()
  2019-10-08  4:49 ` [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups() Qu Wenruo
  2019-10-08  9:08   ` Johannes Thumshirn
@ 2019-10-09 12:08   ` Anand Jain
  2019-10-09 12:14     ` Qu WenRuo
  2019-10-09 13:07   ` [PATCH " Qu Wenruo
  2 siblings, 1 reply; 19+ messages in thread
From: Anand Jain @ 2019-10-09 12:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On 10/8/19 12:49 PM, Qu Wenruo wrote:
> Refactor the work inside the loop of btrfs_read_block_groups() into one
> separate function, read_one_block_group().
> 
> This allows read_one_block_group to be reused for later BG_TREE feature.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Nit: Change log didn't mention about things which is also fixed here.
  1. The incompatible feature check cleanups.

+       int mixed = btrfs_fs_incompat(info, MIXED_GROUPS);

  2. The bug we failed to call "btrfs_put_block_group(cache);" at [1]

-----------------
-               if (!mixed &&
-                   ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
-                   (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
-                       btrfs_err(info,
-"bg %llu is a mixed block group but filesystem hasn't enabled mixed 
block groups",
-                                 cache->key.objectid);
-                       ret = -EINVAL;                     <---- [1]
+               btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+               ret = read_one_block_group(info, path, need_clear);
+               if (ret < 0)
                         goto error;
-               }
-----------------

Otherwise looks good.

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thanks, Anand

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups()
  2019-10-09 12:08   ` Anand Jain
@ 2019-10-09 12:14     ` Qu WenRuo
  0 siblings, 0 replies; 19+ messages in thread
From: Qu WenRuo @ 2019-10-09 12:14 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs



On 2019/10/9 下午8:08, Anand Jain wrote:
> On 10/8/19 12:49 PM, Qu Wenruo wrote:
>> Refactor the work inside the loop of btrfs_read_block_groups() into one
>> separate function, read_one_block_group().
>>
>> This allows read_one_block_group to be reused for later BG_TREE feature.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Nit: Change log didn't mention about things which is also fixed here.
>  1. The incompatible feature check cleanups.
> 
> +       int mixed = btrfs_fs_incompat(info, MIXED_GROUPS);
> 
>  2. The bug we failed to call "btrfs_put_block_group(cache);" at [1]
> 
> -----------------
> -               if (!mixed &&
> -                   ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
> -                   (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
> -                       btrfs_err(info,
> -"bg %llu is a mixed block group but filesystem hasn't enabled mixed
> block groups",
> -                                 cache->key.objectid);
> -                       ret = -EINVAL;                     <---- [1]
> +               btrfs_item_key_to_cpu(path->nodes[0], &key,
> path->slots[0]);
> +               ret = read_one_block_group(info, path, need_clear);
> +               if (ret < 0)
>                         goto error;
> -               }
> -----------------

Thanks for catching that!

In fact I remembered I fixed an error out problem, but by somehow I
didn't found it in the patch/diff.

I'll send out a small update for this one and update my github branch.

Thanks,
Qu
> 
> Otherwise looks good.
> 
> Reviewed-by: Anand Jain <anand.jain@oracle.com>
> 
> Thanks, Anand
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] btrfs: block-group: Refactor btrfs_read_block_groups()
  2019-10-08  4:49 ` [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups() Qu Wenruo
  2019-10-08  9:08   ` Johannes Thumshirn
  2019-10-09 12:08   ` Anand Jain
@ 2019-10-09 13:07   ` " Qu Wenruo
  2019-10-09 14:25     ` Filipe Manana
  2 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2019-10-09 13:07 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Anand Jain

Refactor the work inside the loop of btrfs_read_block_groups() into one
separate function, read_one_block_group().

This allows read_one_block_group to be reused for later BG_TREE feature.

The refactor does the following extra fixes:
- Use btrfs_fs_incompat() to replace open-coded feature check
- Fix a missing "btrfs_put_block_group(cache)" in error path

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
---
Changelog:
v2.1->v2
- Add commit message for the fixes done in the refactor
- Add reviewed-by tags
---
 fs/btrfs/block-group.c | 214 +++++++++++++++++++++--------------------
 1 file changed, 108 insertions(+), 106 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index bf7e3f23bba7..0c5eef0610fa 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1687,6 +1687,109 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
+static int read_one_block_group(struct btrfs_fs_info *info,
+				struct btrfs_path *path,
+				int need_clear)
+{
+	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_block_group_cache *cache;
+	struct btrfs_space_info *space_info;
+	struct btrfs_key key;
+	int mixed = btrfs_fs_incompat(info, MIXED_GROUPS);
+	int slot = path->slots[0];
+	int ret;
+
+	btrfs_item_key_to_cpu(leaf, &key, slot);
+	ASSERT(key.type == BTRFS_BLOCK_GROUP_ITEM_KEY);
+
+	cache = btrfs_create_block_group_cache(info, key.objectid,
+					       key.offset);
+	if (!cache)
+		return -ENOMEM;
+
+	if (need_clear) {
+		/*
+		 * When we mount with old space cache, we need to
+		 * set BTRFS_DC_CLEAR and set dirty flag.
+		 *
+		 * a) Setting 'BTRFS_DC_CLEAR' makes sure that we
+		 *    truncate the old free space cache inode and
+		 *    setup a new one.
+		 * b) Setting 'dirty flag' makes sure that we flush
+		 *    the new space cache info onto disk.
+		 */
+		if (btrfs_test_opt(info, SPACE_CACHE))
+			cache->disk_cache_state = BTRFS_DC_CLEAR;
+	}
+	read_extent_buffer(leaf, &cache->item,
+			   btrfs_item_ptr_offset(leaf, slot),
+			   sizeof(cache->item));
+	cache->flags = btrfs_block_group_flags(&cache->item);
+	if (!mixed && ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
+	    (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
+			btrfs_err(info,
+"bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
+				  cache->key.objectid);
+			ret = -EINVAL;
+			goto error;
+	}
+
+	/*
+	 * We need to exclude the super stripes now so that the space info has
+	 * super bytes accounted for, otherwise we'll think we have more space
+	 * than we actually do.
+	 */
+	ret = exclude_super_stripes(cache);
+	if (ret) {
+		/* We may have excluded something, so call this just in case. */
+		btrfs_free_excluded_extents(cache);
+		goto error;
+	}
+
+	/*
+	 * Check for two cases, either we are full, and therefore don't need
+	 * to bother with the caching work since we won't find any space, or we
+	 * are empty, and we can just add all the space in and be done with it.
+	 * This saves us _a_lot_ of time, particularly in the full case.
+	 */
+	if (key.offset == btrfs_block_group_used(&cache->item)) {
+		cache->last_byte_to_unpin = (u64)-1;
+		cache->cached = BTRFS_CACHE_FINISHED;
+		btrfs_free_excluded_extents(cache);
+	} else if (btrfs_block_group_used(&cache->item) == 0) {
+		cache->last_byte_to_unpin = (u64)-1;
+		cache->cached = BTRFS_CACHE_FINISHED;
+		add_new_free_space(cache, key.objectid,
+				   key.objectid + key.offset);
+		btrfs_free_excluded_extents(cache);
+	}
+	ret = btrfs_add_block_group_cache(info, cache);
+	if (ret) {
+		btrfs_remove_free_space_cache(cache);
+		goto error;
+	}
+	trace_btrfs_add_block_group(info, cache, 0);
+	btrfs_update_space_info(info, cache->flags, key.offset,
+				btrfs_block_group_used(&cache->item),
+				cache->bytes_super, &space_info);
+
+	cache->space_info = space_info;
+
+	link_block_group(cache);
+
+	set_avail_alloc_bits(info, cache->flags);
+	if (btrfs_chunk_readonly(info, cache->key.objectid)) {
+		inc_block_group_ro(cache, 1);
+	} else if (btrfs_block_group_used(&cache->item) == 0) {
+		ASSERT(list_empty(&cache->bg_list));
+		btrfs_mark_bg_unused(cache);
+	}
+	return 0;
+error:
+	btrfs_put_block_group(cache);
+	return ret;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -1694,15 +1797,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	struct btrfs_block_group_cache *cache;
 	struct btrfs_space_info *space_info;
 	struct btrfs_key key;
-	struct btrfs_key found_key;
-	struct extent_buffer *leaf;
 	int need_clear = 0;
 	u64 cache_gen;
-	u64 feature;
-	int mixed;
-
-	feature = btrfs_super_incompat_flags(info->super_copy);
-	mixed = !!(feature & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS);
 
 	key.objectid = 0;
 	key.offset = 0;
@@ -1726,107 +1822,13 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		if (ret != 0)
 			goto error;
 
-		leaf = path->nodes[0];
-		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
-
-		cache = btrfs_create_block_group_cache(info, found_key.objectid,
-						       found_key.offset);
-		if (!cache) {
-			ret = -ENOMEM;
-			goto error;
-		}
-
-		if (need_clear) {
-			/*
-			 * When we mount with old space cache, we need to
-			 * set BTRFS_DC_CLEAR and set dirty flag.
-			 *
-			 * a) Setting 'BTRFS_DC_CLEAR' makes sure that we
-			 *    truncate the old free space cache inode and
-			 *    setup a new one.
-			 * b) Setting 'dirty flag' makes sure that we flush
-			 *    the new space cache info onto disk.
-			 */
-			if (btrfs_test_opt(info, SPACE_CACHE))
-				cache->disk_cache_state = BTRFS_DC_CLEAR;
-		}
-
-		read_extent_buffer(leaf, &cache->item,
-				   btrfs_item_ptr_offset(leaf, path->slots[0]),
-				   sizeof(cache->item));
-		cache->flags = btrfs_block_group_flags(&cache->item);
-		if (!mixed &&
-		    ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
-		    (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
-			btrfs_err(info,
-"bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
-				  cache->key.objectid);
-			ret = -EINVAL;
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		ret = read_one_block_group(info, path, need_clear);
+		if (ret < 0)
 			goto error;
-		}
-
-		key.objectid = found_key.objectid + found_key.offset;
+		key.objectid += key.offset;
+		key.offset = 0;
 		btrfs_release_path(path);
-
-		/*
-		 * We need to exclude the super stripes now so that the space
-		 * info has super bytes accounted for, otherwise we'll think
-		 * we have more space than we actually do.
-		 */
-		ret = exclude_super_stripes(cache);
-		if (ret) {
-			/*
-			 * We may have excluded something, so call this just in
-			 * case.
-			 */
-			btrfs_free_excluded_extents(cache);
-			btrfs_put_block_group(cache);
-			goto error;
-		}
-
-		/*
-		 * Check for two cases, either we are full, and therefore
-		 * don't need to bother with the caching work since we won't
-		 * find any space, or we are empty, and we can just add all
-		 * the space in and be done with it.  This saves us _a_lot_ of
-		 * time, particularly in the full case.
-		 */
-		if (found_key.offset == btrfs_block_group_used(&cache->item)) {
-			cache->last_byte_to_unpin = (u64)-1;
-			cache->cached = BTRFS_CACHE_FINISHED;
-			btrfs_free_excluded_extents(cache);
-		} else if (btrfs_block_group_used(&cache->item) == 0) {
-			cache->last_byte_to_unpin = (u64)-1;
-			cache->cached = BTRFS_CACHE_FINISHED;
-			add_new_free_space(cache, found_key.objectid,
-					   found_key.objectid +
-					   found_key.offset);
-			btrfs_free_excluded_extents(cache);
-		}
-
-		ret = btrfs_add_block_group_cache(info, cache);
-		if (ret) {
-			btrfs_remove_free_space_cache(cache);
-			btrfs_put_block_group(cache);
-			goto error;
-		}
-
-		trace_btrfs_add_block_group(info, cache, 0);
-		btrfs_update_space_info(info, cache->flags, found_key.offset,
-					btrfs_block_group_used(&cache->item),
-					cache->bytes_super, &space_info);
-
-		cache->space_info = space_info;
-
-		link_block_group(cache);
-
-		set_avail_alloc_bits(info, cache->flags);
-		if (btrfs_chunk_readonly(info, cache->key.objectid)) {
-			inc_block_group_ro(cache, 1);
-		} else if (btrfs_block_group_used(&cache->item) == 0) {
-			ASSERT(list_empty(&cache->bg_list));
-			btrfs_mark_bg_unused(cache);
-		}
 	}
 
 	list_for_each_entry_rcu(space_info, &info->space_info, list) {
-- 
2.23.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] btrfs: block-group: Refactor btrfs_read_block_groups()
  2019-10-09 13:07   ` [PATCH " Qu Wenruo
@ 2019-10-09 14:25     ` Filipe Manana
  0 siblings, 0 replies; 19+ messages in thread
From: Filipe Manana @ 2019-10-09 14:25 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn, Anand Jain

On Wed, Oct 9, 2019 at 2:09 PM Qu Wenruo <wqu@suse.com> wrote:
>
> Refactor the work inside the loop of btrfs_read_block_groups() into one
> separate function, read_one_block_group().
>
> This allows read_one_block_group to be reused for later BG_TREE feature.
>
> The refactor does the following extra fixes:
> - Use btrfs_fs_incompat() to replace open-coded feature check
> - Fix a missing "btrfs_put_block_group(cache)" in error path

I think that (the bug fix for the error path) should go into a
separate patch, so that it can be backported without adding a refactor
(something not meant for stable releases).

Thanks.

>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Reviewed-by: Anand Jain <anand.jain@oracle.com>
> ---
> Changelog:
> v2.1->v2
> - Add commit message for the fixes done in the refactor
> - Add reviewed-by tags
> ---
>  fs/btrfs/block-group.c | 214 +++++++++++++++++++++--------------------
>  1 file changed, 108 insertions(+), 106 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index bf7e3f23bba7..0c5eef0610fa 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1687,6 +1687,109 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
>         return ret;
>  }
>
> +static int read_one_block_group(struct btrfs_fs_info *info,
> +                               struct btrfs_path *path,
> +                               int need_clear)
> +{
> +       struct extent_buffer *leaf = path->nodes[0];
> +       struct btrfs_block_group_cache *cache;
> +       struct btrfs_space_info *space_info;
> +       struct btrfs_key key;
> +       int mixed = btrfs_fs_incompat(info, MIXED_GROUPS);
> +       int slot = path->slots[0];
> +       int ret;
> +
> +       btrfs_item_key_to_cpu(leaf, &key, slot);
> +       ASSERT(key.type == BTRFS_BLOCK_GROUP_ITEM_KEY);
> +
> +       cache = btrfs_create_block_group_cache(info, key.objectid,
> +                                              key.offset);
> +       if (!cache)
> +               return -ENOMEM;
> +
> +       if (need_clear) {
> +               /*
> +                * When we mount with old space cache, we need to
> +                * set BTRFS_DC_CLEAR and set dirty flag.
> +                *
> +                * a) Setting 'BTRFS_DC_CLEAR' makes sure that we
> +                *    truncate the old free space cache inode and
> +                *    setup a new one.
> +                * b) Setting 'dirty flag' makes sure that we flush
> +                *    the new space cache info onto disk.
> +                */
> +               if (btrfs_test_opt(info, SPACE_CACHE))
> +                       cache->disk_cache_state = BTRFS_DC_CLEAR;
> +       }
> +       read_extent_buffer(leaf, &cache->item,
> +                          btrfs_item_ptr_offset(leaf, slot),
> +                          sizeof(cache->item));
> +       cache->flags = btrfs_block_group_flags(&cache->item);
> +       if (!mixed && ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
> +           (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
> +                       btrfs_err(info,
> +"bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
> +                                 cache->key.objectid);
> +                       ret = -EINVAL;
> +                       goto error;
> +       }
> +
> +       /*
> +        * We need to exclude the super stripes now so that the space info has
> +        * super bytes accounted for, otherwise we'll think we have more space
> +        * than we actually do.
> +        */
> +       ret = exclude_super_stripes(cache);
> +       if (ret) {
> +               /* We may have excluded something, so call this just in case. */
> +               btrfs_free_excluded_extents(cache);
> +               goto error;
> +       }
> +
> +       /*
> +        * Check for two cases, either we are full, and therefore don't need
> +        * to bother with the caching work since we won't find any space, or we
> +        * are empty, and we can just add all the space in and be done with it.
> +        * This saves us _a_lot_ of time, particularly in the full case.
> +        */
> +       if (key.offset == btrfs_block_group_used(&cache->item)) {
> +               cache->last_byte_to_unpin = (u64)-1;
> +               cache->cached = BTRFS_CACHE_FINISHED;
> +               btrfs_free_excluded_extents(cache);
> +       } else if (btrfs_block_group_used(&cache->item) == 0) {
> +               cache->last_byte_to_unpin = (u64)-1;
> +               cache->cached = BTRFS_CACHE_FINISHED;
> +               add_new_free_space(cache, key.objectid,
> +                                  key.objectid + key.offset);
> +               btrfs_free_excluded_extents(cache);
> +       }
> +       ret = btrfs_add_block_group_cache(info, cache);
> +       if (ret) {
> +               btrfs_remove_free_space_cache(cache);
> +               goto error;
> +       }
> +       trace_btrfs_add_block_group(info, cache, 0);
> +       btrfs_update_space_info(info, cache->flags, key.offset,
> +                               btrfs_block_group_used(&cache->item),
> +                               cache->bytes_super, &space_info);
> +
> +       cache->space_info = space_info;
> +
> +       link_block_group(cache);
> +
> +       set_avail_alloc_bits(info, cache->flags);
> +       if (btrfs_chunk_readonly(info, cache->key.objectid)) {
> +               inc_block_group_ro(cache, 1);
> +       } else if (btrfs_block_group_used(&cache->item) == 0) {
> +               ASSERT(list_empty(&cache->bg_list));
> +               btrfs_mark_bg_unused(cache);
> +       }
> +       return 0;
> +error:
> +       btrfs_put_block_group(cache);
> +       return ret;
> +}
> +
>  int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  {
>         struct btrfs_path *path;
> @@ -1694,15 +1797,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>         struct btrfs_block_group_cache *cache;
>         struct btrfs_space_info *space_info;
>         struct btrfs_key key;
> -       struct btrfs_key found_key;
> -       struct extent_buffer *leaf;
>         int need_clear = 0;
>         u64 cache_gen;
> -       u64 feature;
> -       int mixed;
> -
> -       feature = btrfs_super_incompat_flags(info->super_copy);
> -       mixed = !!(feature & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS);
>
>         key.objectid = 0;
>         key.offset = 0;
> @@ -1726,107 +1822,13 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>                 if (ret != 0)
>                         goto error;
>
> -               leaf = path->nodes[0];
> -               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
> -
> -               cache = btrfs_create_block_group_cache(info, found_key.objectid,
> -                                                      found_key.offset);
> -               if (!cache) {
> -                       ret = -ENOMEM;
> -                       goto error;
> -               }
> -
> -               if (need_clear) {
> -                       /*
> -                        * When we mount with old space cache, we need to
> -                        * set BTRFS_DC_CLEAR and set dirty flag.
> -                        *
> -                        * a) Setting 'BTRFS_DC_CLEAR' makes sure that we
> -                        *    truncate the old free space cache inode and
> -                        *    setup a new one.
> -                        * b) Setting 'dirty flag' makes sure that we flush
> -                        *    the new space cache info onto disk.
> -                        */
> -                       if (btrfs_test_opt(info, SPACE_CACHE))
> -                               cache->disk_cache_state = BTRFS_DC_CLEAR;
> -               }
> -
> -               read_extent_buffer(leaf, &cache->item,
> -                                  btrfs_item_ptr_offset(leaf, path->slots[0]),
> -                                  sizeof(cache->item));
> -               cache->flags = btrfs_block_group_flags(&cache->item);
> -               if (!mixed &&
> -                   ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
> -                   (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
> -                       btrfs_err(info,
> -"bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
> -                                 cache->key.objectid);
> -                       ret = -EINVAL;
> +               btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +               ret = read_one_block_group(info, path, need_clear);
> +               if (ret < 0)
>                         goto error;
> -               }
> -
> -               key.objectid = found_key.objectid + found_key.offset;
> +               key.objectid += key.offset;
> +               key.offset = 0;
>                 btrfs_release_path(path);
> -
> -               /*
> -                * We need to exclude the super stripes now so that the space
> -                * info has super bytes accounted for, otherwise we'll think
> -                * we have more space than we actually do.
> -                */
> -               ret = exclude_super_stripes(cache);
> -               if (ret) {
> -                       /*
> -                        * We may have excluded something, so call this just in
> -                        * case.
> -                        */
> -                       btrfs_free_excluded_extents(cache);
> -                       btrfs_put_block_group(cache);
> -                       goto error;
> -               }
> -
> -               /*
> -                * Check for two cases, either we are full, and therefore
> -                * don't need to bother with the caching work since we won't
> -                * find any space, or we are empty, and we can just add all
> -                * the space in and be done with it.  This saves us _a_lot_ of
> -                * time, particularly in the full case.
> -                */
> -               if (found_key.offset == btrfs_block_group_used(&cache->item)) {
> -                       cache->last_byte_to_unpin = (u64)-1;
> -                       cache->cached = BTRFS_CACHE_FINISHED;
> -                       btrfs_free_excluded_extents(cache);
> -               } else if (btrfs_block_group_used(&cache->item) == 0) {
> -                       cache->last_byte_to_unpin = (u64)-1;
> -                       cache->cached = BTRFS_CACHE_FINISHED;
> -                       add_new_free_space(cache, found_key.objectid,
> -                                          found_key.objectid +
> -                                          found_key.offset);
> -                       btrfs_free_excluded_extents(cache);
> -               }
> -
> -               ret = btrfs_add_block_group_cache(info, cache);
> -               if (ret) {
> -                       btrfs_remove_free_space_cache(cache);
> -                       btrfs_put_block_group(cache);
> -                       goto error;
> -               }
> -
> -               trace_btrfs_add_block_group(info, cache, 0);
> -               btrfs_update_space_info(info, cache->flags, found_key.offset,
> -                                       btrfs_block_group_used(&cache->item),
> -                                       cache->bytes_super, &space_info);
> -
> -               cache->space_info = space_info;
> -
> -               link_block_group(cache);
> -
> -               set_avail_alloc_bits(info, cache->flags);
> -               if (btrfs_chunk_readonly(info, cache->key.objectid)) {
> -                       inc_block_group_ro(cache, 1);
> -               } else if (btrfs_block_group_used(&cache->item) == 0) {
> -                       ASSERT(list_empty(&cache->bg_list));
> -                       btrfs_mark_bg_unused(cache);
> -               }
>         }
>
>         list_for_each_entry_rcu(space_info, &info->space_info, list) {
> --
> 2.23.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: disk-io: Remove unnecessary check before freeing chunk root
  2019-10-08  4:49 ` [PATCH v2 2/3] btrfs: disk-io: Remove unnecessary check before freeing chunk root Qu Wenruo
  2019-10-08  8:30   ` Johannes Thumshirn
@ 2019-10-10  2:00   ` Anand Jain
  2019-10-10  2:21     ` Qu Wenruo
  1 sibling, 1 reply; 19+ messages in thread
From: Anand Jain @ 2019-10-10  2:00 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 8/10/19 12:49 PM, Qu Wenruo wrote:
> In free_root_pointers(), free_root_extent_buffers() has checked if the
> @root parameter is NULL.
> So there is no need checking chunk_root before freeing it.

um..

> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/disk-io.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 044981cf6df9..bfeeac83b952 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2038,8 +2038,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root)
>   	free_root_extent_buffers(info->csum_root);
>   	free_root_extent_buffers(info->quota_root);
>   	free_root_extent_buffers(info->uuid_root);
> -	if (chunk_root)
> -		free_root_extent_buffers(info->chunk_root);
> +	free_root_extent_buffers(info->chunk_root);
>   	free_root_extent_buffers(info->free_space_root);
>   }
>   
> 

This will cause the regression and fails mount from the backup root.

We have %recovery_tree_root: which shall re-read all the trees root
except the chunk root.

I don't think your idea was to re-read the chunk root as well?.

-----------------
recovery_tree_root:
         if (!btrfs_test_opt(fs_info, USEBACKUPROOT))
                 goto fail_tree_roots;

         free_root_pointers(fs_info, 0);
-----------------

Thanks, Anand

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: disk-io: Remove unnecessary check before freeing chunk root
  2019-10-10  2:00   ` Anand Jain
@ 2019-10-10  2:21     ` Qu Wenruo
  0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2019-10-10  2:21 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 1739 bytes --]



On 2019/10/10 上午10:00, Anand Jain wrote:
> On 8/10/19 12:49 PM, Qu Wenruo wrote:
>> In free_root_pointers(), free_root_extent_buffers() has checked if the
>> @root parameter is NULL.
>> So there is no need checking chunk_root before freeing it.
> 
> um..

Oh, my bad, I get confused with the parameter @chunk_root against
@fs_info->chunk_root.

So please discard the patch.

Thanks,
Qu

> 
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/disk-io.c | 3 +--
>>   1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 044981cf6df9..bfeeac83b952 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2038,8 +2038,7 @@ static void free_root_pointers(struct
>> btrfs_fs_info *info, int chunk_root)
>>       free_root_extent_buffers(info->csum_root);
>>       free_root_extent_buffers(info->quota_root);
>>       free_root_extent_buffers(info->uuid_root);
>> -    if (chunk_root)
>> -        free_root_extent_buffers(info->chunk_root);
>> +    free_root_extent_buffers(info->chunk_root);
>>       free_root_extent_buffers(info->free_space_root);
>>   }
>>  
> 
> This will cause the regression and fails mount from the backup root.
> 
> We have %recovery_tree_root: which shall re-read all the trees root
> except the chunk root.
> 
> I don't think your idea was to re-read the chunk root as well?.
> 
> -----------------
> recovery_tree_root:
>         if (!btrfs_test_opt(fs_info, USEBACKUPROOT))
>                 goto fail_tree_roots;
> 
>         free_root_pointers(fs_info, 0);
> -----------------
> 
> Thanks, Anand


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, back to index

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-08  4:49 [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
2019-10-08  4:49 ` [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups() Qu Wenruo
2019-10-08  9:08   ` Johannes Thumshirn
2019-10-09 12:08   ` Anand Jain
2019-10-09 12:14     ` Qu WenRuo
2019-10-09 13:07   ` [PATCH " Qu Wenruo
2019-10-09 14:25     ` Filipe Manana
2019-10-08  4:49 ` [PATCH v2 2/3] btrfs: disk-io: Remove unnecessary check before freeing chunk root Qu Wenruo
2019-10-08  8:30   ` Johannes Thumshirn
2019-10-10  2:00   ` Anand Jain
2019-10-10  2:21     ` Qu Wenruo
2019-10-08  4:49 ` [PATCH v2 3/3] btrfs: Introduce new incompat feature, BG_TREE, to speed up mount time Qu Wenruo
2019-10-08  9:09   ` Johannes Thumshirn
2019-10-08  9:14 ` [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce " Johannes Thumshirn
2019-10-08  9:26   ` Qu Wenruo
2019-10-08  9:47     ` Johannes Thumshirn
     [not found]       ` <b4821d86-eeb9-f21c-66aa-480df2d3a13d@feldspaten.org>
2019-10-09  7:43         ` Qu Wenruo
2019-10-09  8:08           ` Felix Niederwanger
2019-10-09 11:00             ` Qu Wenruo

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox