* [PATCH v4 0/7] btrfs: Introduce new incompat feature SKINNY_BG_TREE to hugely reduce mount time
From: Qu Wenruo @ 2020-05-04 23:58 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from:
https://github.com/adam900710/linux/tree/skinny_bg_tree
which is based on the david/misc-next branch.

This patchset hugely reduces the mount time of a large fs by putting all
block group items into their own tree, and further compacts the block
group item design to make full use of btrfs_key.

The old behavior reads out all block group items at mount time; however,
since the keys of block group items are scattered across tons of extent
items, we must call btrfs_search_slot() for each block group.

This works fine for a small fs, but when the number of block groups goes
beyond 200, each tree search becomes a random read, causing an obvious
slowdown.

On the other hand, btrfs_read_chunk_tree() is still very fast, since we
put CHUNK_ITEMs into their own tree and pack them next to each other.

Following this idea, we can do the same thing for block group items:
instead of triggering btrfs_search_slot() for each block group, we just
call btrfs_next_item(), and in most cases we can finish in memory, which
hugely speeds up mount (see the Benchmark section below).
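
To illustrate, here is a condensed sketch of the two read loops (adapted
from the code in this series, with error handling and setup omitted):

  /* Old: one tree search per block group, keys scattered in the extent tree */
  while (find_first_block_group(info, path, &key) == 0) {
          btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
          read_one_block_group(info, path, &key, need_clear);
          key.objectid += key.offset;
          key.offset = 0;
          btrfs_release_path(path);
  }

  /* New: a single search, then walk the packed bg tree item by item */
  btrfs_search_slot(NULL, info->bg_root, &key, path, 0, 0);
  do {
          btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
          read_one_block_group(info, path, &key, need_clear);
  } while (btrfs_next_item(info->bg_root, path) == 0);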

The only disadvantage is that this method introduces an incompatible
feature, so an existing fs can't use this feature directly.
This can be improved to RO compatible, as long as btrfs can go skip_bg
automatically (another patchset is needed).

Either specify it at mkfs time, or use the btrfs-progs offline conversion
tool.
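
For example, creating a new fs with the feature enabled would look
something like this (the device path is just a placeholder; the option
name follows the -O skinny-bg-tree used in the benchmark below):

  # mkfs.btrfs -O skinny-bg-tree /dev/vdb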

[[Benchmark]]
Since I have upgraded my rig to all-NVMe storage, there is no HDD
test result.

Physical device:	NVMe SSD
VM device:		VirtIO block device, backed by a sparse file
Nodesize:		4K  (to bump up tree height)
Extent data size:	4M
Fs size used:		1T

All file extents on disk are 4M in size, preallocated to reduce space
usage (as the VM uses a loopback block device backed by a sparse file).

Without the patchset:
Using an ftrace function graph:

 7)               |  open_ctree [btrfs]() {
 7)               |    btrfs_read_block_groups [btrfs]() {
 7) @ 805851.8 us |    }
 7) @ 911890.2 us |  }

 btrfs_read_block_groups() takes 88% of the total mount time.

With the patchset, using the -O skinny-bg-tree mkfs option:

 5)               |  open_ctree [btrfs]() {
 5)               |    btrfs_read_block_groups [btrfs]() {
 5) * 63395.75 us |    }
 5) @ 143106.9 us |  }

  open_ctree() time is now only about 15% of the original mount time,
  and btrfs_read_block_groups() takes only about 7% of the original
  open_ctree() execution time.
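  (That is, 143106.9 us out of 911890.2 us, and 63395.75 us out of
  911890.2 us.)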

The reason is pretty obvious when considering how many tree blocks need
to be read from disk:

          |  Extent tree  |  Skinny bg tree  |
----------+---------------+------------------+
  nodes   |            55 |                1 |
  leaves  |          1025 |                7 |
  total   |          1080 |                8 |

Not to mention that tree block readahead works pretty well for the bg
tree, as we will read every item, while readahead for the extent tree is
just a disaster, as all block groups are scattered across the whole
extent tree.

Changelog:
(v2~v3 used the original bg-tree design)
v2:
- Rebase to v5.4-rc1
  Minor conflicts due to code moved to block-group.c
- Fix a bug where some block groups would not be loaded at mount time
  It's a bug in the refactor patch, not exposed by the previous round of
  tests.
- Add a new patch to remove a dead check
- Update benchmark to NVMe based result
  Hardware upgrades are not always a good thing for benchmarks.

v3:
- Add a separate patch to fix possible memory leak
- Add Reviewed-by tag for the refactor patch
- Reword the refactor patch to mention the change to using
  btrfs_fs_incompat()

v4:
- Move to skinny bg tree design
  Since we're going to introduce an incompatible feature, it's better to
  take it to the extreme.

- Refactor block group item operations into their own functions
  So that the block group item operations hide the skinny bg item and
  regular bg item handling from callers, making the code much easier to
  read.

- Add tree checker patch

Qu Wenruo (7):
  btrfs: block-group: Don't set the wrong READA flag for
    btrfs_read_block_groups()
  btrfs: block-group: Refactor how we read one block group item
  btrfs: block-group: Refactor how we delete one block group item
  btrfs: block-group: Refactor how we insert a block group item
  btrfs: block-group: Rename write_one_cahce_group()
  btrfs: Introduce new incompat feature, SKINNY_BG_TREE, to hugely
    reduce mount time
  btrfs: tree-checker: Introduce checks for skinny block group item

 fs/btrfs/block-group.c          | 317 +++++++++++++++++++++++++-------
 fs/btrfs/block-rsv.c            |   2 +
 fs/btrfs/ctree.h                |   4 +-
 fs/btrfs/disk-io.c              |  18 +-
 fs/btrfs/sysfs.c                |   2 +
 fs/btrfs/tree-checker.c         |  56 +++++-
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  11 ++
 8 files changed, 335 insertions(+), 76 deletions(-)

-- 
2.26.2


* [PATCH v4 1/7] btrfs: block-group: Don't set the wrong READA flag for btrfs_read_block_groups()
From: Qu Wenruo @ 2020-05-04 23:58 UTC (permalink / raw)
  To: linux-btrfs

Regular block group items in the extent tree are scattered across the
huge tree, thus forward readahead makes no sense.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5627f53ca115..42dae6473354 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1985,7 +1985,6 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->reada = READA_FORWARD;
 
 	cache_gen = btrfs_super_cache_generation(info->super_copy);
 	if (btrfs_test_opt(info, SPACE_CACHE) &&
-- 
2.26.2


* [PATCH v4 2/7] btrfs: block-group: Refactor how we read one block group item
From: Qu Wenruo @ 2020-05-04 23:58 UTC (permalink / raw)
  To: linux-btrfs

Structure btrfs_block_group has the following members which are
currently read from the on-disk block group item and key:
- Length
  From item key.
- Used
- Flags
  From block group item.

However, for the incoming skinny block group tree, we are going to read
those members from different sources.

This patch refactors such reads by:
- Don't initialize btrfs_block_group::length at allocation
  Callers should initialize it manually.
  Also, to avoid a possibly missed initialization (well, there are only
  two callers), add an extra ASSERT() in btrfs_add_block_group_cache().

- Refactor length/used/flags initialization into one function
  The new function, read_block_group_item(), will handle the
  initialization of those members.

- Use btrfs_block_group::length to replace key::offset
  Since the skinny block group item will have a different meaning for its
  key offset.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 47 ++++++++++++++++++++++++++++--------------
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 42dae6473354..57483ea07d51 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -161,6 +161,8 @@ static int btrfs_add_block_group_cache(struct btrfs_fs_info *info,
 	struct rb_node *parent = NULL;
 	struct btrfs_block_group *cache;
 
+	ASSERT(block_group->length != 0);
+
 	spin_lock(&info->block_group_cache_lock);
 	p = &info->block_group_cache_tree.rb_node;
 
@@ -1768,7 +1770,7 @@ static void link_block_group(struct btrfs_block_group *cache)
 }
 
 static struct btrfs_block_group *btrfs_create_block_group_cache(
-		struct btrfs_fs_info *fs_info, u64 start, u64 size)
+		struct btrfs_fs_info *fs_info, u64 start)
 {
 	struct btrfs_block_group *cache;
 
@@ -1784,7 +1786,6 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
 	}
 
 	cache->start = start;
-	cache->length = size;
 
 	cache->fs_info = fs_info;
 	cache->full_stripe_len = btrfs_full_stripe_len(fs_info, start);
@@ -1864,25 +1865,44 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
+static int read_block_group_item(struct btrfs_block_group *cache,
+				 struct btrfs_path *path,
+				 const struct btrfs_key *key)
+{
+	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_block_group_item bgi;
+	int slot = path->slots[0];
+
+	cache->length = key->offset;
+
+	read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
+			sizeof(bgi));
+	cache->used = btrfs_stack_block_group_used(&bgi);
+	cache->flags = btrfs_stack_block_group_flags(&bgi);
+
+	return 0;
+}
+
 static int read_one_block_group(struct btrfs_fs_info *info,
 				struct btrfs_path *path,
 				const struct btrfs_key *key,
 				int need_clear)
 {
-	struct extent_buffer *leaf = path->nodes[0];
 	struct btrfs_block_group *cache;
 	struct btrfs_space_info *space_info;
-	struct btrfs_block_group_item bgi;
 	const bool mixed = btrfs_fs_incompat(info, MIXED_GROUPS);
-	int slot = path->slots[0];
 	int ret;
 
 	ASSERT(key->type == BTRFS_BLOCK_GROUP_ITEM_KEY);
 
-	cache = btrfs_create_block_group_cache(info, key->objectid, key->offset);
+	cache = btrfs_create_block_group_cache(info, key->objectid);
 	if (!cache)
 		return -ENOMEM;
 
+	ret = read_block_group_item(cache, path, key);
+	if (ret < 0)
+		goto error;
+
 	if (need_clear) {
 		/*
 		 * When we mount with old space cache, we need to
@@ -1897,10 +1917,6 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 		if (btrfs_test_opt(info, SPACE_CACHE))
 			cache->disk_cache_state = BTRFS_DC_CLEAR;
 	}
-	read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
-			   sizeof(bgi));
-	cache->used = btrfs_stack_block_group_used(&bgi);
-	cache->flags = btrfs_stack_block_group_flags(&bgi);
 	if (!mixed && ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
 	    (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
 			btrfs_err(info,
@@ -1928,15 +1944,15 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	 * are empty, and we can just add all the space in and be done with it.
 	 * This saves us _a_lot_ of time, particularly in the full case.
 	 */
-	if (key->offset == cache->used) {
+	if (cache->length == cache->used) {
 		cache->last_byte_to_unpin = (u64)-1;
 		cache->cached = BTRFS_CACHE_FINISHED;
 		btrfs_free_excluded_extents(cache);
 	} else if (cache->used == 0) {
 		cache->last_byte_to_unpin = (u64)-1;
 		cache->cached = BTRFS_CACHE_FINISHED;
-		add_new_free_space(cache, key->objectid,
-				   key->objectid + key->offset);
+		add_new_free_space(cache, cache->start,
+				   cache->start + cache->length);
 		btrfs_free_excluded_extents(cache);
 	}
 
@@ -1946,7 +1962,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 		goto error;
 	}
 	trace_btrfs_add_block_group(info, cache, 0);
-	btrfs_update_space_info(info, cache->flags, key->offset,
+	btrfs_update_space_info(info, cache->flags, cache->length,
 				cache->used, cache->bytes_super, &space_info);
 
 	cache->space_info = space_info;
@@ -2093,10 +2109,11 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 
 	btrfs_set_log_full_commit(trans);
 
-	cache = btrfs_create_block_group_cache(fs_info, chunk_offset, size);
+	cache = btrfs_create_block_group_cache(fs_info, chunk_offset);
 	if (!cache)
 		return -ENOMEM;
 
+	cache->length = size;
 	cache->used = bytes_used;
 	cache->flags = type;
 	cache->last_byte_to_unpin = (u64)-1;
-- 
2.26.2


* [PATCH v4 3/7] btrfs: block-group: Refactor how we delete one block group item
From: Qu Wenruo @ 2020-05-04 23:58 UTC (permalink / raw)
  To: linux-btrfs

Deleting a block group item is pretty straightforward: just delete the
item pointed to by the key.

However, it will not be that straightforward for the incoming skinny
block group item.

So refactor the block group item deletion into a new function,
remove_block_group_item(), which also makes the already lengthy
btrfs_remove_block_group() a little shorter.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 57483ea07d51..914b1c2064ac 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -865,11 +865,34 @@ static void clear_incompat_bg_bits(struct btrfs_fs_info *fs_info, u64 flags)
 	}
 }
 
+static int remove_block_group_item(struct btrfs_trans_handle *trans,
+				   struct btrfs_path *path,
+				   struct btrfs_block_group *block_group)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_root *root;
+	struct btrfs_key key;
+	int ret;
+
+	root = fs_info->extent_root;
+	key.objectid = block_group->start;
+	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+	key.offset = block_group->length;
+
+	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+	if (ret > 0)
+		ret = -ENOENT;
+	if (ret < 0)
+		return ret;
+
+	ret = btrfs_del_item(trans, root, path);
+	return ret;
+}
+
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     u64 group_start, struct extent_map *em)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_path *path;
 	struct btrfs_block_group *block_group;
 	struct btrfs_free_cluster *cluster;
@@ -1067,10 +1090,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	spin_unlock(&block_group->space_info->lock);
 
-	key.objectid = block_group->start;
-	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
-	key.offset = block_group->length;
-
 	mutex_lock(&fs_info->chunk_mutex);
 	spin_lock(&block_group->lock);
 	block_group->removed = 1;
@@ -1109,16 +1128,10 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	/* Once for the block groups rbtree */
 	btrfs_put_block_group(block_group);
 
-	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
-	if (ret > 0)
-		ret = -EIO;
+	ret = remove_block_group_item(trans, path, block_group);
 	if (ret < 0)
 		goto out;
 
-	ret = btrfs_del_item(trans, root, path);
-	if (ret)
-		goto out;
-
 	if (remove_em) {
 		struct extent_map_tree *em_tree;
 
-- 
2.26.2


* [PATCH v4 4/7] btrfs: block-group: Refactor how we insert a block group item
From: Qu Wenruo @ 2020-05-04 23:58 UTC (permalink / raw)
  To: linux-btrfs

Currently, inserting a block group item is pretty straightforward: fill
the block group item structure and insert it into the extent tree.

However, the incoming skinny block group feature is going to change this,
so refactor the insert into a new function, insert_block_group_item(), to
make the incoming feature easier to add.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 41 +++++++++++++++++++++++++----------------
 1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 914b1c2064ac..93e7835ca79d 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2068,13 +2068,32 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	return ret;
 }
 
+static int insert_block_group_item(struct btrfs_trans_handle *trans,
+				   struct btrfs_block_group *block_group)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_block_group_item bgi;
+	struct btrfs_root *root;
+	struct btrfs_key key;
+
+	spin_lock(&block_group->lock);
+	btrfs_set_stack_block_group_used(&bgi, block_group->used);
+	btrfs_set_stack_block_group_chunk_objectid(&bgi,
+				BTRFS_FIRST_CHUNK_TREE_OBJECTID);
+	btrfs_set_stack_block_group_flags(&bgi, block_group->flags);
+	key.objectid = block_group->start;
+	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+	key.offset = block_group->length;
+	spin_unlock(&block_group->lock);
+
+	root = fs_info->extent_root;
+	return btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
+}
+
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_block_group *block_group;
-	struct btrfs_root *extent_root = fs_info->extent_root;
-	struct btrfs_block_group_item item;
-	struct btrfs_key key;
 	int ret = 0;
 
 	if (!trans->can_flush_pending_bgs)
@@ -2087,21 +2106,11 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 		if (ret)
 			goto next;
 
-		spin_lock(&block_group->lock);
-		btrfs_set_stack_block_group_used(&item, block_group->used);
-		btrfs_set_stack_block_group_chunk_objectid(&item,
-				BTRFS_FIRST_CHUNK_TREE_OBJECTID);
-		btrfs_set_stack_block_group_flags(&item, block_group->flags);
-		key.objectid = block_group->start;
-		key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
-		key.offset = block_group->length;
-		spin_unlock(&block_group->lock);
-
-		ret = btrfs_insert_item(trans, extent_root, &key, &item,
-					sizeof(item));
+		ret = insert_block_group_item(trans, block_group);
 		if (ret)
 			btrfs_abort_transaction(trans, ret);
-		ret = btrfs_finish_chunk_alloc(trans, key.objectid, key.offset);
+		ret = btrfs_finish_chunk_alloc(trans, block_group->start,
+					block_group->length);
 		if (ret)
 			btrfs_abort_transaction(trans, ret);
 		add_block_group_free_space(trans, block_group);
-- 
2.26.2


* [PATCH v4 5/7] btrfs: block-group: Rename write_one_cahce_group()
From: Qu Wenruo @ 2020-05-04 23:58 UTC (permalink / raw)
  To: linux-btrfs

The name of this function contains the word "cache", which is left over
from the era when btrfs_block_group was called btrfs_block_group_cache.

Now this "cache" doesn't match any thing, and we have better namings for
functions like read/insert/remove_block_group_item().

So rename this function to update_block_group_item().

Since we're here, also reorder the local variable declarations into
reverse Christmas tree order, and rename @extent_root to @root for later
reuse.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 93e7835ca79d..46846749b631 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2346,23 +2346,24 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group *cache)
 	spin_unlock(&sinfo->lock);
 }
 
-static int write_one_cache_group(struct btrfs_trans_handle *trans,
-				 struct btrfs_path *path,
-				 struct btrfs_block_group *cache)
+static int update_block_group_item(struct btrfs_trans_handle *trans,
+				   struct btrfs_path *path,
+				   struct btrfs_block_group *cache)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
-	int ret;
-	struct btrfs_root *extent_root = fs_info->extent_root;
-	unsigned long bi;
-	struct extent_buffer *leaf;
 	struct btrfs_block_group_item bgi;
+	struct extent_buffer *leaf;
+	struct btrfs_root *root;
 	struct btrfs_key key;
+	unsigned long bi;
+	int ret;
 
 	key.objectid = cache->start;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	key.offset = cache->length;
+	root = fs_info->extent_root;
 
-	ret = btrfs_search_slot(trans, extent_root, &key, path, 0, 1);
+	ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
 	if (ret) {
 		if (ret > 0)
 			ret = -ENOENT;
@@ -2674,7 +2675,7 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
 			}
 		}
 		if (!ret) {
-			ret = write_one_cache_group(trans, path, cache);
+			ret = update_block_group_item(trans, path, cache);
 			/*
 			 * Our block group might still be attached to the list
 			 * of new block groups in the transaction handle of some
@@ -2823,7 +2824,7 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans)
 			}
 		}
 		if (!ret) {
-			ret = write_one_cache_group(trans, path, cache);
+			ret = update_block_group_item(trans, path, cache);
 			/*
 			 * One of the free space endio workers might have
 			 * created a new block group while updating a free space
@@ -2840,7 +2841,7 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans)
 			if (ret == -ENOENT) {
 				wait_event(cur_trans->writer_wait,
 				   atomic_read(&cur_trans->num_writers) == 1);
-				ret = write_one_cache_group(trans, path, cache);
+				ret = update_block_group_item(trans, path, cache);
 			}
 			if (ret)
 				btrfs_abort_transaction(trans, ret);
-- 
2.26.2


* [PATCH v4 6/7] btrfs: Introduce new incompat feature, SKINNY_BG_TREE, to hugely reduce mount time
From: Qu Wenruo @ 2020-05-04 23:58 UTC (permalink / raw)
  To: linux-btrfs

The overall idea of the new SKINNY_BG_TREE feature is:
- Put block group items into a separate tree
- Reduce the size of the block group item to a minimum

This brings one obvious enhancement:
- Reduce the mount time of a large fs to a minimum

With all this good news, there are still some things that need extra
attention:
- An extra chunk lookup is needed when reading a block group
  This is needed to get the block group length and flags, but it's an
  in-memory lookup, thus way faster than a regular tree search.

- We can no longer rely on key->offset for various block group context
  Only key->objectid keeps its old meaning.
  The good news is that previous refactors have already killed such call
  sites.
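
To make the on-disk difference concrete, a summary of the two layouts as
implemented by this patchset:

  Regular block group item (extent tree):
    key  = (bg->start, BLOCK_GROUP_ITEM_KEY, bg->length)
    item = struct btrfs_block_group_item { used, chunk_objectid, flags }

  Skinny block group item (bg tree):
    key  = (bg->start, SKINNY_BLOCK_GROUP_ITEM_KEY, bg->used)
    item = empty; length and flags are taken from the corresponding chunk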

For now, if an existing fs wants to take advantage of the SKINNY_BG_TREE
feature, btrfs-progs will provide an offline conversion tool.

[[Benchmark]]
Physical device:	NVMe SSD
VM device:		VirtIO block device, backed by a sparse file
Nodesize:		4K  (to bump up tree height)
Extent data size:	4M
Fs size used:		1T

All file extents on disk are 4M in size, preallocated to reduce space
usage (as the VM uses a loopback block device backed by a sparse file).

Without the patchset:
Using an ftrace function graph:

 7)               |  open_ctree [btrfs]() {
 7)               |    btrfs_read_block_groups [btrfs]() {
 7) @ 805851.8 us |    }
 7) @ 911890.2 us |  }

 btrfs_read_block_groups() takes 88% of the total mount time.

With the patchset, using the -O skinny-bg-tree mkfs option:

 5)               |  open_ctree [btrfs]() {
 5)               |    btrfs_read_block_groups [btrfs]() {
 5) * 63395.75 us |    }
 5) @ 143106.9 us |  }

  open_ctree() time is now only about 15% of the original mount time,
  and btrfs_read_block_groups() takes only about 7% of the original
  open_ctree() execution time.

The reason is pretty obvious when considering how many tree blocks need
to be read from disk:

          |  Extent tree  |  Skinny bg tree  |
----------------------------------------------
  nodes   |            55 |                1 |
  leaves  |          1025 |                7 |
  total   |          1080 |                8 |

The skinny bg tree needs about half the tree blocks of the regular bg
tree design, and less than 1% of what the extent tree implementation
needs.

Not to mention that tree block readahead works pretty well for both the
regular and the skinny bg tree, as we will read every item, while
readahead for the extent tree is just a disaster, as all block groups are
scattered across the whole extent tree.

The reduction in mount time is already obvious even on a super fast NVMe
disk with memory cache.
It would be even more obvious if the fs were on spinning rust.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c          | 168 +++++++++++++++++++++++++++++---
 fs/btrfs/block-rsv.c            |   2 +
 fs/btrfs/ctree.h                |   4 +-
 fs/btrfs/disk-io.c              |  18 +++-
 fs/btrfs/sysfs.c                |   2 +
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  11 +++
 7 files changed, 190 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 46846749b631..fad9118c2c1e 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -874,6 +874,28 @@ static int remove_block_group_item(struct btrfs_trans_handle *trans,
 	struct btrfs_key key;
 	int ret;
 
+	if (btrfs_fs_incompat(fs_info, SKINNY_BG_TREE)) {
+		root = fs_info->bg_root;
+		key.objectid = block_group->start;
+		key.type = BTRFS_SKINNY_BLOCK_GROUP_ITEM_KEY;
+		key.offset = (u64)-1;
+
+		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+		if (ret == 0) {
+			btrfs_release_path(path);
+			return -EUCLEAN;
+		}
+		if (ret < 0)
+			return ret;
+		ret = btrfs_previous_item(root, path, key.objectid, key.type);
+		if (ret > 0)
+			ret = -ENOENT;
+		if (ret < 0)
+			return ret;
+		ret = btrfs_del_item(trans, root, path);
+		return ret;
+	}
+
 	root = fs_info->extent_root;
 	key.objectid = block_group->start;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
@@ -1883,9 +1905,38 @@ static int read_block_group_item(struct btrfs_block_group *cache,
 				 const struct btrfs_key *key)
 {
 	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_fs_info *fs_info = leaf->fs_info;
 	struct btrfs_block_group_item bgi;
 	int slot = path->slots[0];
 
+	if (btrfs_fs_incompat(fs_info, SKINNY_BG_TREE)) {
+		struct extent_map *em;
+
+		ASSERT(key->type == BTRFS_SKINNY_BLOCK_GROUP_ITEM_KEY);
+		em = btrfs_get_chunk_map(fs_info, key->objectid, 1);
+		if (IS_ERR(em))
+			return PTR_ERR(em);
+		if (em->start != key->objectid) {
+			btrfs_err(fs_info, "no chunk found for bytenr %llu",
+				key->objectid);
+			free_extent_map(em);
+			return -EUCLEAN;
+		}
+		/* key->offset is used space */
+		if (key->offset > em->len) {
+			btrfs_err(fs_info,
+				"invalid bg used, have %llu expect [0, %llu]",
+				  key->offset, em->len);
+			free_extent_map(em);
+			return -EUCLEAN;
+		}
+		cache->length = em->len;
+		cache->flags = em->map_lookup->type;
+		cache->used = key->offset;
+		free_extent_map(em);
+		return 0;
+	}
+	ASSERT(key->type == BTRFS_BLOCK_GROUP_ITEM_KEY);
 	cache->length = key->offset;
 
 	read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
@@ -1906,7 +1957,10 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	const bool mixed = btrfs_fs_incompat(info, MIXED_GROUPS);
 	int ret;
 
-	ASSERT(key->type == BTRFS_BLOCK_GROUP_ITEM_KEY);
+	ASSERT((!btrfs_fs_incompat(info, SKINNY_BG_TREE) &&
+		key->type == BTRFS_BLOCK_GROUP_ITEM_KEY) ||
+	       (btrfs_fs_incompat(info, SKINNY_BG_TREE) &&
+		key->type == BTRFS_SKINNY_BLOCK_GROUP_ITEM_KEY));
 
 	cache = btrfs_create_block_group_cache(info, key->objectid);
 	if (!cache)
@@ -1998,6 +2052,54 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	return ret;
 }
 
+static int read_skinny_block_groups(struct btrfs_fs_info *fs_info,
+				    struct btrfs_path *path,
+				    int need_clear)
+{
+	struct btrfs_root *root = fs_info->bg_root;
+	struct btrfs_key key;
+	int ret;
+
+	key.objectid = 0;
+	key.type = 0;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		return ret;
+	if (ret == 0) {
+		btrfs_err(fs_info,
+			  "found invalid key (0, 0, 0) in block group tree");
+		ret = -EUCLEAN;
+		goto out;
+	}
+
+	while (1) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (key.type != BTRFS_SKINNY_BLOCK_GROUP_ITEM_KEY) {
+			btrfs_err(fs_info,
+		"found invalid key (%llu, %u, %llu) in block group tree",
+				  key.objectid, key.type, key.offset);
+			ret = -EUCLEAN;
+			goto out;
+		}
+
+		ret = read_one_block_group(fs_info, path, &key, need_clear);
+		if (ret < 0)
+			goto out;
+		ret = btrfs_next_item(root, path);
+		if (ret < 0)
+			goto out;
+		if (ret > 0) {
+			ret = 0;
+			goto out;
+		}
+	}
+out:
+	btrfs_release_path(path);
+	return ret;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -2022,20 +2124,27 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	if (btrfs_test_opt(info, CLEAR_CACHE))
 		need_clear = 1;
 
-	while (1) {
-		ret = find_first_block_group(info, path, &key);
-		if (ret > 0)
-			break;
-		if (ret != 0)
-			goto error;
+	if (btrfs_fs_incompat(info, SKINNY_BG_TREE)) {
+		path->reada = READA_FORWARD;
+		ret = read_skinny_block_groups(info, path, need_clear);
+	} else {
+		while (1) {
+			ret = find_first_block_group(info, path, &key);
+			if (ret > 0)
+				break;
+			if (ret != 0)
+				goto error;
 
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-		ret = read_one_block_group(info, path, &key, need_clear);
-		if (ret < 0)
-			goto error;
-		key.objectid += key.offset;
-		key.offset = 0;
-		btrfs_release_path(path);
+			btrfs_item_key_to_cpu(path->nodes[0], &key,
+						path->slots[0]);
+			ret = read_one_block_group(info, path, &key,
+						need_clear);
+			if (ret < 0)
+				goto error;
+			key.objectid += key.offset;
+			key.offset = 0;
+			btrfs_release_path(path);
+		}
 	}
 
 	rcu_read_lock();
@@ -2076,6 +2185,16 @@ static int insert_block_group_item(struct btrfs_trans_handle *trans,
 	struct btrfs_root *root;
 	struct btrfs_key key;
 
+	if (btrfs_fs_incompat(fs_info, SKINNY_BG_TREE)) {
+		spin_lock(&block_group->lock);
+		key.objectid = block_group->start;
+		key.type = BTRFS_SKINNY_BLOCK_GROUP_ITEM_KEY;
+		key.offset = block_group->used;
+		spin_unlock(&block_group->lock);
+
+		root = fs_info->bg_root;
+		return btrfs_insert_item(trans, root, &key, 0, 0);
+	}
 	spin_lock(&block_group->lock);
 	btrfs_set_stack_block_group_used(&bgi, block_group->used);
 	btrfs_set_stack_block_group_chunk_objectid(&bgi,
@@ -2358,6 +2477,27 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	unsigned long bi;
 	int ret;
 
+	if (btrfs_fs_incompat(fs_info, SKINNY_BG_TREE)) {
+		key.objectid = cache->start;
+		key.type = BTRFS_SKINNY_BLOCK_GROUP_ITEM_KEY;
+		key.offset = (u64)-1;
+		root = fs_info->bg_root;
+
+		ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
+		if (ret == 0)
+			ret = -EUCLEAN;
+		if (ret < 0)
+			goto fail;
+		ret = btrfs_previous_item(root, path, key.objectid, key.type);
+		if (ret > 0)
+			ret = -ENOENT;
+		if (ret < 0)
+			goto fail;
+		key.offset = cache->used;
+		btrfs_set_item_key_safe(fs_info, path, &key);
+		btrfs_release_path(path);
+		return 0;
+	}
 	key.objectid = cache->start;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	key.offset = cache->length;
diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index dbba53e712e6..f0d36f1c8c6e 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -432,6 +432,8 @@ void btrfs_init_global_block_rsv(struct btrfs_fs_info *fs_info)
 	fs_info->tree_root->block_rsv = &fs_info->global_block_rsv;
 	if (fs_info->quota_root)
 		fs_info->quota_root->block_rsv = &fs_info->global_block_rsv;
+	if (fs_info->bg_root)
+		fs_info->bg_root->block_rsv = &fs_info->global_block_rsv;
 	fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv;
 
 	btrfs_update_global_block_rsv(fs_info);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 03ea7370aea7..e40131d662b8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -302,7 +302,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
-	 BTRFS_FEATURE_INCOMPAT_RAID1C34)
+	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
+	 BTRFS_FEATURE_INCOMPAT_SKINNY_BG_TREE)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
@@ -582,6 +583,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *quota_root;
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *free_space_root;
+	struct btrfs_root *bg_root;
 
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8ad451695d49..56675d3cd23a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1516,6 +1516,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	kfree(fs_info->balance_ctl);
 	kfree(fs_info->delayed_root);
 	btrfs_put_root(fs_info->extent_root);
+	btrfs_put_root(fs_info->bg_root);
 	btrfs_put_root(fs_info->tree_root);
 	btrfs_put_root(fs_info->chunk_root);
 	btrfs_put_root(fs_info->dev_root);
@@ -1531,7 +1532,6 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	kvfree(fs_info);
 }
 
-
 struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
 				     struct btrfs_key *location,
 				     bool check_ref)
@@ -1560,6 +1560,10 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
 		return btrfs_grab_root(fs_info->free_space_root) ?
 			fs_info->free_space_root : ERR_PTR(-ENOENT);
+	if (location->objectid == BTRFS_BLOCK_GROUP_TREE_OBJECTID)
+		return btrfs_grab_root(fs_info->bg_root) ?
+			fs_info->bg_root : ERR_PTR(-ENOENT);
+
 again:
 	root = btrfs_lookup_fs_root(fs_info, location->objectid);
 	if (root) {
@@ -1983,6 +1987,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
 	if (free_chunk_root)
 		free_root_extent_buffers(info->chunk_root);
 	free_root_extent_buffers(info->free_space_root);
+	free_root_extent_buffers(info->bg_root);
 }
 
 void btrfs_put_root(struct btrfs_root *root)
@@ -2267,6 +2272,17 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
 	fs_info->extent_root = root;
 
+	if (btrfs_fs_incompat(fs_info, SKINNY_BG_TREE)) {
+		location.objectid = BTRFS_BLOCK_GROUP_TREE_OBJECTID;
+		root = btrfs_read_tree_root(tree_root, &location);
+		if (IS_ERR(root)) {
+			ret = PTR_ERR(root);
+			goto out;
+		}
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->bg_root = root;
+	}
+
 	location.objectid = BTRFS_DEV_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index a39bff64ff24..d71bd4949636 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -261,6 +261,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_INCOMPAT(skinny_bg_tree, SKINNY_BG_TREE);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -276,6 +277,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
+	BTRFS_FEAT_ATTR_PTR(skinny_bg_tree),
 	NULL
 };
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index e6b6cb0f8bc6..dc8675d892a4 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -290,6 +290,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
+#define BTRFS_FEATURE_INCOMPAT_SKINNY_BG_TREE	(1ULL << 12)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 8e322e2c7e78..c1b63a63db9f 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -48,6 +48,9 @@
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* store SKINNY_BLOCK_GROUPs in a separate tree */
+#define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -182,6 +185,14 @@
  */
 #define BTRFS_BLOCK_GROUP_ITEM_KEY 192
 
+/*
+ * A more optimized structure for block group item.
+ * key.objectid = bg->start
+ * key.offset = bg->used
+ * No item needed, all other info can be extracted from corresponding chunk.
+ */
+#define BTRFS_SKINNY_BLOCK_GROUP_ITEM_KEY 193
+
 /*
  * Every block group is represented in the free space tree by a free space info
  * item, which stores some accounting information. It is keyed on
-- 
2.26.2


* [PATCH v4 7/7] btrfs: tree-checker: Introduce checks for skinny block group item
From: Qu Wenruo @ 2020-05-04 23:58 UTC (permalink / raw)
  To: linux-btrfs

The checks are pretty simple, just three of them:
- Tree owner check
  Skinny block group item should only exist in block group tree.

- Used num bytes (key->offset)
  To allow for possible later chunk size changes, we use a super large
  value (64T) as the threshold to reduce false alerts.

- Duplicated skinny block group items
  There shouldn't be duplicated items for the same block group.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/tree-checker.c | 56 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 517b44300a05..f5eb91b4139f 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -621,11 +621,18 @@ static void block_group_err(const struct extent_buffer *eb, int slot,
 	vaf.fmt = fmt;
 	vaf.va = &args;
 
-	btrfs_crit(fs_info,
-	"corrupt %s: root=%llu block=%llu slot=%d bg_start=%llu bg_len=%llu, %pV",
-		btrfs_header_level(eb) == 0 ? "leaf" : "node",
-		btrfs_header_owner(eb), btrfs_header_bytenr(eb), slot,
-		key.objectid, key.offset, &vaf);
+	if (key.type == BTRFS_BLOCK_GROUP_ITEM_KEY)
+		btrfs_crit(fs_info,
+"corrupt %s: root=%llu block=%llu slot=%d bg_start=%llu bg_len=%llu, %pV",
+			btrfs_header_level(eb) == 0 ? "leaf" : "node",
+			btrfs_header_owner(eb), btrfs_header_bytenr(eb), slot,
+			key.objectid, key.offset, &vaf);
+	else
+		btrfs_crit(fs_info,
+"corrupt %s: root=%llu block=%llu slot=%d bg_start=%llu, %pV",
+			btrfs_header_level(eb) == 0 ? "leaf" : "node",
+			btrfs_header_owner(eb), btrfs_header_bytenr(eb), slot,
+			key.objectid, &vaf);
 	va_end(args);
 }
 
@@ -698,6 +705,42 @@ static int check_block_group_item(struct extent_buffer *leaf,
 	return 0;
 }
 
+static int check_skinny_block_group_item(struct extent_buffer *leaf,
+					 struct btrfs_key *key,
+					 struct btrfs_key *prev_key, int slot)
+{
+	if (btrfs_header_owner(leaf) != BTRFS_BLOCK_GROUP_TREE_OBJECTID) {
+		block_group_err(leaf, slot,
+	"bad tree owner for skinny block group item, have %llu expect %llu",
+				btrfs_header_owner(leaf),
+				BTRFS_BLOCK_GROUP_TREE_OBJECTID);
+		return -EUCLEAN;
+	}
+	/*
+	 * We can't do a direct key->offset check, but at least we shouldn't
+	 * have anything larger than the max chunk size.
+	 * Here we use something way larger than that to ensure we only catch
+	 * obviously wrong results.
+	 */
+	if (key->offset >= SZ_64T) {
+		block_group_err(leaf, slot,
+			"too large used num bytes, have %llu expect [0, %llu)",
+				key->offset, BTRFS_MAX_DATA_CHUNK_SIZE);
+		return -EUCLEAN;
+	}
+
+	/*
+	 * There shouldn't be any duplicated skinny bg items
+	 * (same objectid but different offset)
+	 */
+	if (slot > 0 && prev_key->objectid == key->objectid) {
+		block_group_err(leaf, slot,
+			"duplicated skinny block group items found");
+		return -EUCLEAN;
+	}
+	return 0;
+}
+
 __printf(4, 5)
 __cold
 static void chunk_err(const struct extent_buffer *leaf,
@@ -1511,6 +1554,9 @@ static int check_leaf_item(struct extent_buffer *leaf,
 	case BTRFS_BLOCK_GROUP_ITEM_KEY:
 		ret = check_block_group_item(leaf, key, slot);
 		break;
+	case BTRFS_SKINNY_BLOCK_GROUP_ITEM_KEY:
+		ret = check_skinny_block_group_item(leaf, key, prev_key, slot);
+		break;
 	case BTRFS_CHUNK_ITEM_KEY:
 		chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
 		ret = check_leaf_chunk_item(leaf, chunk, key, slot);
-- 
2.26.2


* Re: [PATCH v4 3/7] btrfs: block-group: Refactor how we delete one block group item
From: Johannes Thumshirn @ 2020-05-05  8:47 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 05/05/2020 01:58, Qu Wenruo wrote:
> Deleting a block group item is pretty straightforward: just delete the
> item pointed to by the key.
> 
> However, it will not be that straightforward for the incoming skinny
> block group item.
> 
> So refactor the block group item deletion into a new function,
> remove_block_group_item(), which also makes the already lengthy
> btrfs_remove_block_group() a little shorter.

I think this patch is useful even without the skinny_bg feature.

> +static int remove_block_group_item(struct btrfs_trans_handle *trans,
> +				   struct btrfs_path *path,
> +				   struct btrfs_block_group *block_group)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_root *root;

Tiny nitpick, why not:

	struct btrfs_root *root = fs_info->extent_root;

Like it was in btrfs_remove_block_group()?

Anyways looks good to me,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

* Re: [PATCH v4 3/7] btrfs: block-group: Refactor how we delete one block group item
From: Qu Wenruo @ 2020-05-05  8:55 UTC (permalink / raw)
  To: Johannes Thumshirn, Qu Wenruo, linux-btrfs


On 2020/5/5 4:47 PM, Johannes Thumshirn wrote:
> On 05/05/2020 01:58, Qu Wenruo wrote:
>> Deleting a block group item is pretty straightforward: just delete the
>> item pointed to by the key.
>>
>> However, it will not be that straightforward for the incoming skinny
>> block group item.
>>
>> So refactor the block group item deletion into a new function,
>> remove_block_group_item(), which also makes the already lengthy
>> btrfs_remove_block_group() a little shorter.
> 
> I think this patch is useful even without the skinny_bg feature.
> 
>> +static int remove_block_group_item(struct btrfs_trans_handle *trans,
>> +				   struct btrfs_path *path,
>> +				   struct btrfs_block_group *block_group)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_root *root;
> 
> Tiny nitpick, why not:
> 
> 	struct btrfs_root *root = fs_info->extent_root;
> 
> Like it was in btrfs_remove_block_group()?

That's mostly for the skinny_bg_tree (6th) patch, as in that patch the
skinny_bg_tree feature picks bg_root instead of the extent root.

So I didn't initialize root here, but leave it to be assigned at the same
point as the key.

Thanks,
Qu

> 
> Anyways looks good to me,
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 


* Re: [PATCH v4 3/7] btrfs: block-group: Refactor how we delete one block group item
From: Johannes Thumshirn @ 2020-05-05  8:56 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 05/05/2020 10:55, Qu Wenruo wrote:
> That's mostly for the skinny_bg_tree (6th) patch, as in that patch the
> skinny_bg_tree feature picks bg_root instead of the extent root.
> 
> So I didn't initialize root here, but leave it to be assigned at the same
> point as the key.


Ah OK, I'm not that far into the patchset yet.

* Re: [PATCH v4 4/7] btrfs: block-group: Refactor how we insert a block group item
From: Johannes Thumshirn @ 2020-05-05  8:59 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good and useful on its own as well, IMHO.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

* Re: [PATCH v4 5/7] btrfs: block-group: Rename write_one_cahce_group()
From: Johannes Thumshirn @ 2020-05-05  9:11 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good on its own as well,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Btw, you have a typo in the Subject line: s/cahce/cache/

* Re: [PATCH v4 6/7] btrfs: Introduce new incompat feature, SKINNY_BG_TREE, to hugely reduce mount time
From: Johannes Thumshirn @ 2020-05-05 19:10 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 05/05/2020 01:58, Qu Wenruo wrote:
> int btrfs_read_block_groups(struct btrfs_fs_info *info)
>   {
>   	struct btrfs_path *path;
> @@ -2022,20 +2124,27 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>   	if (btrfs_test_opt(info, CLEAR_CACHE))
>   		need_clear = 1;
>   
> -	while (1) {
> -		ret = find_first_block_group(info, path, &key);
> -		if (ret > 0)
> -			break;
> -		if (ret != 0)
> -			goto error;
> +	if (btrfs_fs_incompat(info, SKINNY_BG_TREE)) {
> +		path->reada = READA_FORWARD;
> +		ret = read_skinny_block_groups(info, path, need_clear);
> +	} else {
> +		while (1) {
> +			ret = find_first_block_group(info, path, &key);
> +			if (ret > 0)
> +				break;
> +			if (ret != 0)
> +				goto error;
>   
> -		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> -		ret = read_one_block_group(info, path, &key, need_clear);
> -		if (ret < 0)
> -			goto error;
> -		key.objectid += key.offset;
> -		key.offset = 0;
> -		btrfs_release_path(path);
> +			btrfs_item_key_to_cpu(path->nodes[0], &key,
> +						path->slots[0]);
> +			ret = read_one_block_group(info, path, &key,
> +						need_clear);
> +			if (ret < 0)
> +				goto error;
> +			key.objectid += key.offset;
> +			key.offset = 0;
> +			btrfs_release_path(path);
> +		}
>   	}
>   

It might be worth considering moving the 'else' path into a function of
its own, similar to what you did with read_skinny_block_groups().

* Re: [PATCH v4 5/7] btrfs: block-group: Rename write_one_cahce_group()
From: David Sterba @ 2020-05-11 19:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, May 05, 2020 at 07:58:23AM +0800, Qu Wenruo wrote:
> The name of this function contains the word "cache", which is left over
> from the era when btrfs_block_group was called btrfs_block_group_cache.
> 
> Now this "cache" doesn't match anything, and we have better names for
> functions like read/insert/remove_block_group_item().
> 
> So rename this function to update_block_group_item().
> 
> Since we're here, also reorder the local variable declarations into
> reverse Christmas tree order, and rename @extent_root to @root for later
> reuse.

Please don't do that. The reverse Christmas tree is Somebody Else's
Coding Style, and you introduce unrelated changes where only a function
rename is expected. All reverted.

* Re: [PATCH v4 0/7] btrfs: Introduce new incompat feature SKINNY_BG_TREE to hugely reduce mount time
From: David Sterba @ 2020-05-11 19:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, May 05, 2020 at 07:58:18AM +0800, Qu Wenruo wrote:
> Qu Wenruo (7):
>   btrfs: block-group: Don't set the wrong READA flag for
>     btrfs_read_block_groups()
>   btrfs: block-group: Refactor how we read one block group item
>   btrfs: block-group: Refactor how we delete one block group item
>   btrfs: block-group: Refactor how we insert a block group item
>   btrfs: block-group: Rename write_one_cahce_group()
>   btrfs: Introduce new incompat feature, SKINNY_BG_TREE, to hugely
>     reduce mount time
>   btrfs: tree-checker: Introduce checks for skinny block group item

Patches 1-5 added to misc-next as they're good cleanups.
