linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
@ 2019-10-08  4:49 Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 1/7] btrfs-progs: Refactor excluded extent functions to use fs_info Qu Wenruo
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/bg_tree
Which is based on v5.2.2 tag.

This patchset provides the needed user space infrastructure for BG_TREE
feature.

Since it's an new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
is needed to convert existing fs (unmounted) to new format.

Now btrfstune can convert regular extent tree fs to bg tree fs to
improve mount time.

For the performance improvement, please check the kernel patchset cover
letter or the last patch.
(SPOILER ALERT: It's super fast)

Changelog:
v2:
- Rebase to v5.2.2 tag
- Add btrfstune ability to convert existing fs to BG_TREE feature

Qu Wenruo (7):
  btrfs-progs: Refactor excluded extent functions to use fs_info
  btrfs-progs: Refactor btrfs_read_block_groups()
  btrfs-progs: Enable read-write ability for 'bg_tree' feature
  btrfs-progs: mkfs: Introduce -O bg-tree
  btrfs-progs: dump-tree/dump-super: Introduce support for bg tree
  btrfs-progs: check: Introduce support for bg-tree feature
  btrfs-progs: btrfstune: Allow to enable bg-tree feature offline

 Documentation/btrfstune.asciidoc |   6 +
 btrfstune.c                      |  44 +++-
 check/main.c                     |   7 +-
 check/mode-lowmem.c              |   9 +-
 cmds/inspect-dump-super.c        |   3 +-
 cmds/inspect-dump-tree.c         |   5 +
 common/fsfeatures.c              |   6 +
 ctree.h                          |  18 +-
 disk-io.c                        |  29 ++-
 extent-tree.c                    | 364 +++++++++++++++++++++++--------
 mkfs/common.c                    |   6 +-
 mkfs/main.c                      |   9 +
 print-tree.c                     |   3 +
 transaction.c                    |   1 +
 14 files changed, 402 insertions(+), 108 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 1/7] btrfs-progs: Refactor excluded extent functions to use fs_info
  2019-10-08  4:49 [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature Qu Wenruo
@ 2019-10-08  4:49 ` Qu Wenruo
  2019-10-08  9:22   ` Johannes Thumshirn
  2019-10-17  2:16   ` Anand Jain
  2019-10-08  4:49 ` [PATCH v2 2/7] btrfs-progs: Refactor btrfs_read_block_groups() Qu Wenruo
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 28+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

The following functions are just using @root to reach fs_info:
- exclude_super_stripes
- free_excluded_extents
- add_excluded_extent

Refactor them to use fs_info directly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 check/main.c  |  4 ++--
 ctree.h       |  4 ++--
 extent-tree.c | 20 ++++++++++----------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/check/main.c b/check/main.c
index c2d0f3949c5e..94ffab46cb70 100644
--- a/check/main.c
+++ b/check/main.c
@@ -5552,7 +5552,7 @@ static int check_space_cache(struct btrfs_root *root)
 		}
 
 		if (btrfs_fs_compat_ro(root->fs_info, FREE_SPACE_TREE)) {
-			ret = exclude_super_stripes(root, cache);
+			ret = exclude_super_stripes(root->fs_info, cache);
 			if (ret) {
 				errno = -ret;
 				fprintf(stderr,
@@ -5561,7 +5561,7 @@ static int check_space_cache(struct btrfs_root *root)
 				continue;
 			}
 			ret = load_free_space_tree(root->fs_info, cache);
-			free_excluded_extents(root, cache);
+			free_excluded_extents(root->fs_info, cache);
 			if (ret < 0) {
 				errno = -ret;
 				fprintf(stderr,
diff --git a/ctree.h b/ctree.h
index 0d12563b7261..8c7b3cb40151 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2568,9 +2568,9 @@ int btrfs_record_file_extent(struct btrfs_trans_handle *trans,
 			      u64 num_bytes);
 int btrfs_free_block_group(struct btrfs_trans_handle *trans,
 			   struct btrfs_fs_info *fs_info, u64 bytenr, u64 len);
-void free_excluded_extents(struct btrfs_root *root,
+void free_excluded_extents(struct btrfs_fs_info *fs_info,
 			   struct btrfs_block_group_cache *cache);
-int exclude_super_stripes(struct btrfs_root *root,
+int exclude_super_stripes(struct btrfs_fs_info *fs_info,
 			  struct btrfs_block_group_cache *cache);
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
 		       struct btrfs_fs_info *info, u64 start, u64 end);
diff --git a/extent-tree.c b/extent-tree.c
index 932af2c644bd..19d1ea0df570 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2716,7 +2716,7 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 		if (btrfs_chunk_readonly(info, cache->key.objectid))
 			cache->ro = 1;
 
-		exclude_super_stripes(root, cache);
+		exclude_super_stripes(info, cache);
 
 		ret = update_space_info(info, cache->flags, found_key.offset,
 					btrfs_block_group_used(&cache->item),
@@ -2760,7 +2760,7 @@ btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used, u64 type,
 	cache->flags = type;
 	btrfs_set_block_group_flags(&cache->item, type);
 
-	exclude_super_stripes(fs_info->extent_root, cache);
+	exclude_super_stripes(fs_info, cache);
 	ret = update_space_info(fs_info, cache->flags, size, bytes_used,
 				&cache->space_info);
 	BUG_ON(ret);
@@ -3551,16 +3551,16 @@ int btrfs_record_file_extent(struct btrfs_trans_handle *trans,
 }
 
 
-static int add_excluded_extent(struct btrfs_root *root,
+static int add_excluded_extent(struct btrfs_fs_info *fs_info,
 			       u64 start, u64 num_bytes)
 {
 	u64 end = start + num_bytes - 1;
-	set_extent_bits(&root->fs_info->pinned_extents,
+	set_extent_bits(&fs_info->pinned_extents,
 			start, end, EXTENT_UPTODATE);
 	return 0;
 }
 
-void free_excluded_extents(struct btrfs_root *root,
+void free_excluded_extents(struct btrfs_fs_info *fs_info,
 			   struct btrfs_block_group_cache *cache)
 {
 	u64 start, end;
@@ -3568,11 +3568,11 @@ void free_excluded_extents(struct btrfs_root *root,
 	start = cache->key.objectid;
 	end = start + cache->key.offset - 1;
 
-	clear_extent_bits(&root->fs_info->pinned_extents,
+	clear_extent_bits(&fs_info->pinned_extents,
 			  start, end, EXTENT_UPTODATE);
 }
 
-int exclude_super_stripes(struct btrfs_root *root,
+int exclude_super_stripes(struct btrfs_fs_info *fs_info,
 			  struct btrfs_block_group_cache *cache)
 {
 	u64 bytenr;
@@ -3583,7 +3583,7 @@ int exclude_super_stripes(struct btrfs_root *root,
 	if (cache->key.objectid < BTRFS_SUPER_INFO_OFFSET) {
 		stripe_len = BTRFS_SUPER_INFO_OFFSET - cache->key.objectid;
 		cache->bytes_super += stripe_len;
-		ret = add_excluded_extent(root, cache->key.objectid,
+		ret = add_excluded_extent(fs_info, cache->key.objectid,
 					  stripe_len);
 		if (ret)
 			return ret;
@@ -3591,7 +3591,7 @@ int exclude_super_stripes(struct btrfs_root *root,
 
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
 		bytenr = btrfs_sb_offset(i);
-		ret = btrfs_rmap_block(root->fs_info,
+		ret = btrfs_rmap_block(fs_info,
 				       cache->key.objectid, bytenr,
 				       &logical, &nr, &stripe_len);
 		if (ret)
@@ -3618,7 +3618,7 @@ int exclude_super_stripes(struct btrfs_root *root,
 			}
 
 			cache->bytes_super += len;
-			ret = add_excluded_extent(root, start, len);
+			ret = add_excluded_extent(fs_info, start, len);
 			if (ret) {
 				kfree(logical);
 				return ret;
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 2/7] btrfs-progs: Refactor btrfs_read_block_groups()
  2019-10-08  4:49 [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 1/7] btrfs-progs: Refactor excluded extent functions to use fs_info Qu Wenruo
@ 2019-10-08  4:49 ` Qu Wenruo
  2019-10-17  3:23   ` Anand Jain
  2019-10-08  4:49 ` [PATCH v2 3/7] btrfs-progs: Enable read-write ability for 'bg_tree' feature Qu Wenruo
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

This patch does the following refactor:
- Refactor parameter from @root to @fs_info

- Refactor the large loop body into another function
  Now we have a helper function, read_one_block_group(), to handle
  block group cache and space info related routine.

- Refactor the return value
  Even we have the code handling ret > 0 from find_first_block_group(),
  it never works, as when there is no more block group,
  find_first_block_group() just return -ENOENT other than 1.

  This is super confusing, it's almost a mircle it even works.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 ctree.h       |   2 +-
 disk-io.c     |   9 ++-
 extent-tree.c | 160 ++++++++++++++++++++++++++++----------------------
 3 files changed, 97 insertions(+), 74 deletions(-)

diff --git a/ctree.h b/ctree.h
index 8c7b3cb40151..2899de358613 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2550,7 +2550,7 @@ int update_space_info(struct btrfs_fs_info *info, u64 flags,
 		      u64 total_bytes, u64 bytes_used,
 		      struct btrfs_space_info **space_info);
 int btrfs_free_block_groups(struct btrfs_fs_info *info);
-int btrfs_read_block_groups(struct btrfs_root *root);
+int btrfs_read_block_groups(struct btrfs_fs_info *info);
 struct btrfs_block_group_cache *
 btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used, u64 type,
 		      u64 chunk_offset, u64 size);
diff --git a/disk-io.c b/disk-io.c
index be44eead5cef..8978f0cb60c7 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -983,14 +983,17 @@ int btrfs_setup_all_roots(struct btrfs_fs_info *fs_info, u64 root_tree_bytenr,
 	fs_info->last_trans_committed = generation;
 	if (extent_buffer_uptodate(fs_info->extent_root->node) &&
 	    !(flags & OPEN_CTREE_NO_BLOCK_GROUPS)) {
-		ret = btrfs_read_block_groups(fs_info->tree_root);
+		ret = btrfs_read_block_groups(fs_info);
 		/*
 		 * If we don't find any blockgroups (ENOENT) we're either
 		 * restoring or creating the filesystem, where it's expected,
 		 * anything else is error
 		 */
-		if (ret != -ENOENT)
-			return -EIO;
+		if (ret < 0 && ret != -ENOENT) {
+			errno = -ret;
+			error("failed to read block groups: %m");
+			return ret;
+		}
 	}
 
 	key.objectid = BTRFS_FS_TREE_OBJECTID;
diff --git a/extent-tree.c b/extent-tree.c
index 19d1ea0df570..9713d627764c 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2607,6 +2607,13 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 	return 0;
 }
 
+/*
+ * Find a block group which starts >= @key->objectid in extent tree.
+ *
+ * Return 0 for found
+ * Retrun >0 for not found
+ * Return <0 for error
+ */
 static int find_first_block_group(struct btrfs_root *root,
 		struct btrfs_path *path, struct btrfs_key *key)
 {
@@ -2636,36 +2643,95 @@ static int find_first_block_group(struct btrfs_root *root,
 			return 0;
 		path->slots[0]++;
 	}
-	ret = -ENOENT;
+	ret = 1;
 error:
 	return ret;
 }
 
-int btrfs_read_block_groups(struct btrfs_root *root)
+/*
+ * Helper function to read out one BLOCK_GROUP_ITEM and insert it into
+ * block group cache.
+ *
+ * Return 0 if nothing wrong (either insert the bg cache or skip 0 sized bg)
+ * Return <0 for error.
+ */
+static int read_one_block_group(struct btrfs_fs_info *fs_info,
+				 struct btrfs_path *path)
 {
-	struct btrfs_path *path;
-	int ret;
-	int bit;
-	struct btrfs_block_group_cache *cache;
-	struct btrfs_fs_info *info = root->fs_info;
+	struct extent_io_tree *block_group_cache = &fs_info->block_group_cache;
+	struct extent_buffer *leaf = path->nodes[0];
 	struct btrfs_space_info *space_info;
-	struct extent_io_tree *block_group_cache;
+	struct btrfs_block_group_cache *cache;
 	struct btrfs_key key;
-	struct btrfs_key found_key;
-	struct extent_buffer *leaf;
+	int slot = path->slots[0];
+	int bit = 0;
+	int ret;
 
-	block_group_cache = &info->block_group_cache;
+	btrfs_item_key_to_cpu(leaf, &key, slot);
+	ASSERT(key.type == BTRFS_BLOCK_GROUP_ITEM_KEY);
+
+	/*
+	 * Skip 0 sized block group, don't insert them into block
+	 * group cache tree, as its length is 0, it won't get
+	 * freed at close_ctree() time.
+	 */
+	if (key.offset == 0)
+		return 0;
+
+	cache = kzalloc(sizeof(*cache), GFP_NOFS);
+	if (!cache)
+		return -ENOMEM;
+	read_extent_buffer(leaf, &cache->item,
+			   btrfs_item_ptr_offset(leaf, slot),
+			   sizeof(cache->item));
+	memcpy(&cache->key, &key, sizeof(key));
+	cache->cached = 0;
+	cache->pinned = 0;
+	cache->flags = btrfs_block_group_flags(&cache->item);
+	if (cache->flags & BTRFS_BLOCK_GROUP_DATA) {
+		bit = BLOCK_GROUP_DATA;
+	} else if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
+		bit = BLOCK_GROUP_SYSTEM;
+	} else if (cache->flags & BTRFS_BLOCK_GROUP_METADATA) {
+		bit = BLOCK_GROUP_METADATA;
+	}
+	set_avail_alloc_bits(fs_info, cache->flags);
+	if (btrfs_chunk_readonly(fs_info, cache->key.objectid))
+		cache->ro = 1;
+	exclude_super_stripes(fs_info, cache);
+
+	ret = update_space_info(fs_info, cache->flags, cache->key.offset,
+				btrfs_block_group_used(&cache->item),
+				&space_info);
+	if (ret < 0) {
+		free(cache);
+		return ret;
+	}
+	cache->space_info = space_info;
+
+	set_extent_bits(block_group_cache, cache->key.objectid,
+			cache->key.objectid + cache->key.offset - 1,
+			bit | EXTENT_LOCKED);
+	set_state_private(block_group_cache, cache->key.objectid,
+			  (unsigned long)cache);
+	return 0;
+}
 
-	root = info->extent_root;
+int btrfs_read_block_groups(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_path path;
+	struct btrfs_root *root;
+	int ret;
+	struct btrfs_key key;
+
+	root = fs_info->extent_root;
 	key.objectid = 0;
 	key.offset = 0;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
-	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
+	btrfs_init_path(&path);
 
 	while(1) {
-		ret = find_first_block_group(root, path, &key);
+		ret = find_first_block_group(root, &path, &key);
 		if (ret > 0) {
 			ret = 0;
 			goto error;
@@ -2673,67 +2739,21 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 		if (ret != 0) {
 			goto error;
 		}
-		leaf = path->nodes[0];
-		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
 
-		cache = kzalloc(sizeof(*cache), GFP_NOFS);
-		if (!cache) {
-			ret = -ENOMEM;
+		ret = read_one_block_group(fs_info, &path);
+		if (ret < 0)
 			goto error;
-		}
 
-		read_extent_buffer(leaf, &cache->item,
-				   btrfs_item_ptr_offset(leaf, path->slots[0]),
-				   sizeof(cache->item));
-		memcpy(&cache->key, &found_key, sizeof(found_key));
-		cache->cached = 0;
-		cache->pinned = 0;
-		key.objectid = found_key.objectid + found_key.offset;
-		if (found_key.offset == 0)
+		if (key.offset == 0)
 			key.objectid++;
-		btrfs_release_path(path);
-
-		/*
-		 * Skip 0 sized block group, don't insert them into block
-		 * group cache tree, as its length is 0, it won't get
-		 * freed at close_ctree() time.
-		 */
-		if (found_key.offset == 0) {
-			free(cache);
-			continue;
-		}
-
-		cache->flags = btrfs_block_group_flags(&cache->item);
-		bit = 0;
-		if (cache->flags & BTRFS_BLOCK_GROUP_DATA) {
-			bit = BLOCK_GROUP_DATA;
-		} else if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
-			bit = BLOCK_GROUP_SYSTEM;
-		} else if (cache->flags & BTRFS_BLOCK_GROUP_METADATA) {
-			bit = BLOCK_GROUP_METADATA;
-		}
-		set_avail_alloc_bits(info, cache->flags);
-		if (btrfs_chunk_readonly(info, cache->key.objectid))
-			cache->ro = 1;
-
-		exclude_super_stripes(info, cache);
-
-		ret = update_space_info(info, cache->flags, found_key.offset,
-					btrfs_block_group_used(&cache->item),
-					&space_info);
-		BUG_ON(ret);
-		cache->space_info = space_info;
-
-		/* use EXTENT_LOCKED to prevent merging */
-		set_extent_bits(block_group_cache, found_key.objectid,
-				found_key.objectid + found_key.offset - 1,
-				bit | EXTENT_LOCKED);
-		set_state_private(block_group_cache, found_key.objectid,
-				  (unsigned long)cache);
+		else
+			key.objectid = key.objectid + key.offset;
+		btrfs_release_path(&path);
 	}
 	ret = 0;
 error:
-	btrfs_free_path(path);
+	btrfs_release_path(&path);
 	return ret;
 }
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 3/7] btrfs-progs: Enable read-write ability for 'bg_tree' feature
  2019-10-08  4:49 [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 1/7] btrfs-progs: Refactor excluded extent functions to use fs_info Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 2/7] btrfs-progs: Refactor btrfs_read_block_groups() Qu Wenruo
@ 2019-10-08  4:49 ` Qu Wenruo
  2019-10-17  4:56   ` Anand Jain
  2019-10-08  4:49 ` [PATCH v2 4/7] btrfs-progs: mkfs: Introduce -O bg-tree Qu Wenruo
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

Allow btrfs-progs to open, read and write 'bg_tree' enabled fs.

The modification itself is not large, as block groups items are only
used at 4 timing:
1) open_ctree()
   We only need to populate fs_info->bg_root and read block group items
   from fs_info->bg_root.
   The obvious change is, we don't need to do btrfs_search_slot() for
   each block group item, but btrfs_next_item() is enough.

   This should hugely reduce open_ctree() execution duration.

2) btrfs_commit_transaction()
   We need to write back dirty block group items back to bg_root.

   The modification here is to insert new block group item if we can't
   find one existing in bg_root, and delete the old one in extent tree
   if we're converting to bg_tree feature.

3) btrfs_make_block_group()
   Just change @root for btrfs_insert_item() from @extent_root to
   @bg_root for fs with that feature.

   This modification needs extra handling for converting case, where
   block group items can be either in extent tree or bg tree.

4) free_block_group_item()
   Just delete the block group item in extent tree.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 ctree.h       |  11 +++-
 disk-io.c     |  20 +++++++
 extent-tree.c | 152 +++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 168 insertions(+), 15 deletions(-)

diff --git a/ctree.h b/ctree.h
index 2899de358613..c2a18c8ab72f 100644
--- a/ctree.h
+++ b/ctree.h
@@ -89,6 +89,9 @@ struct btrfs_free_space_ctl;
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* store BLOCK_GROUP_ITEMS in a seperate tree */
+#define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -490,6 +493,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID    (1ULL << 10)
+#define BTRFS_FEATURE_INCOMPAT_BG_TREE		(1ULL << 11)
 
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
 
@@ -513,7 +517,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES |		\
-	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
+	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID |		\
+	 BTRFS_FEATURE_INCOMPAT_BG_TREE)
 
 /*
  * A leaf is full of items. offset and size tell us where to find
@@ -1123,6 +1128,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *quota_root;
 	struct btrfs_root *free_space_root;
 	struct btrfs_root *uuid_root;
+	struct btrfs_root *bg_root;
 
 	struct rb_root fs_root_tree;
 
@@ -1175,6 +1181,9 @@ struct btrfs_fs_info {
 	unsigned int avoid_sys_chunk_alloc:1;
 	unsigned int finalize_on_close:1;
 
+	/* Converting from bg in extent tree to bg tree */
+	unsigned int convert_to_bg_tree:1;
+
 	int transaction_aborted;
 
 	int (*free_extent_hook)(struct btrfs_fs_info *fs_info,
diff --git a/disk-io.c b/disk-io.c
index 8978f0cb60c7..38248aa895b8 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -716,6 +716,8 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
 		return fs_info->free_space_root ? fs_info->free_space_root :
 						ERR_PTR(-ENOENT);
+	if (location->objectid == BTRFS_BLOCK_GROUP_TREE_OBJECTID)
+		return fs_info->bg_root ? fs_info->bg_root : ERR_PTR(-ENOENT);
 
 	BUG_ON(location->objectid == BTRFS_TREE_RELOC_OBJECTID ||
 	       location->offset != (u64)-1);
@@ -768,6 +770,7 @@ struct btrfs_fs_info *btrfs_new_fs_info(int writable, u64 sb_bytenr)
 	fs_info->quota_root = calloc(1, sizeof(struct btrfs_root));
 	fs_info->free_space_root = calloc(1, sizeof(struct btrfs_root));
 	fs_info->uuid_root = calloc(1, sizeof(struct btrfs_root));
+	fs_info->bg_root = calloc(1, sizeof(struct btrfs_root));
 	fs_info->super_copy = calloc(1, BTRFS_SUPER_INFO_SIZE);
 
 	if (!fs_info->tree_root || !fs_info->extent_root ||
@@ -930,6 +933,21 @@ int btrfs_setup_all_roots(struct btrfs_fs_info *fs_info, u64 root_tree_bytenr,
 		return ret;
 	fs_info->extent_root->track_dirty = 1;
 
+	if (btrfs_fs_incompat(fs_info, BG_TREE)) {
+		ret = setup_root_or_create_block(fs_info, flags,
+					fs_info->bg_root,
+					BTRFS_BLOCK_GROUP_TREE_OBJECTID, "bg");
+		if (ret < 0) {
+			error("Couldn't setup bg tree");
+			return ret;
+		}
+		fs_info->bg_root->track_dirty = 1;
+		fs_info->bg_root->ref_cows = 0;
+	} else {
+		free(fs_info->bg_root);
+		fs_info->bg_root = NULL;
+	}
+
 	ret = find_and_setup_root(root, fs_info, BTRFS_DEV_TREE_OBJECTID,
 				  fs_info->dev_root);
 	if (ret) {
@@ -1012,6 +1030,8 @@ void btrfs_release_all_roots(struct btrfs_fs_info *fs_info)
 		free_extent_buffer(fs_info->free_space_root->node);
 	if (fs_info->quota_root)
 		free_extent_buffer(fs_info->quota_root->node);
+	if (fs_info->bg_root)
+		free_extent_buffer(fs_info->bg_root->node);
 	if (fs_info->csum_root)
 		free_extent_buffer(fs_info->csum_root->node);
 	if (fs_info->dev_root)
diff --git a/extent-tree.c b/extent-tree.c
index 9713d627764c..cb3d7a1add0f 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1528,22 +1528,68 @@ static int write_one_cache_group(struct btrfs_trans_handle *trans,
 				 struct btrfs_path *path,
 				 struct btrfs_block_group_cache *cache)
 {
+	bool is_bg_tree = btrfs_fs_incompat(trans->fs_info, BG_TREE);
 	int ret;
-	struct btrfs_root *extent_root = trans->fs_info->extent_root;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_root *root;
 	unsigned long bi;
 	struct extent_buffer *leaf;
 
-	ret = btrfs_search_slot(trans, extent_root, &cache->key, path, 0, 1);
+	if (is_bg_tree)
+		root = fs_info->bg_root;
+	else
+		root = fs_info->extent_root;
+
+	ret = btrfs_search_slot(trans, root, &cache->key, path, 0, 1);
 	if (ret < 0)
-		goto fail;
-	BUG_ON(ret);
+		goto out;
 
-	leaf = path->nodes[0];
-	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
-	write_extent_buffer(leaf, &cache->item, bi, sizeof(cache->item));
-	btrfs_mark_buffer_dirty(leaf);
-	btrfs_release_path(path);
-fail:
+	if (ret == 0) {
+		/* Update existing bg */
+		leaf = path->nodes[0];
+		bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
+		write_extent_buffer(leaf, &cache->item, bi, sizeof(cache->item));
+		btrfs_mark_buffer_dirty(leaf);
+		btrfs_release_path(path);
+	} else {
+		btrfs_release_path(path);
+
+		/*
+		 * Insert new bg item
+		 *
+		 * This only happens for bg_tree feature
+		 */
+		if (!is_bg_tree) {
+			error("can't find block group item for bytenr %llu",
+			      cache->key.objectid);
+			ret = -ENOENT;
+			goto out;
+		}
+		ret = btrfs_insert_item(trans, root, &cache->key, &cache->item,
+					sizeof(cache->item));
+		if (ret < 0)
+			goto out;
+
+		/* Also delete the existing one in next tree if needed */
+		if (fs_info->convert_to_bg_tree) {
+			ret = btrfs_search_slot(trans, fs_info->extent_root,
+						&cache->key, path, -1, 1);
+			if (ret < 0) {
+				btrfs_release_path(path);
+				goto out;
+			}
+			/* Good, already converted */
+			if (ret > 0) {
+				ret = 0;
+				btrfs_release_path(path);
+				goto out;
+			}
+			/* Delete old block group item in extent tree */
+			ret = btrfs_del_item(trans, fs_info->extent_root, path);
+			btrfs_release_path(path);
+		}
+	}
+out:
 	if (ret)
 		return ret;
 	return 0;
@@ -2717,14 +2763,66 @@ static int read_one_block_group(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+static int read_block_group_tree(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_root *root = fs_info->bg_root;
+	struct btrfs_key key = { 0 };
+	struct btrfs_path path;
+	int ret;
+
+	btrfs_init_path(&path);
+	ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
+	if (ret < 0) {
+		errno = -ret;
+		error("failed to search block group tree: %m");
+		return ret;
+	}
+	if (ret == 0)
+		goto invalid_key;
+
+	while (1) {
+		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
+		if (key.type != BTRFS_BLOCK_GROUP_ITEM_KEY)
+			goto invalid_key;
+
+		ret = read_one_block_group(fs_info, &path);
+		if (ret < 0) {
+			errno = -ret;
+			error("failed to read one block group: %m");
+			goto out;
+		}
+		ret = btrfs_next_item(root, &path);
+		if (ret < 0) {
+			errno = -ret;
+			error("failed to search block group tree: %m");
+			goto out;
+		}
+		if (ret > 0) {
+			ret = 0;
+			break;
+		}
+	}
+out:
+	btrfs_release_path(&path);
+	return ret;
+
+invalid_key:
+	error("invalid key (%llu, %u, %llu) found in block group tree",
+	      key.objectid, key.type, key.offset);
+	btrfs_release_path(&path);
+	return -EUCLEAN;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_path path;
-	struct btrfs_root *root;
+	struct btrfs_root *root = fs_info->extent_root;
 	int ret;
 	struct btrfs_key key;
 
-	root = fs_info->extent_root;
+	if (btrfs_fs_incompat(fs_info, BG_TREE))
+		return read_block_group_tree(fs_info);
+
 	key.objectid = 0;
 	key.offset = 0;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
@@ -2804,12 +2902,17 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 			   u64 type, u64 chunk_offset, u64 size)
 {
 	int ret;
-	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_root *root;
 	struct btrfs_block_group_cache *cache;
 
+	if (btrfs_fs_incompat(fs_info, BG_TREE))
+		root = fs_info->bg_root;
+	else
+		root = fs_info->extent_root;
+
 	cache = btrfs_add_block_group(fs_info, bytes_used, type, chunk_offset,
 				      size);
-	ret = btrfs_insert_item(trans, extent_root, &cache->key, &cache->item,
+	ret = btrfs_insert_item(trans, root, &cache->key, &cache->item,
 				sizeof(cache->item));
 	BUG_ON(ret);
 
@@ -2943,8 +3046,16 @@ static int free_block_group_item(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
+	/* Using bg tree only */
+	if (btrfs_fs_incompat(fs_info, BG_TREE) && !fs_info->convert_to_bg_tree)
+		goto bg_tree;
+
 	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
 	if (ret > 0) {
+		if (btrfs_fs_incompat(fs_info, BG_TREE)) {
+			btrfs_release_path(path);
+			goto bg_tree;
+		}
 		ret = -ENOENT;
 		goto out;
 	}
@@ -2952,6 +3063,19 @@ static int free_block_group_item(struct btrfs_trans_handle *trans,
 		goto out;
 
 	ret = btrfs_del_item(trans, root, path);
+	goto out;
+
+bg_tree:
+	root = fs_info->bg_root;
+	ret = btrfs_search_slot(trans, fs_info->bg_root, &key, path, -1, 1);
+	if (ret < 0)
+		goto out;
+	if (ret > 0) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = btrfs_del_item(trans, root, path);
+
 out:
 	btrfs_free_path(path);
 	return ret;
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 4/7] btrfs-progs: mkfs: Introduce -O bg-tree
  2019-10-08  4:49 [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature Qu Wenruo
                   ` (2 preceding siblings ...)
  2019-10-08  4:49 ` [PATCH v2 3/7] btrfs-progs: Enable read-write ability for 'bg_tree' feature Qu Wenruo
@ 2019-10-08  4:49 ` Qu Wenruo
  2019-10-08  8:16   ` [PATCH v2.1 " Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 5/7] btrfs-progs: dump-tree/dump-super: Introduce support for bg tree Qu Wenruo
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

This allow mkfs.btrfs to create a btrfs with bg-tree feature.

This patch introduce a global function, btrfs_convert_to_bg_tree() in
extent-tree.c, to do the work.

The workflow is pretty simple:
- Create a new tree block for bg tree
- Set the BG_TREE feature for superblock
- Set the fs_info->convert_to_bg_tree flag
- Mark all block group items as dirty
- Commit transaction
  * With fs_info->convert_to_bg_tree set, we will try to delete the
    BLOCK_GROUP_ITEM in extent tree first, then write the new
    BLOCK_GROUP_ITEM into bg tree.

This btrfs_convert_to_bg_tree() will be used in mkfs after the basic fs
is created.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 common/fsfeatures.c |  6 ++++++
 ctree.h             |  1 +
 extent-tree.c       | 38 ++++++++++++++++++++++++++++++++++++++
 mkfs/common.c       |  6 ++++--
 mkfs/main.c         |  9 +++++++++
 transaction.c       |  1 +
 6 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index 50934bd161b0..b9bd70a4b3b6 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -86,6 +86,12 @@ static const struct btrfs_fs_feature {
 		VERSION_TO_STRING2(4,0),
 		NULL, 0,
 		"no explicit hole extents for files" },
+	{ "bg-tree", BTRFS_FEATURE_INCOMPAT_BG_TREE,
+		"bg_tree",
+		VERSION_TO_STRING2(5, 0),
+		NULL, 0,
+		NULL, 0,
+		"store block group items in dedicated tree" },
 	/* Keep this one last */
 	{ "list-all", BTRFS_FEATURE_LIST_ALL, NULL }
 };
diff --git a/ctree.h b/ctree.h
index c2a18c8ab72f..3d3992487a53 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2856,5 +2856,6 @@ int btrfs_read_file(struct btrfs_root *root, u64 ino, u64 start, int len,
 
 /* extent-tree.c */
 int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unsigned long nr);
+int btrfs_convert_to_bg_tree(struct btrfs_trans_handle *trans);
 
 #endif
diff --git a/extent-tree.c b/extent-tree.c
index cb3d7a1add0f..87550ef80e37 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1524,6 +1524,44 @@ int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	return __btrfs_mod_ref(trans, root, buf, record_parent, 0);
 }
 
+int btrfs_convert_to_bg_tree(struct btrfs_trans_handle *trans)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_block_group_cache *bg;
+	struct btrfs_root *bg_root;
+	u64 features = btrfs_super_incompat_flags(fs_info->super_copy);
+	int ret;
+
+	/* create bg tree first */
+	bg_root = btrfs_create_tree(trans, fs_info, BTRFS_BLOCK_GROUP_TREE_OBJECTID);
+	if (IS_ERR(bg_root)) {
+		ret = PTR_ERR(bg_root);
+		errno = -ret;
+		error("failed to create bg tree: %m");
+		return ret;
+	}
+	fs_info->bg_root = bg_root;
+	fs_info->bg_root->track_dirty = 1;
+	fs_info->bg_root->ref_cows = 0;
+
+	/* set BG_TREE feature and mark the fs into bg_tree convert status */
+	btrfs_set_super_incompat_flags(fs_info->super_copy,
+			features | BTRFS_FEATURE_INCOMPAT_BG_TREE);
+	fs_info->convert_to_bg_tree = 1;
+
+	/*
+	 * Mark all block groups dirty so they will get converted to bg tree at
+	 * commit transaction time
+	 */
+	for (bg = btrfs_lookup_first_block_group(fs_info, 0); bg;
+	     bg = btrfs_lookup_first_block_group(fs_info,
+				bg->key.objectid + bg->key.offset))
+		set_extent_bits(&fs_info->block_group_cache, bg->key.objectid,
+				bg->key.objectid + bg->key.offset - 1,
+				BLOCK_GROUP_DIRTY);
+	return 0;
+}
+
 static int write_one_cache_group(struct btrfs_trans_handle *trans,
 				 struct btrfs_path *path,
 				 struct btrfs_block_group_cache *cache)
diff --git a/mkfs/common.c b/mkfs/common.c
index caca5e707233..876193838612 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -111,6 +111,9 @@ static int btrfs_create_tree_root(int fd, struct btrfs_mkfs_config *cfg,
 	return ret;
 }
 
+/* These features will not be set in the temporary fs */
+#define MASKED_FEATURES		(~(BTRFS_FEATURE_INCOMPAT_BG_TREE))
+
 /*
  * @fs_uuid - if NULL, generates a UUID, returns back the new filesystem UUID
  *
@@ -204,7 +207,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_super_csum_type(&super, BTRFS_CSUM_TYPE_CRC32);
 	btrfs_set_super_chunk_root_generation(&super, 1);
 	btrfs_set_super_cache_generation(&super, -1);
-	btrfs_set_super_incompat_flags(&super, cfg->features);
+	btrfs_set_super_incompat_flags(&super, cfg->features & MASKED_FEATURES);
 	if (cfg->label)
 		__strncpy_null(super.label, cfg->label, BTRFS_LABEL_SIZE - 1);
 
@@ -824,4 +827,3 @@ int test_minimum_size(const char *file, u64 min_dev_size)
 	return 0;
 }
 
-
diff --git a/mkfs/main.c b/mkfs/main.c
index b752da13aba9..e05fe8cc2f10 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1299,6 +1299,15 @@ raid_groups:
 		warning(
 	"unable to create uuid tree, will be created after mount: %d", ret);
 
+	/* Bg tree are converted after the fs is created */
+	if (mkfs_cfg.features & BTRFS_FEATURE_INCOMPAT_BG_TREE) {
+		ret = btrfs_convert_to_bg_tree(trans);
+		if (ret < 0) {
+			errno = -ret;
+			error(
+		"bg-tree feature will not be enabled, due to error: %m");
+		}
+	}
 	ret = btrfs_commit_transaction(trans, root);
 	if (ret) {
 		error("unable to commit transaction: %d", ret);
diff --git a/transaction.c b/transaction.c
index 45bb9e1f9de6..5de967fb015f 100644
--- a/transaction.c
+++ b/transaction.c
@@ -225,6 +225,7 @@ commit_tree:
 	root->commit_root = NULL;
 	fs_info->running_transaction = NULL;
 	fs_info->last_trans_committed = transid;
+	fs_info->convert_to_bg_tree = 0;
 	list_for_each_entry(sinfo, &fs_info->space_info, list) {
 		if (sinfo->bytes_reserved) {
 			warning(
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 5/7] btrfs-progs: dump-tree/dump-super: Introduce support for bg tree
  2019-10-08  4:49 [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature Qu Wenruo
                   ` (3 preceding siblings ...)
  2019-10-08  4:49 ` [PATCH v2 4/7] btrfs-progs: mkfs: Introduce -O bg-tree Qu Wenruo
@ 2019-10-08  4:49 ` Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 6/7] btrfs-progs: check: Introduce support for bg-tree feature Qu Wenruo
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

Just a new tree called BLOCK_GROUP_TREE.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 cmds/inspect-dump-super.c | 3 ++-
 cmds/inspect-dump-tree.c  | 5 +++++
 print-tree.c              | 3 +++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/cmds/inspect-dump-super.c b/cmds/inspect-dump-super.c
index 65fb3506eac6..414d9c2317d8 100644
--- a/cmds/inspect-dump-super.c
+++ b/cmds/inspect-dump-super.c
@@ -229,7 +229,8 @@ static struct readable_flag_entry incompat_flags_array[] = {
 	DEF_INCOMPAT_FLAG_ENTRY(RAID56),
 	DEF_INCOMPAT_FLAG_ENTRY(SKINNY_METADATA),
 	DEF_INCOMPAT_FLAG_ENTRY(NO_HOLES),
-	DEF_INCOMPAT_FLAG_ENTRY(METADATA_UUID)
+	DEF_INCOMPAT_FLAG_ENTRY(METADATA_UUID),
+	DEF_INCOMPAT_FLAG_ENTRY(BG_TREE)
 };
 static const int incompat_flags_num = sizeof(incompat_flags_array) /
 				      sizeof(struct readable_flag_entry);
diff --git a/cmds/inspect-dump-tree.c b/cmds/inspect-dump-tree.c
index e50130a4a161..def5bea39a2b 100644
--- a/cmds/inspect-dump-tree.c
+++ b/cmds/inspect-dump-tree.c
@@ -150,6 +150,7 @@ static u64 treeid_from_string(const char *str, const char **end)
 		{ "CSUM", BTRFS_CSUM_TREE_OBJECTID },
 		{ "CHECKSUM", BTRFS_CSUM_TREE_OBJECTID },
 		{ "QUOTA", BTRFS_QUOTA_TREE_OBJECTID },
+		{ "BG", BTRFS_BLOCK_GROUP_TREE_OBJECTID },
 		{ "UUID", BTRFS_UUID_TREE_OBJECTID },
 		{ "FREE_SPACE", BTRFS_FREE_SPACE_TREE_OBJECTID },
 		{ "TREE_LOG_FIXUP", BTRFS_TREE_LOG_FIXUP_OBJECTID },
@@ -661,6 +662,10 @@ again:
 				if (!skip)
 					printf("free space");
 				break;
+			case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
+				if (!skip)
+					printf("block group");
+				break;
 			case BTRFS_MULTIPLE_OBJECTIDS:
 				if (!skip) {
 					printf("multiple");
diff --git a/print-tree.c b/print-tree.c
index e079f1a971d3..e2a43226ff87 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -770,6 +770,9 @@ void print_objectid(FILE *stream, u64 objectid, u8 type)
 	case BTRFS_FREE_SPACE_TREE_OBJECTID:
 		fprintf(stream, "FREE_SPACE_TREE");
 		break;
+	case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
+		fprintf(stream, "BLOCK_GROUP_TREE");
+		break;
 	case BTRFS_MULTIPLE_OBJECTIDS:
 		fprintf(stream, "MULTIPLE");
 		break;
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 6/7] btrfs-progs: check: Introduce support for bg-tree feature
  2019-10-08  4:49 [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature Qu Wenruo
                   ` (4 preceding siblings ...)
  2019-10-08  4:49 ` [PATCH v2 5/7] btrfs-progs: dump-tree/dump-super: Introduce support for bg tree Qu Wenruo
@ 2019-10-08  4:49 ` Qu Wenruo
  2019-10-08  4:49 ` [PATCH v2 7/7] btrfs-progs: btrfstune: Allow to enable bg-tree feature offline Qu Wenruo
  2019-10-14 15:17 ` [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature David Sterba
  7 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

Just some minor modification.

- original mode:
  * Block group item can occur in extent tree and bg tree.
- lowmem mode:
  * search block group items in bg tree if BG_TREE feature is set.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 check/main.c        | 3 ++-
 check/mode-lowmem.c | 9 +++++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/check/main.c b/check/main.c
index 94ffab46cb70..3fd0bb2317bb 100644
--- a/check/main.c
+++ b/check/main.c
@@ -6034,7 +6034,8 @@ static int check_type_with_root(u64 rootid, u8 key_type)
 	case BTRFS_EXTENT_ITEM_KEY:
 	case BTRFS_METADATA_ITEM_KEY:
 	case BTRFS_BLOCK_GROUP_ITEM_KEY:
-		if (rootid != BTRFS_EXTENT_TREE_OBJECTID)
+		if (rootid != BTRFS_EXTENT_TREE_OBJECTID &&
+		    rootid != BTRFS_BLOCK_GROUP_TREE_OBJECTID)
 			goto err;
 		break;
 	case BTRFS_ROOT_ITEM_KEY:
diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 5f7f101daab1..fcb8210984eb 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4365,7 +4365,7 @@ next:
 static int check_chunk_item(struct btrfs_fs_info *fs_info,
 			    struct extent_buffer *eb, int slot)
 {
-	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_root *root;
 	struct btrfs_root *dev_root = fs_info->dev_root;
 	struct btrfs_path path;
 	struct btrfs_key chunk_key;
@@ -4387,6 +4387,11 @@ static int check_chunk_item(struct btrfs_fs_info *fs_info,
 	int ret;
 	int err = 0;
 
+	if (btrfs_fs_incompat(fs_info, BG_TREE))
+		root = fs_info->bg_root;
+	else
+		root = fs_info->extent_root;
+
 	btrfs_item_key_to_cpu(eb, &chunk_key, slot);
 	chunk = btrfs_item_ptr(eb, slot, struct btrfs_chunk);
 	length = btrfs_chunk_length(eb, chunk);
@@ -4406,7 +4411,7 @@ static int check_chunk_item(struct btrfs_fs_info *fs_info,
 	bg_key.offset = length;
 
 	btrfs_init_path(&path);
-	ret = btrfs_search_slot(NULL, extent_root, &bg_key, &path, 0, 0);
+	ret = btrfs_search_slot(NULL, root, &bg_key, &path, 0, 0);
 	if (ret) {
 		error(
 		"chunk[%llu %llu) did not find the related block group item",
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 7/7] btrfs-progs: btrfstune: Allow to enable bg-tree feature offline
  2019-10-08  4:49 [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature Qu Wenruo
                   ` (5 preceding siblings ...)
  2019-10-08  4:49 ` [PATCH v2 6/7] btrfs-progs: check: Introduce support for bg-tree feature Qu Wenruo
@ 2019-10-08  4:49 ` Qu Wenruo
  2019-10-17  4:17   ` Anand Jain
  2019-10-14 15:17 ` [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature David Sterba
  7 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2019-10-08  4:49 UTC (permalink / raw)
  To: linux-btrfs

Add a new option '-b' for btrfstune, to enable bg-tree feature for a
unmounted fs.

This feature will convert all BLOCK_GROUP_ITEMs in extent tree to bg
tree, by reusing the existing btrfs_convert_to_bg_tree() function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 Documentation/btrfstune.asciidoc |  6 +++++
 btrfstune.c                      | 44 ++++++++++++++++++++++++++++++--
 2 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/btrfstune.asciidoc b/Documentation/btrfstune.asciidoc
index 1d6bc98deed8..ed54c2e1597f 100644
--- a/Documentation/btrfstune.asciidoc
+++ b/Documentation/btrfstune.asciidoc
@@ -26,6 +26,12 @@ means.  Please refer to the 'FILESYSTEM FEATURES' in `btrfs`(5).
 OPTIONS
 -------
 
+-b::
+(since kernel: 5.x)
++
+enable bg-tree feature (faster mount time for large fs), enabled by mkfs
+feature 'bg-tree'.
+
 -f::
 Allow dangerous changes, e.g. clear the seeding flag or change fsid. Make sure
 that you are aware of the dangers.
diff --git a/btrfstune.c b/btrfstune.c
index afa3aae35412..aa1ac568aef0 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -476,11 +476,39 @@ static void print_usage(void)
 	printf("\t-m          change fsid in metadata_uuid to a random UUID\n");
 	printf("\t            (incompat change, more lightweight than -u|-U)\n");
 	printf("\t-M UUID     change fsid in metadata_uuid to UUID\n");
+	printf("\t-b          enable bg-tree feature (mkfs: bg-tree, for faster mount time)\n");
 	printf("  general:\n");
 	printf("\t-f          allow dangerous operations, make sure that you are aware of the dangers\n");
 	printf("\t--help      print this help\n");
 }
 
+static int convert_to_bg_tree(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	trans = btrfs_start_transaction(fs_info->tree_root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		errno = -ret;
+		error("failed to start transaction: %m");
+		return ret;
+	}
+	ret = btrfs_convert_to_bg_tree(trans);
+	if (ret < 0) {
+		errno = -ret;
+		error("failed to convert: %m");
+		btrfs_abort_transaction(trans, ret);
+		return ret;
+	}
+	ret = btrfs_commit_transaction(trans, fs_info->tree_root);
+	if (ret < 0) {
+		errno = -ret;
+		error("failed to commit transaction: %m");
+	}
+	return ret;
+}
+
 int BOX_MAIN(btrfstune)(int argc, char *argv[])
 {
 	struct btrfs_root *root;
@@ -491,6 +519,7 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
 	u64 seeding_value = 0;
 	int random_fsid = 0;
 	int change_metadata_uuid = 0;
+	bool to_bg_tree = false;
 	char *new_fsid_str = NULL;
 	int ret;
 	u64 super_flags = 0;
@@ -501,7 +530,7 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
 			{ "help", no_argument, NULL, GETOPT_VAL_HELP},
 			{ NULL, 0, NULL, 0 }
 		};
-		int c = getopt_long(argc, argv, "S:rxfuU:nmM:", long_options, NULL);
+		int c = getopt_long(argc, argv, "S:rxfuU:nmM:b", long_options, NULL);
 
 		if (c < 0)
 			break;
@@ -539,6 +568,9 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
 			ctree_flags |= OPEN_CTREE_IGNORE_FSID_MISMATCH;
 			change_metadata_uuid = 1;
 			break;
+		case 'b':
+			to_bg_tree = true;
+			break;
 		case GETOPT_VAL_HELP:
 		default:
 			print_usage();
@@ -556,7 +588,7 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
 		return 1;
 	}
 	if (!super_flags && !seeding_flag && !(random_fsid || new_fsid_str) &&
-	    !change_metadata_uuid) {
+	    !change_metadata_uuid && !to_bg_tree) {
 		error("at least one option should be specified");
 		print_usage();
 		return 1;
@@ -602,6 +634,14 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
 		return 1;
 	}
 
+	if (to_bg_tree) {
+		ret = convert_to_bg_tree(root->fs_info);
+		if (ret < 0) {
+			errno = -ret;
+			error("failed to convert to bg-tree feature: %m");
+			goto out;
+		}
+	}
 	if (seeding_flag) {
 		if (btrfs_fs_incompat(root->fs_info, METADATA_UUID)) {
 			fprintf(stderr, "SEED flag cannot be changed on a metadata-uuid changed fs\n");
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2.1 4/7] btrfs-progs: mkfs: Introduce -O bg-tree
  2019-10-08  4:49 ` [PATCH v2 4/7] btrfs-progs: mkfs: Introduce -O bg-tree Qu Wenruo
@ 2019-10-08  8:16   ` Qu Wenruo
  0 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2019-10-08  8:16 UTC (permalink / raw)
  To: linux-btrfs

This allow mkfs.btrfs to create a btrfs with bg-tree feature.

This patch introduce a global function, btrfs_convert_to_bg_tree() in
extent-tree.c, to do the work.

The workflow is pretty simple:
- Create a new tree block for bg tree
- Set the BG_TREE feature for superblock
- Set the fs_info->convert_to_bg_tree flag
- Mark all block group items as dirty
- Commit transaction
  * With fs_info->convert_to_bg_tree set, we will try to delete the
    BLOCK_GROUP_ITEM in extent tree first, then write the new
    BLOCK_GROUP_ITEM into bg tree.

This btrfs_convert_to_bg_tree() will be used in mkfs after the basic fs
is created.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2.1->v2:
- Convert to bg tree after cleaning up temp chunks
  Exposed by btrfs/157 where a test case grep doesn't work as intented,
  due to SINGLE chunks.
---
 common/fsfeatures.c |  6 ++++++
 ctree.h             |  1 +
 extent-tree.c       | 38 ++++++++++++++++++++++++++++++++++++++
 mkfs/common.c       |  6 ++++--
 mkfs/main.c         | 25 +++++++++++++++++++++++++
 transaction.c       |  1 +
 6 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index 50934bd161b0..b9bd70a4b3b6 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -86,6 +86,12 @@ static const struct btrfs_fs_feature {
 		VERSION_TO_STRING2(4,0),
 		NULL, 0,
 		"no explicit hole extents for files" },
+	{ "bg-tree", BTRFS_FEATURE_INCOMPAT_BG_TREE,
+		"bg_tree",
+		VERSION_TO_STRING2(5, 0),
+		NULL, 0,
+		NULL, 0,
+		"store block group items in dedicated tree" },
 	/* Keep this one last */
 	{ "list-all", BTRFS_FEATURE_LIST_ALL, NULL }
 };
diff --git a/ctree.h b/ctree.h
index c2a18c8ab72f..3d3992487a53 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2856,5 +2856,6 @@ int btrfs_read_file(struct btrfs_root *root, u64 ino, u64 start, int len,
 
 /* extent-tree.c */
 int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unsigned long nr);
+int btrfs_convert_to_bg_tree(struct btrfs_trans_handle *trans);
 
 #endif
diff --git a/extent-tree.c b/extent-tree.c
index cb3d7a1add0f..87550ef80e37 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1524,6 +1524,44 @@ int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	return __btrfs_mod_ref(trans, root, buf, record_parent, 0);
 }
 
+int btrfs_convert_to_bg_tree(struct btrfs_trans_handle *trans)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_block_group_cache *bg;
+	struct btrfs_root *bg_root;
+	u64 features = btrfs_super_incompat_flags(fs_info->super_copy);
+	int ret;
+
+	/* create bg tree first */
+	bg_root = btrfs_create_tree(trans, fs_info, BTRFS_BLOCK_GROUP_TREE_OBJECTID);
+	if (IS_ERR(bg_root)) {
+		ret = PTR_ERR(bg_root);
+		errno = -ret;
+		error("failed to create bg tree: %m");
+		return ret;
+	}
+	fs_info->bg_root = bg_root;
+	fs_info->bg_root->track_dirty = 1;
+	fs_info->bg_root->ref_cows = 0;
+
+	/* set BG_TREE feature and mark the fs into bg_tree convert status */
+	btrfs_set_super_incompat_flags(fs_info->super_copy,
+			features | BTRFS_FEATURE_INCOMPAT_BG_TREE);
+	fs_info->convert_to_bg_tree = 1;
+
+	/*
+	 * Mark all block groups dirty so they will get converted to bg tree at
+	 * commit transaction time
+	 */
+	for (bg = btrfs_lookup_first_block_group(fs_info, 0); bg;
+	     bg = btrfs_lookup_first_block_group(fs_info,
+				bg->key.objectid + bg->key.offset))
+		set_extent_bits(&fs_info->block_group_cache, bg->key.objectid,
+				bg->key.objectid + bg->key.offset - 1,
+				BLOCK_GROUP_DIRTY);
+	return 0;
+}
+
 static int write_one_cache_group(struct btrfs_trans_handle *trans,
 				 struct btrfs_path *path,
 				 struct btrfs_block_group_cache *cache)
diff --git a/mkfs/common.c b/mkfs/common.c
index caca5e707233..876193838612 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -111,6 +111,9 @@ static int btrfs_create_tree_root(int fd, struct btrfs_mkfs_config *cfg,
 	return ret;
 }
 
+/* These features will not be set in the temporary fs */
+#define MASKED_FEATURES		(~(BTRFS_FEATURE_INCOMPAT_BG_TREE))
+
 /*
  * @fs_uuid - if NULL, generates a UUID, returns back the new filesystem UUID
  *
@@ -204,7 +207,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_super_csum_type(&super, BTRFS_CSUM_TYPE_CRC32);
 	btrfs_set_super_chunk_root_generation(&super, 1);
 	btrfs_set_super_cache_generation(&super, -1);
-	btrfs_set_super_incompat_flags(&super, cfg->features);
+	btrfs_set_super_incompat_flags(&super, cfg->features & MASKED_FEATURES);
 	if (cfg->label)
 		__strncpy_null(super.label, cfg->label, BTRFS_LABEL_SIZE - 1);
 
@@ -824,4 +827,3 @@ int test_minimum_size(const char *file, u64 min_dev_size)
 	return 0;
 }
 
-
diff --git a/mkfs/main.c b/mkfs/main.c
index b752da13aba9..55bc4288dc08 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1312,6 +1312,31 @@ raid_groups:
 		goto out;
 	}
 
+	/*
+	 * Bg tree are converted after temp chunks cleaned up, or we can
+	 * populate temp chunks.
+	 */
+	if (mkfs_cfg.features & BTRFS_FEATURE_INCOMPAT_BG_TREE) {
+		trans = btrfs_start_transaction(fs_info->tree_root, 1);
+		if (IS_ERR(trans)) {
+			error("failed to start transaction: %d", ret);
+			goto out;
+		}
+		ret = btrfs_convert_to_bg_tree(trans);
+		if (ret < 0) {
+			errno = -ret;
+			error(
+		"bg-tree feature will not be enabled, due to error: %m");
+			btrfs_abort_transaction(trans, ret);
+			goto out;
+		}
+		ret = btrfs_commit_transaction(trans, fs_info->tree_root);
+		if (ret < 0) {
+			error("failed to commit transaction: %d", ret);
+			goto out;
+		}
+	}
+
 	if (source_dir_set) {
 		ret = btrfs_mkfs_fill_dir(source_dir, root, verbose);
 		if (ret) {
diff --git a/transaction.c b/transaction.c
index 45bb9e1f9de6..5de967fb015f 100644
--- a/transaction.c
+++ b/transaction.c
@@ -225,6 +225,7 @@ commit_tree:
 	root->commit_root = NULL;
 	fs_info->running_transaction = NULL;
 	fs_info->last_trans_committed = transid;
+	fs_info->convert_to_bg_tree = 0;
 	list_for_each_entry(sinfo, &fs_info->space_info, list) {
 		if (sinfo->bytes_reserved) {
 			warning(
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/7] btrfs-progs: Refactor excluded extent functions to use fs_info
  2019-10-08  4:49 ` [PATCH v2 1/7] btrfs-progs: Refactor excluded extent functions to use fs_info Qu Wenruo
@ 2019-10-08  9:22   ` Johannes Thumshirn
  2019-10-17  2:16   ` Anand Jain
  1 sibling, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2019-10-08  9:22 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-08  4:49 [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature Qu Wenruo
                   ` (6 preceding siblings ...)
  2019-10-08  4:49 ` [PATCH v2 7/7] btrfs-progs: btrfstune: Allow to enable bg-tree feature offline Qu Wenruo
@ 2019-10-14 15:17 ` David Sterba
  2019-10-15  0:32   ` Qu Wenruo
  7 siblings, 1 reply; 28+ messages in thread
From: David Sterba @ 2019-10-14 15:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Oct 08, 2019 at 12:49:29PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/btrfs-progs/tree/bg_tree
> Which is based on v5.2.2 tag.
> 
> This patchset provides the needed user space infrastructure for BG_TREE
> feature.
> 
> Since it's an new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
> is needed to convert existing fs (unmounted) to new format.
> 
> Now btrfstune can convert regular extent tree fs to bg tree fs to
> improve mount time.

Have we settled the argument whether to use a new tree or key tricks for
the blocgroup data? I think we have not and will read the previous
discussions. For a feature like this I want to be sure we understand all
the pros and cons.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-14 15:17 ` [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature David Sterba
@ 2019-10-15  0:32   ` Qu Wenruo
  2019-10-16 11:16     ` David Sterba
  0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2019-10-15  0:32 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1585 bytes --]



On 2019/10/14 下午11:17, David Sterba wrote:
> On Tue, Oct 08, 2019 at 12:49:29PM +0800, Qu Wenruo wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/btrfs-progs/tree/bg_tree
>> Which is based on v5.2.2 tag.
>>
>> This patchset provides the needed user space infrastructure for BG_TREE
>> feature.
>>
>> Since it's an new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
>> is needed to convert existing fs (unmounted) to new format.
>>
>> Now btrfstune can convert regular extent tree fs to bg tree fs to
>> improve mount time.
> 
> Have we settled the argument whether to use a new tree or key tricks for
> the blocgroup data? I think we have not and will read the previous
> discussions. For a feature like this I want to be sure we understand all
> the pros and cons.
> 
Yep, we haven't settled on the whether creating a new tree, or
re-organize the keys.

But as my last discussion said, I see no obvious pro using the existing
extent tree to hold the new block group item keys, even we can pack them
all together.

And for backup roots, indeed I forgot to add this feature.
But to me that's a minor point, not a show stopper.

The most important aspect to me is, to allow real world user of super
large fs to try this feature, to prove the usefulness of this design,
other than my on-paper analyse.

That's why I'm pushing the patchset, even it may not pass any review.
I just want to hold a up-to-date branch so that when some one needs, it
can grab and try them themselves.

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-15  0:32   ` Qu Wenruo
@ 2019-10-16 11:16     ` David Sterba
  2019-10-16 11:19       ` Qu WenRuo
  0 siblings, 1 reply; 28+ messages in thread
From: David Sterba @ 2019-10-16 11:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Tue, Oct 15, 2019 at 08:32:30AM +0800, Qu Wenruo wrote:
> > Have we settled the argument whether to use a new tree or key tricks for
> > the blocgroup data? I think we have not and will read the previous
> > discussions. For a feature like this I want to be sure we understand all
> > the pros and cons.
> > 
> Yep, we haven't settled on the whether creating a new tree, or
> re-organize the keys.
> 
> But as my last discussion said, I see no obvious pro using the existing
> extent tree to hold the new block group item keys, even we can pack them
> all together.

For me the obvious pro is minimum change to existing set of trees.

> And for backup roots, indeed I forgot to add this feature.
> But to me that's a minor point, not a show stopper.
>
> The most important aspect to me is, to allow real world user of super
> large fs to try this feature, to prove the usefulness of this design,
> other than my on-paper analyse.
> 
> That's why I'm pushing the patchset, even it may not pass any review.
> I just want to hold a up-to-date branch so that when some one needs, it
> can grab and try them themselves.

Ok that's fine and I can add the branch to for-next for ease of testing.
I'm working on a prototype that does it the bg item key way, it compiles
and creates almost correct filesystem, so I have to fix it before
posting. The patches are on top of your bg-tree feature so we could have
both in the same kernel for testing.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-16 11:16     ` David Sterba
@ 2019-10-16 11:19       ` Qu WenRuo
  2019-10-18 17:27         ` David Sterba
  0 siblings, 1 reply; 28+ messages in thread
From: Qu WenRuo @ 2019-10-16 11:19 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2019/10/16 下午7:16, David Sterba wrote:
> On Tue, Oct 15, 2019 at 08:32:30AM +0800, Qu Wenruo wrote:
>>> Have we settled the argument whether to use a new tree or key tricks for
>>> the blocgroup data? I think we have not and will read the previous
>>> discussions. For a feature like this I want to be sure we understand all
>>> the pros and cons.
>>>
>> Yep, we haven't settled on the whether creating a new tree, or
>> re-organize the keys.
>>
>> But as my last discussion said, I see no obvious pro using the existing
>> extent tree to hold the new block group item keys, even we can pack them
>> all together.
> 
> For me the obvious pro is minimum change to existing set of trees.

That's interesting.

And indeed, since we're dealing one less tree, there is no chance to
cause the bug mentioned by Josef.
> 
>> And for backup roots, indeed I forgot to add this feature.
>> But to me that's a minor point, not a show stopper.
>>
>> The most important aspect to me is, to allow real world user of super
>> large fs to try this feature, to prove the usefulness of this design,
>> other than my on-paper analyse.
>>
>> That's why I'm pushing the patchset, even it may not pass any review.
>> I just want to hold a up-to-date branch so that when some one needs, it
>> can grab and try them themselves.
> 
> Ok that's fine and I can add the branch to for-next for ease of testing.
> I'm working on a prototype that does it the bg item key way, it compiles
> and creates almost correct filesystem, so I have to fix it before
> posting. The patches are on top of your bg-tree feature so we could have
> both in the same kernel for testing.

That's great!

As long as we're pushing a solution to the mount time problem, I can't
be more happier!

Then I guess no matter which version get merged to upstream, the
patchset is already meaningful.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/7] btrfs-progs: Refactor excluded extent functions to use fs_info
  2019-10-08  4:49 ` [PATCH v2 1/7] btrfs-progs: Refactor excluded extent functions to use fs_info Qu Wenruo
  2019-10-08  9:22   ` Johannes Thumshirn
@ 2019-10-17  2:16   ` Anand Jain
  1 sibling, 0 replies; 28+ messages in thread
From: Anand Jain @ 2019-10-17  2:16 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 10/8/19 12:49 PM, Qu Wenruo wrote:
> The following functions are just using @root to reach fs_info:
> - exclude_super_stripes
> - free_excluded_extents
> - add_excluded_extent
> 
> Refactor them to use fs_info directly.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Anand Jain <anand.jain@oracle.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/7] btrfs-progs: Refactor btrfs_read_block_groups()
  2019-10-08  4:49 ` [PATCH v2 2/7] btrfs-progs: Refactor btrfs_read_block_groups() Qu Wenruo
@ 2019-10-17  3:23   ` Anand Jain
  2019-10-17  4:33     ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: Anand Jain @ 2019-10-17  3:23 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 10/8/19 12:49 PM, Qu Wenruo wrote:
> This patch does the following refactor:
> - Refactor parameter from @root to @fs_info
> 
> - Refactor the large loop body into another function
>    Now we have a helper function, read_one_block_group(), to handle
>    block group cache and space info related routine.
> 
> - Refactor the return value
>    Even we have the code handling ret > 0 from find_first_block_group(),
>    it never works, as when there is no more block group,
>    find_first_block_group() just return -ENOENT other than 1.


  Can it be separated into patches? My concern is as it alters the return
  value of the rescue command. So we shall have clarity of a discrete
  patch to blame. Otherwise I agree its a good change.


>    This is super confusing, it's almost a mircle it even works.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   ctree.h       |   2 +-
>   disk-io.c     |   9 ++-
>   extent-tree.c | 160 ++++++++++++++++++++++++++++----------------------
>   3 files changed, 97 insertions(+), 74 deletions(-)
> 
> diff --git a/ctree.h b/ctree.h
> index 8c7b3cb40151..2899de358613 100644
> --- a/ctree.h
> +++ b/ctree.h
> @@ -2550,7 +2550,7 @@ int update_space_info(struct btrfs_fs_info *info, u64 flags,
>   		      u64 total_bytes, u64 bytes_used,
>   		      struct btrfs_space_info **space_info);
>   int btrfs_free_block_groups(struct btrfs_fs_info *info);
> -int btrfs_read_block_groups(struct btrfs_root *root);
> +int btrfs_read_block_groups(struct btrfs_fs_info *info);
>   struct btrfs_block_group_cache *
>   btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used, u64 type,
>   		      u64 chunk_offset, u64 size);
> diff --git a/disk-io.c b/disk-io.c
> index be44eead5cef..8978f0cb60c7 100644
> --- a/disk-io.c
> +++ b/disk-io.c
> @@ -983,14 +983,17 @@ int btrfs_setup_all_roots(struct btrfs_fs_info *fs_info, u64 root_tree_bytenr,
>   	fs_info->last_trans_committed = generation;
>   	if (extent_buffer_uptodate(fs_info->extent_root->node) &&
>   	    !(flags & OPEN_CTREE_NO_BLOCK_GROUPS)) {
> -		ret = btrfs_read_block_groups(fs_info->tree_root);
> +		ret = btrfs_read_block_groups(fs_info);
>   		/*
>   		 * If we don't find any blockgroups (ENOENT) we're either
>   		 * restoring or creating the filesystem, where it's expected,
>   		 * anything else is error
>   		 */
> -		if (ret != -ENOENT)
> -			return -EIO;
> +		if (ret < 0 && ret != -ENOENT) {
> +			errno = -ret;
> +			error("failed to read block groups: %m");
> +			return ret;
> +		}
>   	}


As mentioned this alters the rescue command semantics as show below.
Earlier we had only -EIO as error now its much more and accurate
which is good. fstests is fine but anything else?

cmd_rescue_chunk_recover()
   btrfs_recover_chunk_tree()
     open_ctree_with_broken_chunk()
       btrfs_setup_all_roots()

Thanks, Anand

>   	key.objectid = BTRFS_FS_TREE_OBJECTID;
> diff --git a/extent-tree.c b/extent-tree.c
> index 19d1ea0df570..9713d627764c 100644
> --- a/extent-tree.c
> +++ b/extent-tree.c
> @@ -2607,6 +2607,13 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>   	return 0;
>   }
>   
> +/*
> + * Find a block group which starts >= @key->objectid in extent tree.
> + *
> + * Return 0 for found
> + * Retrun >0 for not found
> + * Return <0 for error
> + */
>   static int find_first_block_group(struct btrfs_root *root,
>   		struct btrfs_path *path, struct btrfs_key *key)
>   {
> @@ -2636,36 +2643,95 @@ static int find_first_block_group(struct btrfs_root *root,
>   			return 0;
>   		path->slots[0]++;
>   	}
> -	ret = -ENOENT;
> +	ret = 1;
>   error:
>   	return ret;
>   }
>   
> -int btrfs_read_block_groups(struct btrfs_root *root)
> +/*
> + * Helper function to read out one BLOCK_GROUP_ITEM and insert it into
> + * block group cache.
> + *
> + * Return 0 if nothing wrong (either insert the bg cache or skip 0 sized bg)
> + * Return <0 for error.
> + */
> +static int read_one_block_group(struct btrfs_fs_info *fs_info,
> +				 struct btrfs_path *path)
>   {
> -	struct btrfs_path *path;
> -	int ret;
> -	int bit;
> -	struct btrfs_block_group_cache *cache;
> -	struct btrfs_fs_info *info = root->fs_info;
> +	struct extent_io_tree *block_group_cache = &fs_info->block_group_cache;
> +	struct extent_buffer *leaf = path->nodes[0];
>   	struct btrfs_space_info *space_info;
> -	struct extent_io_tree *block_group_cache;
> +	struct btrfs_block_group_cache *cache;
>   	struct btrfs_key key;
> -	struct btrfs_key found_key;
> -	struct extent_buffer *leaf;
> +	int slot = path->slots[0];
> +	int bit = 0;
> +	int ret;
>   
> -	block_group_cache = &info->block_group_cache;
> +	btrfs_item_key_to_cpu(leaf, &key, slot);
> +	ASSERT(key.type == BTRFS_BLOCK_GROUP_ITEM_KEY);
> +
> +	/*
> +	 * Skip 0 sized block group, don't insert them into block
> +	 * group cache tree, as its length is 0, it won't get
> +	 * freed at close_ctree() time.
> +	 */
> +	if (key.offset == 0)
> +		return 0;
> +
> +	cache = kzalloc(sizeof(*cache), GFP_NOFS);
> +	if (!cache)
> +		return -ENOMEM;
> +	read_extent_buffer(leaf, &cache->item,
> +			   btrfs_item_ptr_offset(leaf, slot),
> +			   sizeof(cache->item));
> +	memcpy(&cache->key, &key, sizeof(key));
> +	cache->cached = 0;
> +	cache->pinned = 0;
> +	cache->flags = btrfs_block_group_flags(&cache->item);
> +	if (cache->flags & BTRFS_BLOCK_GROUP_DATA) {
> +		bit = BLOCK_GROUP_DATA;
> +	} else if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
> +		bit = BLOCK_GROUP_SYSTEM;
> +	} else if (cache->flags & BTRFS_BLOCK_GROUP_METADATA) {
> +		bit = BLOCK_GROUP_METADATA;
> +	}
> +	set_avail_alloc_bits(fs_info, cache->flags);
> +	if (btrfs_chunk_readonly(fs_info, cache->key.objectid))
> +		cache->ro = 1;
> +	exclude_super_stripes(fs_info, cache);
> +
> +	ret = update_space_info(fs_info, cache->flags, cache->key.offset,
> +				btrfs_block_group_used(&cache->item),
> +				&space_info);
> +	if (ret < 0) {
> +		free(cache);
> +		return ret;
> +	}
> +	cache->space_info = space_info;
> +
> +	set_extent_bits(block_group_cache, cache->key.objectid,
> +			cache->key.objectid + cache->key.offset - 1,
> +			bit | EXTENT_LOCKED);
> +	set_state_private(block_group_cache, cache->key.objectid,
> +			  (unsigned long)cache);
> +	return 0;
> +}
>   
> -	root = info->extent_root;
> +int btrfs_read_block_groups(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_path path;
> +	struct btrfs_root *root;
> +	int ret;
> +	struct btrfs_key key;
> +
> +	root = fs_info->extent_root;
>   	key.objectid = 0;
>   	key.offset = 0;
>   	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> -	path = btrfs_alloc_path();
> -	if (!path)
> -		return -ENOMEM;
> +	btrfs_init_path(&path);
>   
>   	while(1) {
> -		ret = find_first_block_group(root, path, &key);
> +		ret = find_first_block_group(root, &path, &key);
>   		if (ret > 0) {
>   			ret = 0;
>   			goto error;
> @@ -2673,67 +2739,21 @@ int btrfs_read_block_groups(struct btrfs_root *root)
>   		if (ret != 0) {
>   			goto error;
>   		}
> -		leaf = path->nodes[0];
> -		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
> +		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
>   
> -		cache = kzalloc(sizeof(*cache), GFP_NOFS);
> -		if (!cache) {
> -			ret = -ENOMEM;
> +		ret = read_one_block_group(fs_info, &path);
> +		if (ret < 0)
>   			goto error;
> -		}
>   
> -		read_extent_buffer(leaf, &cache->item,
> -				   btrfs_item_ptr_offset(leaf, path->slots[0]),
> -				   sizeof(cache->item));
> -		memcpy(&cache->key, &found_key, sizeof(found_key));
> -		cache->cached = 0;
> -		cache->pinned = 0;
> -		key.objectid = found_key.objectid + found_key.offset;
> -		if (found_key.offset == 0)
> +		if (key.offset == 0)
>   			key.objectid++;
> -		btrfs_release_path(path);
> -
> -		/*
> -		 * Skip 0 sized block group, don't insert them into block
> -		 * group cache tree, as its length is 0, it won't get
> -		 * freed at close_ctree() time.
> -		 */
> -		if (found_key.offset == 0) {
> -			free(cache);
> -			continue;
> -		}
> -
> -		cache->flags = btrfs_block_group_flags(&cache->item);
> -		bit = 0;
> -		if (cache->flags & BTRFS_BLOCK_GROUP_DATA) {
> -			bit = BLOCK_GROUP_DATA;
> -		} else if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
> -			bit = BLOCK_GROUP_SYSTEM;
> -		} else if (cache->flags & BTRFS_BLOCK_GROUP_METADATA) {
> -			bit = BLOCK_GROUP_METADATA;
> -		}
> -		set_avail_alloc_bits(info, cache->flags);
> -		if (btrfs_chunk_readonly(info, cache->key.objectid))
> -			cache->ro = 1;
> -
> -		exclude_super_stripes(info, cache);
> -
> -		ret = update_space_info(info, cache->flags, found_key.offset,
> -					btrfs_block_group_used(&cache->item),
> -					&space_info);
> -		BUG_ON(ret);
> -		cache->space_info = space_info;
> -
> -		/* use EXTENT_LOCKED to prevent merging */
> -		set_extent_bits(block_group_cache, found_key.objectid,
> -				found_key.objectid + found_key.offset - 1,
> -				bit | EXTENT_LOCKED);
> -		set_state_private(block_group_cache, found_key.objectid,
> -				  (unsigned long)cache);
> +		else
> +			key.objectid = key.objectid + key.offset;
> +		btrfs_release_path(&path);
>   	}
>   	ret = 0;
>   error:
> -	btrfs_free_path(path);
> +	btrfs_release_path(&path);
>   	return ret;
>   }
>   
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 7/7] btrfs-progs: btrfstune: Allow to enable bg-tree feature offline
  2019-10-08  4:49 ` [PATCH v2 7/7] btrfs-progs: btrfstune: Allow to enable bg-tree feature offline Qu Wenruo
@ 2019-10-17  4:17   ` Anand Jain
  2019-10-17  4:28     ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: Anand Jain @ 2019-10-17  4:17 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs


  Depending on the size of the FS the convert may take longer, further
  fatal error (power loss; pid kill) may leave the FS in a state where
  the bg items are in both extent-tree and bg-tree.

  The lessons which lead to the implementation of metadata_uuid fsid
  suggests, for conversions its better to use the btrfstune to only
  flag the bg convert requirement and let the kernel handle of migration
  of the bg items from the extent-tree to the bg-tree as and when the
  bg-items are written.

Thanks, Anand


On 10/8/19 12:49 PM, Qu Wenruo wrote:
> Add a new option '-b' for btrfstune, to enable bg-tree feature for a
> unmounted fs.
> 
> This feature will convert all BLOCK_GROUP_ITEMs in extent tree to bg
> tree, by reusing the existing btrfs_convert_to_bg_tree() function.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   Documentation/btrfstune.asciidoc |  6 +++++
>   btrfstune.c                      | 44 ++++++++++++++++++++++++++++++--
>   2 files changed, 48 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/btrfstune.asciidoc b/Documentation/btrfstune.asciidoc
> index 1d6bc98deed8..ed54c2e1597f 100644
> --- a/Documentation/btrfstune.asciidoc
> +++ b/Documentation/btrfstune.asciidoc
> @@ -26,6 +26,12 @@ means.  Please refer to the 'FILESYSTEM FEATURES' in `btrfs`(5).
>   OPTIONS
>   -------
>   
> +-b::
> +(since kernel: 5.x)
> ++
> +enable bg-tree feature (faster mount time for large fs), enabled by mkfs
> +feature 'bg-tree'.
> +
>   -f::
>   Allow dangerous changes, e.g. clear the seeding flag or change fsid. Make sure
>   that you are aware of the dangers.
> diff --git a/btrfstune.c b/btrfstune.c
> index afa3aae35412..aa1ac568aef0 100644
> --- a/btrfstune.c
> +++ b/btrfstune.c
> @@ -476,11 +476,39 @@ static void print_usage(void)
>   	printf("\t-m          change fsid in metadata_uuid to a random UUID\n");
>   	printf("\t            (incompat change, more lightweight than -u|-U)\n");
>   	printf("\t-M UUID     change fsid in metadata_uuid to UUID\n");
> +	printf("\t-b          enable bg-tree feature (mkfs: bg-tree, for faster mount time)\n");
>   	printf("  general:\n");
>   	printf("\t-f          allow dangerous operations, make sure that you are aware of the dangers\n");
>   	printf("\t--help      print this help\n");
>   }
>   
> +static int convert_to_bg_tree(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_trans_handle *trans;
> +	int ret;
> +
> +	trans = btrfs_start_transaction(fs_info->tree_root, 1);
> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		errno = -ret;
> +		error("failed to start transaction: %m");
> +		return ret;
> +	}
> +	ret = btrfs_convert_to_bg_tree(trans);
> +	if (ret < 0) {
> +		errno = -ret;
> +		error("failed to convert: %m");
> +		btrfs_abort_transaction(trans, ret);
> +		return ret;
> +	}
> +	ret = btrfs_commit_transaction(trans, fs_info->tree_root);
> +	if (ret < 0) {
> +		errno = -ret;
> +		error("failed to commit transaction: %m");
> +	}
> +	return ret;
> +}
> +
>   int BOX_MAIN(btrfstune)(int argc, char *argv[])
>   {
>   	struct btrfs_root *root;
> @@ -491,6 +519,7 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
>   	u64 seeding_value = 0;
>   	int random_fsid = 0;
>   	int change_metadata_uuid = 0;
> +	bool to_bg_tree = false;
>   	char *new_fsid_str = NULL;
>   	int ret;
>   	u64 super_flags = 0;
> @@ -501,7 +530,7 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
>   			{ "help", no_argument, NULL, GETOPT_VAL_HELP},
>   			{ NULL, 0, NULL, 0 }
>   		};
> -		int c = getopt_long(argc, argv, "S:rxfuU:nmM:", long_options, NULL);
> +		int c = getopt_long(argc, argv, "S:rxfuU:nmM:b", long_options, NULL);
>   
>   		if (c < 0)
>   			break;
> @@ -539,6 +568,9 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
>   			ctree_flags |= OPEN_CTREE_IGNORE_FSID_MISMATCH;
>   			change_metadata_uuid = 1;
>   			break;
> +		case 'b':
> +			to_bg_tree = true;
> +			break;
>   		case GETOPT_VAL_HELP:
>   		default:
>   			print_usage();
> @@ -556,7 +588,7 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
>   		return 1;
>   	}
>   	if (!super_flags && !seeding_flag && !(random_fsid || new_fsid_str) &&
> -	    !change_metadata_uuid) {
> +	    !change_metadata_uuid && !to_bg_tree) {
>   		error("at least one option should be specified");
>   		print_usage();
>   		return 1;
> @@ -602,6 +634,14 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
>   		return 1;
>   	}
>   
> +	if (to_bg_tree) {
> +		ret = convert_to_bg_tree(root->fs_info);
> +		if (ret < 0) {
> +			errno = -ret;
> +			error("failed to convert to bg-tree feature: %m");
> +			goto out;
> +		}
> +	}
>   	if (seeding_flag) {
>   		if (btrfs_fs_incompat(root->fs_info, METADATA_UUID)) {
>   			fprintf(stderr, "SEED flag cannot be changed on a metadata-uuid changed fs\n");
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 7/7] btrfs-progs: btrfstune: Allow to enable bg-tree feature offline
  2019-10-17  4:17   ` Anand Jain
@ 2019-10-17  4:28     ` Qu Wenruo
  0 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2019-10-17  4:28 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 6254 bytes --]



On 2019/10/17 下午12:17, Anand Jain wrote:
> 
>  Depending on the size of the FS the convert may take longer, further
>  fatal error (power loss; pid kill) may leave the FS in a state where
>  the bg items are in both extent-tree and bg-tree.

That's why I'm using one transaction to convert them all.

So if the convert get interrutped, we're still safe.

Thanks,
Qu
> 
>  The lessons which lead to the implementation of metadata_uuid fsid
>  suggests, for conversions its better to use the btrfstune to only
>  flag the bg convert requirement and let the kernel handle of migration
>  of the bg items from the extent-tree to the bg-tree as and when the
>  bg-items are written.
> 
> Thanks, Anand
> 
> 
> On 10/8/19 12:49 PM, Qu Wenruo wrote:
>> Add a new option '-b' for btrfstune, to enable bg-tree feature for a
>> unmounted fs.
>>
>> This feature will convert all BLOCK_GROUP_ITEMs in extent tree to bg
>> tree, by reusing the existing btrfs_convert_to_bg_tree() function.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   Documentation/btrfstune.asciidoc |  6 +++++
>>   btrfstune.c                      | 44 ++++++++++++++++++++++++++++++--
>>   2 files changed, 48 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/btrfstune.asciidoc
>> b/Documentation/btrfstune.asciidoc
>> index 1d6bc98deed8..ed54c2e1597f 100644
>> --- a/Documentation/btrfstune.asciidoc
>> +++ b/Documentation/btrfstune.asciidoc
>> @@ -26,6 +26,12 @@ means.  Please refer to the 'FILESYSTEM FEATURES'
>> in `btrfs`(5).
>>   OPTIONS
>>   -------
>>   +-b::
>> +(since kernel: 5.x)
>> ++
>> +enable bg-tree feature (faster mount time for large fs), enabled by mkfs
>> +feature 'bg-tree'.
>> +
>>   -f::
>>   Allow dangerous changes, e.g. clear the seeding flag or change fsid.
>> Make sure
>>   that you are aware of the dangers.
>> diff --git a/btrfstune.c b/btrfstune.c
>> index afa3aae35412..aa1ac568aef0 100644
>> --- a/btrfstune.c
>> +++ b/btrfstune.c
>> @@ -476,11 +476,39 @@ static void print_usage(void)
>>       printf("\t-m          change fsid in metadata_uuid to a random
>> UUID\n");
>>       printf("\t            (incompat change, more lightweight than
>> -u|-U)\n");
>>       printf("\t-M UUID     change fsid in metadata_uuid to UUID\n");
>> +    printf("\t-b          enable bg-tree feature (mkfs: bg-tree, for
>> faster mount time)\n");
>>       printf("  general:\n");
>>       printf("\t-f          allow dangerous operations, make sure that
>> you are aware of the dangers\n");
>>       printf("\t--help      print this help\n");
>>   }
>>   +static int convert_to_bg_tree(struct btrfs_fs_info *fs_info)
>> +{
>> +    struct btrfs_trans_handle *trans;
>> +    int ret;
>> +
>> +    trans = btrfs_start_transaction(fs_info->tree_root, 1);
>> +    if (IS_ERR(trans)) {
>> +        ret = PTR_ERR(trans);
>> +        errno = -ret;
>> +        error("failed to start transaction: %m");
>> +        return ret;
>> +    }
>> +    ret = btrfs_convert_to_bg_tree(trans);
>> +    if (ret < 0) {
>> +        errno = -ret;
>> +        error("failed to convert: %m");
>> +        btrfs_abort_transaction(trans, ret);
>> +        return ret;
>> +    }
>> +    ret = btrfs_commit_transaction(trans, fs_info->tree_root);
>> +    if (ret < 0) {
>> +        errno = -ret;
>> +        error("failed to commit transaction: %m");
>> +    }
>> +    return ret;
>> +}
>> +
>>   int BOX_MAIN(btrfstune)(int argc, char *argv[])
>>   {
>>       struct btrfs_root *root;
>> @@ -491,6 +519,7 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
>>       u64 seeding_value = 0;
>>       int random_fsid = 0;
>>       int change_metadata_uuid = 0;
>> +    bool to_bg_tree = false;
>>       char *new_fsid_str = NULL;
>>       int ret;
>>       u64 super_flags = 0;
>> @@ -501,7 +530,7 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
>>               { "help", no_argument, NULL, GETOPT_VAL_HELP},
>>               { NULL, 0, NULL, 0 }
>>           };
>> -        int c = getopt_long(argc, argv, "S:rxfuU:nmM:", long_options,
>> NULL);
>> +        int c = getopt_long(argc, argv, "S:rxfuU:nmM:b",
>> long_options, NULL);
>>             if (c < 0)
>>               break;
>> @@ -539,6 +568,9 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
>>               ctree_flags |= OPEN_CTREE_IGNORE_FSID_MISMATCH;
>>               change_metadata_uuid = 1;
>>               break;
>> +        case 'b':
>> +            to_bg_tree = true;
>> +            break;
>>           case GETOPT_VAL_HELP:
>>           default:
>>               print_usage();
>> @@ -556,7 +588,7 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
>>           return 1;
>>       }
>>       if (!super_flags && !seeding_flag && !(random_fsid ||
>> new_fsid_str) &&
>> -        !change_metadata_uuid) {
>> +        !change_metadata_uuid && !to_bg_tree) {
>>           error("at least one option should be specified");
>>           print_usage();
>>           return 1;
>> @@ -602,6 +634,14 @@ int BOX_MAIN(btrfstune)(int argc, char *argv[])
>>           return 1;
>>       }
>>   +    if (to_bg_tree) {
>> +        ret = convert_to_bg_tree(root->fs_info);
>> +        if (ret < 0) {
>> +            errno = -ret;
>> +            error("failed to convert to bg-tree feature: %m");
>> +            goto out;
>> +        }
>> +    }
>>       if (seeding_flag) {
>>           if (btrfs_fs_incompat(root->fs_info, METADATA_UUID)) {
>>               fprintf(stderr, "SEED flag cannot be changed on a
>> metadata-uuid changed fs\n");
>>
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/7] btrfs-progs: Refactor btrfs_read_block_groups()
  2019-10-17  3:23   ` Anand Jain
@ 2019-10-17  4:33     ` Qu Wenruo
  2019-10-17  5:08       ` Anand Jain
  0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2019-10-17  4:33 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 12403 bytes --]



On 2019/10/17 上午11:23, Anand Jain wrote:
> On 10/8/19 12:49 PM, Qu Wenruo wrote:
>> This patch does the following refactor:
>> - Refactor parameter from @root to @fs_info
>>
>> - Refactor the large loop body into another function
>>    Now we have a helper function, read_one_block_group(), to handle
>>    block group cache and space info related routine.
>>
>> - Refactor the return value
>>    Even we have the code handling ret > 0 from find_first_block_group(),
>>    it never works, as when there is no more block group,
>>    find_first_block_group() just return -ENOENT other than 1.
> 
> 
>  Can it be separated into patches? My concern is as it alters the return
>  value of the rescue command. So we shall have clarity of a discrete
>  patch to blame. Otherwise I agree its a good change.

No problem.

What about 3 patches split by the mentioned 3 refactors?
> 
> 
>>    This is super confusing, it's almost a mircle it even works.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   ctree.h       |   2 +-
>>   disk-io.c     |   9 ++-
>>   extent-tree.c | 160 ++++++++++++++++++++++++++++----------------------
>>   3 files changed, 97 insertions(+), 74 deletions(-)
>>
>> diff --git a/ctree.h b/ctree.h
>> index 8c7b3cb40151..2899de358613 100644
>> --- a/ctree.h
>> +++ b/ctree.h
>> @@ -2550,7 +2550,7 @@ int update_space_info(struct btrfs_fs_info
>> *info, u64 flags,
>>                 u64 total_bytes, u64 bytes_used,
>>                 struct btrfs_space_info **space_info);
>>   int btrfs_free_block_groups(struct btrfs_fs_info *info);
>> -int btrfs_read_block_groups(struct btrfs_root *root);
>> +int btrfs_read_block_groups(struct btrfs_fs_info *info);
>>   struct btrfs_block_group_cache *
>>   btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used,
>> u64 type,
>>                 u64 chunk_offset, u64 size);
>> diff --git a/disk-io.c b/disk-io.c
>> index be44eead5cef..8978f0cb60c7 100644
>> --- a/disk-io.c
>> +++ b/disk-io.c
>> @@ -983,14 +983,17 @@ int btrfs_setup_all_roots(struct btrfs_fs_info
>> *fs_info, u64 root_tree_bytenr,
>>       fs_info->last_trans_committed = generation;
>>       if (extent_buffer_uptodate(fs_info->extent_root->node) &&
>>           !(flags & OPEN_CTREE_NO_BLOCK_GROUPS)) {
>> -        ret = btrfs_read_block_groups(fs_info->tree_root);
>> +        ret = btrfs_read_block_groups(fs_info);
>>           /*
>>            * If we don't find any blockgroups (ENOENT) we're either
>>            * restoring or creating the filesystem, where it's expected,
>>            * anything else is error
>>            */
>> -        if (ret != -ENOENT)
>> -            return -EIO;
>> +        if (ret < 0 && ret != -ENOENT) {
>> +            errno = -ret;
>> +            error("failed to read block groups: %m");
>> +            return ret;
>> +        }
>>       }
> 
> 
> As mentioned this alters the rescue command semantics as show below.
> Earlier we had only -EIO as error now its much more and accurate
> which is good. fstests is fine but anything else?
> 
> cmd_rescue_chunk_recover()
>   btrfs_recover_chunk_tree()
>     open_ctree_with_broken_chunk()
>       btrfs_setup_all_roots()

I'm not sure if I got the point.

Although btrfs_setup_all_roots() get called in above call chain, it
doesn't have any special handling of -EIO or others.

It just reads the extent tree root.

Would you mind to explain a little more?

Thanks,
Qu


> 
> Thanks, Anand
> 
>>       key.objectid = BTRFS_FS_TREE_OBJECTID;
>> diff --git a/extent-tree.c b/extent-tree.c
>> index 19d1ea0df570..9713d627764c 100644
>> --- a/extent-tree.c
>> +++ b/extent-tree.c
>> @@ -2607,6 +2607,13 @@ int btrfs_free_block_groups(struct
>> btrfs_fs_info *info)
>>       return 0;
>>   }
>>   +/*
>> + * Find a block group which starts >= @key->objectid in extent tree.
>> + *
>> + * Return 0 for found
>> + * Retrun >0 for not found
>> + * Return <0 for error
>> + */
>>   static int find_first_block_group(struct btrfs_root *root,
>>           struct btrfs_path *path, struct btrfs_key *key)
>>   {
>> @@ -2636,36 +2643,95 @@ static int find_first_block_group(struct
>> btrfs_root *root,
>>               return 0;
>>           path->slots[0]++;
>>       }
>> -    ret = -ENOENT;
>> +    ret = 1;
>>   error:
>>       return ret;
>>   }
>>   -int btrfs_read_block_groups(struct btrfs_root *root)
>> +/*
>> + * Helper function to read out one BLOCK_GROUP_ITEM and insert it into
>> + * block group cache.
>> + *
>> + * Return 0 if nothing wrong (either insert the bg cache or skip 0
>> sized bg)
>> + * Return <0 for error.
>> + */
>> +static int read_one_block_group(struct btrfs_fs_info *fs_info,
>> +                 struct btrfs_path *path)
>>   {
>> -    struct btrfs_path *path;
>> -    int ret;
>> -    int bit;
>> -    struct btrfs_block_group_cache *cache;
>> -    struct btrfs_fs_info *info = root->fs_info;
>> +    struct extent_io_tree *block_group_cache =
>> &fs_info->block_group_cache;
>> +    struct extent_buffer *leaf = path->nodes[0];
>>       struct btrfs_space_info *space_info;
>> -    struct extent_io_tree *block_group_cache;
>> +    struct btrfs_block_group_cache *cache;
>>       struct btrfs_key key;
>> -    struct btrfs_key found_key;
>> -    struct extent_buffer *leaf;
>> +    int slot = path->slots[0];
>> +    int bit = 0;
>> +    int ret;
>>   -    block_group_cache = &info->block_group_cache;
>> +    btrfs_item_key_to_cpu(leaf, &key, slot);
>> +    ASSERT(key.type == BTRFS_BLOCK_GROUP_ITEM_KEY);
>> +
>> +    /*
>> +     * Skip 0 sized block group, don't insert them into block
>> +     * group cache tree, as its length is 0, it won't get
>> +     * freed at close_ctree() time.
>> +     */
>> +    if (key.offset == 0)
>> +        return 0;
>> +
>> +    cache = kzalloc(sizeof(*cache), GFP_NOFS);
>> +    if (!cache)
>> +        return -ENOMEM;
>> +    read_extent_buffer(leaf, &cache->item,
>> +               btrfs_item_ptr_offset(leaf, slot),
>> +               sizeof(cache->item));
>> +    memcpy(&cache->key, &key, sizeof(key));
>> +    cache->cached = 0;
>> +    cache->pinned = 0;
>> +    cache->flags = btrfs_block_group_flags(&cache->item);
>> +    if (cache->flags & BTRFS_BLOCK_GROUP_DATA) {
>> +        bit = BLOCK_GROUP_DATA;
>> +    } else if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
>> +        bit = BLOCK_GROUP_SYSTEM;
>> +    } else if (cache->flags & BTRFS_BLOCK_GROUP_METADATA) {
>> +        bit = BLOCK_GROUP_METADATA;
>> +    }
>> +    set_avail_alloc_bits(fs_info, cache->flags);
>> +    if (btrfs_chunk_readonly(fs_info, cache->key.objectid))
>> +        cache->ro = 1;
>> +    exclude_super_stripes(fs_info, cache);
>> +
>> +    ret = update_space_info(fs_info, cache->flags, cache->key.offset,
>> +                btrfs_block_group_used(&cache->item),
>> +                &space_info);
>> +    if (ret < 0) {
>> +        free(cache);
>> +        return ret;
>> +    }
>> +    cache->space_info = space_info;
>> +
>> +    set_extent_bits(block_group_cache, cache->key.objectid,
>> +            cache->key.objectid + cache->key.offset - 1,
>> +            bit | EXTENT_LOCKED);
>> +    set_state_private(block_group_cache, cache->key.objectid,
>> +              (unsigned long)cache);
>> +    return 0;
>> +}
>>   -    root = info->extent_root;
>> +int btrfs_read_block_groups(struct btrfs_fs_info *fs_info)
>> +{
>> +    struct btrfs_path path;
>> +    struct btrfs_root *root;
>> +    int ret;
>> +    struct btrfs_key key;
>> +
>> +    root = fs_info->extent_root;
>>       key.objectid = 0;
>>       key.offset = 0;
>>       key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
>> -    path = btrfs_alloc_path();
>> -    if (!path)
>> -        return -ENOMEM;
>> +    btrfs_init_path(&path);
>>         while(1) {
>> -        ret = find_first_block_group(root, path, &key);
>> +        ret = find_first_block_group(root, &path, &key);
>>           if (ret > 0) {
>>               ret = 0;
>>               goto error;
>> @@ -2673,67 +2739,21 @@ int btrfs_read_block_groups(struct btrfs_root
>> *root)
>>           if (ret != 0) {
>>               goto error;
>>           }
>> -        leaf = path->nodes[0];
>> -        btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
>> +        btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
>>   -        cache = kzalloc(sizeof(*cache), GFP_NOFS);
>> -        if (!cache) {
>> -            ret = -ENOMEM;
>> +        ret = read_one_block_group(fs_info, &path);
>> +        if (ret < 0)
>>               goto error;
>> -        }
>>   -        read_extent_buffer(leaf, &cache->item,
>> -                   btrfs_item_ptr_offset(leaf, path->slots[0]),
>> -                   sizeof(cache->item));
>> -        memcpy(&cache->key, &found_key, sizeof(found_key));
>> -        cache->cached = 0;
>> -        cache->pinned = 0;
>> -        key.objectid = found_key.objectid + found_key.offset;
>> -        if (found_key.offset == 0)
>> +        if (key.offset == 0)
>>               key.objectid++;
>> -        btrfs_release_path(path);
>> -
>> -        /*
>> -         * Skip 0 sized block group, don't insert them into block
>> -         * group cache tree, as its length is 0, it won't get
>> -         * freed at close_ctree() time.
>> -         */
>> -        if (found_key.offset == 0) {
>> -            free(cache);
>> -            continue;
>> -        }
>> -
>> -        cache->flags = btrfs_block_group_flags(&cache->item);
>> -        bit = 0;
>> -        if (cache->flags & BTRFS_BLOCK_GROUP_DATA) {
>> -            bit = BLOCK_GROUP_DATA;
>> -        } else if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
>> -            bit = BLOCK_GROUP_SYSTEM;
>> -        } else if (cache->flags & BTRFS_BLOCK_GROUP_METADATA) {
>> -            bit = BLOCK_GROUP_METADATA;
>> -        }
>> -        set_avail_alloc_bits(info, cache->flags);
>> -        if (btrfs_chunk_readonly(info, cache->key.objectid))
>> -            cache->ro = 1;
>> -
>> -        exclude_super_stripes(info, cache);
>> -
>> -        ret = update_space_info(info, cache->flags, found_key.offset,
>> -                    btrfs_block_group_used(&cache->item),
>> -                    &space_info);
>> -        BUG_ON(ret);
>> -        cache->space_info = space_info;
>> -
>> -        /* use EXTENT_LOCKED to prevent merging */
>> -        set_extent_bits(block_group_cache, found_key.objectid,
>> -                found_key.objectid + found_key.offset - 1,
>> -                bit | EXTENT_LOCKED);
>> -        set_state_private(block_group_cache, found_key.objectid,
>> -                  (unsigned long)cache);
>> +        else
>> +            key.objectid = key.objectid + key.offset;
>> +        btrfs_release_path(&path);
>>       }
>>       ret = 0;
>>   error:
>> -    btrfs_free_path(path);
>> +    btrfs_release_path(&path);
>>       return ret;
>>   }
>>  
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/7] btrfs-progs: Enable read-write ability for 'bg_tree' feature
  2019-10-08  4:49 ` [PATCH v2 3/7] btrfs-progs: Enable read-write ability for 'bg_tree' feature Qu Wenruo
@ 2019-10-17  4:56   ` Anand Jain
  0 siblings, 0 replies; 28+ messages in thread
From: Anand Jain @ 2019-10-17  4:56 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 10/8/19 12:49 PM, Qu Wenruo wrote:
> Allow btrfs-progs to open, read and write 'bg_tree' enabled fs.
> 
> The modification itself is not large, as block groups items are only
> used at 4 timing:
> 1) open_ctree()
>     We only need to populate fs_info->bg_root and read block group items
>     from fs_info->bg_root.
>     The obvious change is, we don't need to do btrfs_search_slot() for
>     each block group item, but btrfs_next_item() is enough.
> 
>     This should hugely reduce open_ctree() execution duration.
> 
> 2) btrfs_commit_transaction()
>     We need to write back dirty block group items back to bg_root.
> 
>     The modification here is to insert new block group item if we can't
>     find one existing in bg_root, and delete the old one in extent tree
>     if we're converting to bg_tree feature.
> 
> 3) btrfs_make_block_group()
>     Just change @root for btrfs_insert_item() from @extent_root to
>     @bg_root for fs with that feature.
> 
>     This modification needs extra handling for converting case, where
>     block group items can be either in extent tree or bg tree.
> 
> 4) free_block_group_item()
>     Just delete the block group item in extent tree.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

looks good.
two nit below.

Reviewed-by: Anand Jain <anand.jain@oracle.com>


> ---
>   ctree.h       |  11 +++-
>   disk-io.c     |  20 +++++++
>   extent-tree.c | 152 +++++++++++++++++++++++++++++++++++++++++++++-----
>   3 files changed, 168 insertions(+), 15 deletions(-)
> 
> diff --git a/ctree.h b/ctree.h
> index 2899de358613..c2a18c8ab72f 100644
> --- a/ctree.h
> +++ b/ctree.h
> @@ -89,6 +89,9 @@ struct btrfs_free_space_ctl;
>   /* tracks free space in block groups. */
>   #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
>   
> +/* store BLOCK_GROUP_ITEMS in a seperate tree */
> +#define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
> +
>   /* device stats in the device tree */
>   #define BTRFS_DEV_STATS_OBJECTID 0ULL
>   
> @@ -490,6 +493,7 @@ struct btrfs_super_block {
>   #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
>   #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
>   #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID    (1ULL << 10)
> +#define BTRFS_FEATURE_INCOMPAT_BG_TREE		(1ULL << 11)
>   
>   #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
>   
> @@ -513,7 +517,8 @@ struct btrfs_super_block {
>   	 BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS |		\
>   	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
>   	 BTRFS_FEATURE_INCOMPAT_NO_HOLES |		\
> -	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
> +	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID |		\
> +	 BTRFS_FEATURE_INCOMPAT_BG_TREE)
>   
>   /*
>    * A leaf is full of items. offset and size tell us where to find
> @@ -1123,6 +1128,7 @@ struct btrfs_fs_info {
>   	struct btrfs_root *quota_root;
>   	struct btrfs_root *free_space_root;
>   	struct btrfs_root *uuid_root;
> +	struct btrfs_root *bg_root;
>   
>   	struct rb_root fs_root_tree;
>   
> @@ -1175,6 +1181,9 @@ struct btrfs_fs_info {
>   	unsigned int avoid_sys_chunk_alloc:1;
>   	unsigned int finalize_on_close:1;
>   
> +	/* Converting from bg in extent tree to bg tree */
> +	unsigned int convert_to_bg_tree:1;
> +

  This should probably be part of the patch 7/7.

>   	int transaction_aborted;
>   
>   	int (*free_extent_hook)(struct btrfs_fs_info *fs_info,
> diff --git a/disk-io.c b/disk-io.c
> index 8978f0cb60c7..38248aa895b8 100644
> --- a/disk-io.c
> +++ b/disk-io.c
> @@ -716,6 +716,8 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
>   	if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
>   		return fs_info->free_space_root ? fs_info->free_space_root :
>   						ERR_PTR(-ENOENT);
> +	if (location->objectid == BTRFS_BLOCK_GROUP_TREE_OBJECTID)
> +		return fs_info->bg_root ? fs_info->bg_root : ERR_PTR(-ENOENT);
>   
>   	BUG_ON(location->objectid == BTRFS_TREE_RELOC_OBJECTID ||
>   	       location->offset != (u64)-1);
> @@ -768,6 +770,7 @@ struct btrfs_fs_info *btrfs_new_fs_info(int writable, u64 sb_bytenr)
>   	fs_info->quota_root = calloc(1, sizeof(struct btrfs_root));
>   	fs_info->free_space_root = calloc(1, sizeof(struct btrfs_root));
>   	fs_info->uuid_root = calloc(1, sizeof(struct btrfs_root));
> +	fs_info->bg_root = calloc(1, sizeof(struct btrfs_root));
>   	fs_info->super_copy = calloc(1, BTRFS_SUPER_INFO_SIZE);
>   
>   	if (!fs_info->tree_root || !fs_info->extent_root ||
> @@ -930,6 +933,21 @@ int btrfs_setup_all_roots(struct btrfs_fs_info *fs_info, u64 root_tree_bytenr,
>   		return ret;
>   	fs_info->extent_root->track_dirty = 1;
>   
> +	if (btrfs_fs_incompat(fs_info, BG_TREE)) {
> +		ret = setup_root_or_create_block(fs_info, flags,
> +					fs_info->bg_root,
> +					BTRFS_BLOCK_GROUP_TREE_OBJECTID, "bg");
> +		if (ret < 0) {
> +			error("Couldn't setup bg tree");
> +			return ret;
> +		}
> +		fs_info->bg_root->track_dirty = 1;
> +		fs_info->bg_root->ref_cows = 0;
> +	} else {
> +		free(fs_info->bg_root);
> +		fs_info->bg_root = NULL;
> +	}
> +
>   	ret = find_and_setup_root(root, fs_info, BTRFS_DEV_TREE_OBJECTID,
>   				  fs_info->dev_root);
>   	if (ret) {
> @@ -1012,6 +1030,8 @@ void btrfs_release_all_roots(struct btrfs_fs_info *fs_info)
>   		free_extent_buffer(fs_info->free_space_root->node);
>   	if (fs_info->quota_root)
>   		free_extent_buffer(fs_info->quota_root->node);
> +	if (fs_info->bg_root)
> +		free_extent_buffer(fs_info->bg_root->node);
>   	if (fs_info->csum_root)
>   		free_extent_buffer(fs_info->csum_root->node);
>   	if (fs_info->dev_root)
> diff --git a/extent-tree.c b/extent-tree.c
> index 9713d627764c..cb3d7a1add0f 100644
> --- a/extent-tree.c
> +++ b/extent-tree.c
> @@ -1528,22 +1528,68 @@ static int write_one_cache_group(struct btrfs_trans_handle *trans,
>   				 struct btrfs_path *path,
>   				 struct btrfs_block_group_cache *cache)
>   {
> +	bool is_bg_tree = btrfs_fs_incompat(trans->fs_info, BG_TREE);
>   	int ret;
> -	struct btrfs_root *extent_root = trans->fs_info->extent_root;
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_root *root;
>   	unsigned long bi;
>   	struct extent_buffer *leaf;
>   
> -	ret = btrfs_search_slot(trans, extent_root, &cache->key, path, 0, 1);
> +	if (is_bg_tree)
> +		root = fs_info->bg_root;
> +	else
> +		root = fs_info->extent_root;
> +
> +	ret = btrfs_search_slot(trans, root, &cache->key, path, 0, 1);
>   	if (ret < 0)
> -		goto fail;


> -	BUG_ON(ret);

  Unrelated change ?

Thanks, Anand

> +		goto out;
>   
> -	leaf = path->nodes[0];
> -	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
> -	write_extent_buffer(leaf, &cache->item, bi, sizeof(cache->item));
> -	btrfs_mark_buffer_dirty(leaf);
> -	btrfs_release_path(path);
> -fail:
> +	if (ret == 0) {
> +		/* Update existing bg */
> +		leaf = path->nodes[0];
> +		bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
> +		write_extent_buffer(leaf, &cache->item, bi, sizeof(cache->item));
> +		btrfs_mark_buffer_dirty(leaf);
> +		btrfs_release_path(path);
> +	} else {



> +		btrfs_release_path(path);
> +
> +		/*
> +		 * Insert new bg item
> +		 *
> +		 * This only happens for bg_tree feature
> +		 */
> +		if (!is_bg_tree) {
> +			error("can't find block group item for bytenr %llu",
> +			      cache->key.objectid);
> +			ret = -ENOENT;
> +			goto out;
> +		}
> +		ret = btrfs_insert_item(trans, root, &cache->key, &cache->item,
> +					sizeof(cache->item));
> +		if (ret < 0)
> +			goto out;
> +
> +		/* Also delete the existing one in next tree if needed */
> +		if (fs_info->convert_to_bg_tree) {
> +			ret = btrfs_search_slot(trans, fs_info->extent_root,
> +						&cache->key, path, -1, 1);
> +			if (ret < 0) {
> +				btrfs_release_path(path);
> +				goto out;
> +			}
> +			/* Good, already converted */
> +			if (ret > 0) {
> +				ret = 0;
> +				btrfs_release_path(path);
> +				goto out;
> +			}
> +			/* Delete old block group item in extent tree */
> +			ret = btrfs_del_item(trans, fs_info->extent_root, path);
> +			btrfs_release_path(path);
> +		}
> +	}
> +out:
>   	if (ret)
>   		return ret;
>   	return 0;
> @@ -2717,14 +2763,66 @@ static int read_one_block_group(struct btrfs_fs_info *fs_info,
>   	return 0;
>   }
>   
> +static int read_block_group_tree(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_root *root = fs_info->bg_root;
> +	struct btrfs_key key = { 0 };
> +	struct btrfs_path path;
> +	int ret;
> +
> +	btrfs_init_path(&path);
> +	ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
> +	if (ret < 0) {
> +		errno = -ret;
> +		error("failed to search block group tree: %m");
> +		return ret;
> +	}
> +	if (ret == 0)
> +		goto invalid_key;
> +
> +	while (1) {
> +		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
> +		if (key.type != BTRFS_BLOCK_GROUP_ITEM_KEY)
> +			goto invalid_key;
> +
> +		ret = read_one_block_group(fs_info, &path);
> +		if (ret < 0) {
> +			errno = -ret;
> +			error("failed to read one block group: %m");
> +			goto out;
> +		}
> +		ret = btrfs_next_item(root, &path);
> +		if (ret < 0) {
> +			errno = -ret;
> +			error("failed to search block group tree: %m");
> +			goto out;
> +		}
> +		if (ret > 0) {
> +			ret = 0;
> +			break;
> +		}
> +	}
> +out:
> +	btrfs_release_path(&path);
> +	return ret;
> +
> +invalid_key:
> +	error("invalid key (%llu, %u, %llu) found in block group tree",
> +	      key.objectid, key.type, key.offset);
> +	btrfs_release_path(&path);
> +	return -EUCLEAN;
> +}
> +
>   int btrfs_read_block_groups(struct btrfs_fs_info *fs_info)
>   {
>   	struct btrfs_path path;
> -	struct btrfs_root *root;
> +	struct btrfs_root *root = fs_info->extent_root;
>   	int ret;
>   	struct btrfs_key key;
>   
> -	root = fs_info->extent_root;
> +	if (btrfs_fs_incompat(fs_info, BG_TREE))
> +		return read_block_group_tree(fs_info);
> +
>   	key.objectid = 0;
>   	key.offset = 0;
>   	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> @@ -2804,12 +2902,17 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
>   			   u64 type, u64 chunk_offset, u64 size)
>   {
>   	int ret;
> -	struct btrfs_root *extent_root = fs_info->extent_root;
> +	struct btrfs_root *root;
>   	struct btrfs_block_group_cache *cache;
>   
> +	if (btrfs_fs_incompat(fs_info, BG_TREE))
> +		root = fs_info->bg_root;
> +	else
> +		root = fs_info->extent_root;
> +
>   	cache = btrfs_add_block_group(fs_info, bytes_used, type, chunk_offset,
>   				      size);
> -	ret = btrfs_insert_item(trans, extent_root, &cache->key, &cache->item,
> +	ret = btrfs_insert_item(trans, root, &cache->key, &cache->item,
>   				sizeof(cache->item));
>   	BUG_ON(ret);
>   
> @@ -2943,8 +3046,16 @@ static int free_block_group_item(struct btrfs_trans_handle *trans,
>   	if (!path)
>   		return -ENOMEM;
>   
> +	/* Using bg tree only */
> +	if (btrfs_fs_incompat(fs_info, BG_TREE) && !fs_info->convert_to_bg_tree)
> +		goto bg_tree;
> +
>   	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
>   	if (ret > 0) {
> +		if (btrfs_fs_incompat(fs_info, BG_TREE)) {
> +			btrfs_release_path(path);
> +			goto bg_tree;
> +		}
>   		ret = -ENOENT;
>   		goto out;
>   	}
> @@ -2952,6 +3063,19 @@ static int free_block_group_item(struct btrfs_trans_handle *trans,
>   		goto out;
>   
>   	ret = btrfs_del_item(trans, root, path);
> +	goto out;
> +
> +bg_tree:
> +	root = fs_info->bg_root;
> +	ret = btrfs_search_slot(trans, fs_info->bg_root, &key, path, -1, 1);
> +	if (ret < 0)
> +		goto out;
> +	if (ret > 0) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = btrfs_del_item(trans, root, path);
> +
>   out:
>   	btrfs_free_path(path);
>   	return ret;
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/7] btrfs-progs: Refactor btrfs_read_block_groups()
  2019-10-17  4:33     ` Qu Wenruo
@ 2019-10-17  5:08       ` Anand Jain
  0 siblings, 0 replies; 28+ messages in thread
From: Anand Jain @ 2019-10-17  5:08 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs



On 10/17/19 12:33 PM, Qu Wenruo wrote:
> 
> 
> On 2019/10/17 上午11:23, Anand Jain wrote:
>> On 10/8/19 12:49 PM, Qu Wenruo wrote:
>>> This patch does the following refactor:
>>> - Refactor parameter from @root to @fs_info
>>>
>>> - Refactor the large loop body into another function
>>>     Now we have a helper function, read_one_block_group(), to handle
>>>     block group cache and space info related routine.
>>>
>>> - Refactor the return value
>>>     Even we have the code handling ret > 0 from find_first_block_group(),
>>>     it never works, as when there is no more block group,
>>>     find_first_block_group() just return -ENOENT other than 1.
>>
>>
>>   Can it be separated into patches? My concern is as it alters the return
>>   value of the rescue command. So we shall have clarity of a discrete
>>   patch to blame. Otherwise I agree its a good change.
> 
> No problem.
> 
> What about 3 patches split by the mentioned 3 refactors?
>>
>>
>>>     This is super confusing, it's almost a mircle it even works.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>    ctree.h       |   2 +-
>>>    disk-io.c     |   9 ++-
>>>    extent-tree.c | 160 ++++++++++++++++++++++++++++----------------------
>>>    3 files changed, 97 insertions(+), 74 deletions(-)
>>>
>>> diff --git a/ctree.h b/ctree.h
>>> index 8c7b3cb40151..2899de358613 100644
>>> --- a/ctree.h
>>> +++ b/ctree.h
>>> @@ -2550,7 +2550,7 @@ int update_space_info(struct btrfs_fs_info
>>> *info, u64 flags,
>>>                  u64 total_bytes, u64 bytes_used,
>>>                  struct btrfs_space_info **space_info);
>>>    int btrfs_free_block_groups(struct btrfs_fs_info *info);
>>> -int btrfs_read_block_groups(struct btrfs_root *root);
>>> +int btrfs_read_block_groups(struct btrfs_fs_info *info);
>>>    struct btrfs_block_group_cache *
>>>    btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used,
>>> u64 type,
>>>                  u64 chunk_offset, u64 size);
>>> diff --git a/disk-io.c b/disk-io.c
>>> index be44eead5cef..8978f0cb60c7 100644
>>> --- a/disk-io.c
>>> +++ b/disk-io.c
>>> @@ -983,14 +983,17 @@ int btrfs_setup_all_roots(struct btrfs_fs_info
>>> *fs_info, u64 root_tree_bytenr,
>>>        fs_info->last_trans_committed = generation;
>>>        if (extent_buffer_uptodate(fs_info->extent_root->node) &&
>>>            !(flags & OPEN_CTREE_NO_BLOCK_GROUPS)) {
>>> -        ret = btrfs_read_block_groups(fs_info->tree_root);
>>> +        ret = btrfs_read_block_groups(fs_info);
>>>            /*
>>>             * If we don't find any blockgroups (ENOENT) we're either
>>>             * restoring or creating the filesystem, where it's expected,
>>>             * anything else is error
>>>             */
>>> -        if (ret != -ENOENT)
>>> -            return -EIO;
>>> +        if (ret < 0 && ret != -ENOENT) {
>>> +            errno = -ret;
>>> +            error("failed to read block groups: %m");
>>> +            return ret;
>>> +        }
>>>        }
>>
>>
>> As mentioned this alters the rescue command semantics as show below.
>> Earlier we had only -EIO as error now its much more and accurate
>> which is good. fstests is fine but anything else?
>>
>> cmd_rescue_chunk_recover()
>>    btrfs_recover_chunk_tree()
>>      open_ctree_with_broken_chunk()
>>        btrfs_setup_all_roots()
> 
> I'm not sure if I got the point.
> 
> Although btrfs_setup_all_roots() get called in above call chain, it
> doesn't have any special handling of -EIO or others.
> 
> It just reads the extent tree root.
> 
> Would you mind to explain a little more?

  sure.
  The above thread is in the call chain of the command
     btrfs rescue chunk-recover [options] <device>"

  And as the its return error code is being changed for the same problem,
  so a separate patch not part of the bg-tree changes would make sense.

Thanks, Anand

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-16 11:19       ` Qu WenRuo
@ 2019-10-18 17:27         ` David Sterba
  2019-10-19  0:04           ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: David Sterba @ 2019-10-18 17:27 UTC (permalink / raw)
  To: Qu WenRuo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Wed, Oct 16, 2019 at 11:19:54AM +0000, Qu WenRuo wrote:
> >> The most important aspect to me is, to allow real world user of super
> >> large fs to try this feature, to prove the usefulness of this design,
> >> other than my on-paper analyse.
> >>
> >> That's why I'm pushing the patchset, even it may not pass any review.
> >> I just want to hold a up-to-date branch so that when some one needs, it
> >> can grab and try them themselves.
> > 
> > Ok that's fine and I can add the branch to for-next for ease of testing.
> > I'm working on a prototype that does it the bg item key way, it compiles
> > and creates almost correct filesystem, so I have to fix it before
> > posting. The patches are on top of your bg-tree feature so we could have
> > both in the same kernel for testing.
> 
> That's great!
> 
> As long as we're pushing a solution to the mount time problem, I can't
> be more happier!
> 
> Then I guess no matter which version get merged to upstream, the
> patchset is already meaningful.

We'll see what works in the end, I'm getting to the point where the
prototype almost works and am debugging weird problems or making sure
it's correct. So I'll dump the ideas here and link to the code so you
can have a look.

We agree on the point that the block group items must be packed. The key
approach should move the new BGI to the beginning, ie. key type is
smaller than anything that appears in the extent tree. I chose 100 for
the prototype, it could change.

To keep changes to minimum, the new BGI uses the same block group item,
so the only difference then becomes how we search for the items.

Packing of the items is done by swapping the key objectid and offset.

Normal BGI has bg.start == key.objectid and bg.length == key.offset. As
the objectid is the thing that scatters the items all over the tree.

So the new BGI has bg.length == key.objectid and bg.start == key.offset.
As most of block groups are of same size, or from a small set, they're
packed.

The nice thing is that a lot of code can be shared between BGI and new
BGI, just needs some care with searches, inserts and search key
advances.

I'm now stuck a bit at mkfs, where the 8M block groups are separated by
some METADATA_ITEMs

 item 0 key (8388608 BLOCK_GROUP_ITEM_NEW 13631488) itemoff 16259 itemsize 24
         block group NEW used 0 chunk_objectid 256 flags DATA
 item 1 key (8388608 BLOCK_GROUP_ITEM_NEW 22020096) itemoff 16235 itemsize 24
         block group NEW used 16384 chunk_objectid 256 flags SYSTEM|DUP
 item 2 key (22036480 METADATA_ITEM 0) itemoff 16202 itemsize 33
         refs 1 gen 5 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root CHUNK_TREE
 item 3 key (30408704 METADATA_ITEM 0) itemoff 16169 itemsize 33
         refs 1 gen 4 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root FS_TREE
 item 4 key (30474240 METADATA_ITEM 0) itemoff 16136 itemsize 33
         refs 1 gen 4 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root CSUM_TREE
 item 5 key (30490624 METADATA_ITEM 0) itemoff 16103 itemsize 33
         refs 1 gen 4 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root DATA_RELOC_TREE
 item 6 key (30507008 METADATA_ITEM 0) itemoff 16070 itemsize 33
         refs 1 gen 4 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root UUID_TREE
 item 7 key (30523392 METADATA_ITEM 0) itemoff 16037 itemsize 33
         refs 1 gen 5 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root EXTENT_TREE
 item 8 key (30539776 METADATA_ITEM 0) itemoff 16004 itemsize 33
         refs 1 gen 5 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root DEV_TREE
 item 9 key (30556160 METADATA_ITEM 0) itemoff 15971 itemsize 33
         refs 1 gen 5 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root ROOT_TREE
 item 10 key (107347968 BLOCK_GROUP_ITEM_NEW 30408704) itemoff 15947 itemsize 24
         block group NEW used 114688 chunk_objectid 256 flags METADATA|DUP

After item 10, the rest of the block group would appear, and basically
the rest of the extent tree, many other items.

I don't want
to make hardcoded assumptins, that maximum objecit is 1G, but so far was
not able to come up with a generic and reliable formula how to set up
key for next search to reach item (107347968 BLOCK_GROUP_ITEM_NEW
30408704) once (8388608 BLOCK_GROUP_ITEM_NEW 22020096) has been
processed.

The swapped objectid and offset is the hard part for search because we
lose the linearity of block group start.  Advance objectid by one and
search again ie. (8388608+1 BGI_NEW 22020096) will land on the next
metadata item. Iterating objectid by one would eventually reach the 1G
block group item, but what to do after the last 1G item is found and we
want do decide wheather to continue or not?

This would be easy with the bg_tree, because we'd know that all items in
the tree are just the block group items. Some sort of enumeration could
work for bg_key too, but I don't have something solid.

The WIP is in my github repository branch dev/bg-key. It's not on top of
the bg_tree branch for now. The kernel part will be very similar once
the progs side is done.

Feedback welcome.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-18 17:27         ` David Sterba
@ 2019-10-19  0:04           ` Qu Wenruo
  2019-10-21 15:44             ` David Sterba
  0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2019-10-19  0:04 UTC (permalink / raw)
  To: dsterba, Qu WenRuo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 7165 bytes --]



On 2019/10/19 上午1:27, David Sterba wrote:
> On Wed, Oct 16, 2019 at 11:19:54AM +0000, Qu WenRuo wrote:
>>>> The most important aspect to me is, to allow real world user of super
>>>> large fs to try this feature, to prove the usefulness of this design,
>>>> other than my on-paper analyse.
>>>>
>>>> That's why I'm pushing the patchset, even it may not pass any review.
>>>> I just want to hold a up-to-date branch so that when some one needs, it
>>>> can grab and try them themselves.
>>>
>>> Ok that's fine and I can add the branch to for-next for ease of testing.
>>> I'm working on a prototype that does it the bg item key way, it compiles
>>> and creates almost correct filesystem, so I have to fix it before
>>> posting. The patches are on top of your bg-tree feature so we could have
>>> both in the same kernel for testing.
>>
>> That's great!
>>
>> As long as we're pushing a solution to the mount time problem, I can't
>> be more happier!
>>
>> Then I guess no matter which version get merged to upstream, the
>> patchset is already meaningful.
> 
> We'll see what works in the end, I'm getting to the point where the
> prototype almost works and am debugging weird problems or making sure
> it's correct. So I'll dump the ideas here and link to the code so you
> can have a look.

That's wonderful.
Although I guess my patchset should provide the hint of where to modify
the code, since there are only a limited number of places we modify
block group item.

> 
> We agree on the point that the block group items must be packed. The key
> approach should move the new BGI to the beginning, ie. key type is
> smaller than anything that appears in the extent tree. I chose 100 for
> the prototype, it could change.
> 
> To keep changes to minimum, the new BGI uses the same block group item,
> so the only difference then becomes how we search for the items.

If we're introducing new block group item, I hope to do a minor change.

Remove the chunk_objectid member, or even flags. to make it more
compact. So that you can make the BGI subtree even smaller.

But I guess since you don't want to modify the BGI structure, and keep
the code modification minimal, it may not be a good idea right now.

> 
> Packing of the items is done by swapping the key objectid and offset.
> 
> Normal BGI has bg.start == key.objectid and bg.length == key.offset. As
> the objectid is the thing that scatters the items all over the tree.
> 
> So the new BGI has bg.length == key.objectid and bg.start == key.offset.
> As most of block groups are of same size, or from a small set, they're
> packed.

That doesn't look optimized enough.

bg.length can be at 1G, that means if extents starts below 1G can still
be before BGIs.

I believe we should have a fixed objectid for this new BGIs, so that
they are ensured to be at the beginning of extent tree.

> 
> The nice thing is that a lot of code can be shared between BGI and new
> BGI, just needs some care with searches, inserts and search key
> advances.

Exactly, but since we're introducing a new key type, I prefer to perfect
it. Not only change the key, but also the block group item structure to
make it more compact.

Although from the design aspect, it looks like BG tree along with new
BGI would be the best design.

New BG key goes (bg start, NEW BGI TYPE, used) no data. It would provide
the most compact on-disk format.

> 
> I'm now stuck a bit at mkfs, where the 8M block groups are separated by
> some METADATA_ITEMs
> 
>  item 0 key (8388608 BLOCK_GROUP_ITEM_NEW 13631488) itemoff 16259 itemsize 24
>          block group NEW used 0 chunk_objectid 256 flags DATA
>  item 1 key (8388608 BLOCK_GROUP_ITEM_NEW 22020096) itemoff 16235 itemsize 24
>          block group NEW used 16384 chunk_objectid 256 flags SYSTEM|DUP
>  item 2 key (22036480 METADATA_ITEM 0) itemoff 16202 itemsize 33
>          refs 1 gen 5 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root CHUNK_TREE
>  item 3 key (30408704 METADATA_ITEM 0) itemoff 16169 itemsize 33
>          refs 1 gen 4 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root FS_TREE

Exactly the problem I described.


>  item 4 key (30474240 METADATA_ITEM 0) itemoff 16136 itemsize 33
>          refs 1 gen 4 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root CSUM_TREE
>  item 5 key (30490624 METADATA_ITEM 0) itemoff 16103 itemsize 33
>          refs 1 gen 4 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root DATA_RELOC_TREE
>  item 6 key (30507008 METADATA_ITEM 0) itemoff 16070 itemsize 33
>          refs 1 gen 4 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root UUID_TREE
>  item 7 key (30523392 METADATA_ITEM 0) itemoff 16037 itemsize 33
>          refs 1 gen 5 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root EXTENT_TREE
>  item 8 key (30539776 METADATA_ITEM 0) itemoff 16004 itemsize 33
>          refs 1 gen 5 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root DEV_TREE
>  item 9 key (30556160 METADATA_ITEM 0) itemoff 15971 itemsize 33
>          refs 1 gen 5 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root ROOT_TREE
>  item 10 key (107347968 BLOCK_GROUP_ITEM_NEW 30408704) itemoff 15947 itemsize 24
>          block group NEW used 114688 chunk_objectid 256 flags METADATA|DUP
> 
> After item 10, the rest of the block group would appear, and basically
> the rest of the extent tree, many other items.
> 
> I don't want
> to make hardcoded assumptins, that maximum objecit is 1G, but so far was
> not able to come up with a generic and reliable formula how to set up
> key for next search to reach item (107347968 BLOCK_GROUP_ITEM_NEW
> 30408704) once (8388608 BLOCK_GROUP_ITEM_NEW 22020096) has been
> processed.
> 
> The swapped objectid and offset is the hard part for search because we
> lose the linearity of block group start.  Advance objectid by one and
> search again ie. (8388608+1 BGI_NEW 22020096) will land on the next
> metadata item. Iterating objectid by one would eventually reach the 1G
> block group item, but what to do after the last 1G item is found and we
> want do decide wheather to continue or not?
> 
> This would be easy with the bg_tree, because we'd know that all items in
> the tree are just the block group items. Some sort of enumeration could
> work for bg_key too, but I don't have something solid.

Why not fixed objectid for BGI and just ignore the bg.len part?

We have chunk<->BGI verification code, no bg.len is not a problem at
all, we can still make sure chunk<->bg is 1:1 mapped and still verify
the bg.used.

Thanks,
Qu

> 
> The WIP is in my github repository branch dev/bg-key. It's not on top of
> the bg_tree branch for now. The kernel part will be very similar once
> the progs side is done.
> 
> Feedback welcome.
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-19  0:04           ` Qu Wenruo
@ 2019-10-21 15:44             ` David Sterba
  2019-10-22  0:49               ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: David Sterba @ 2019-10-21 15:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu WenRuo, linux-btrfs

On Sat, Oct 19, 2019 at 08:04:51AM +0800, Qu Wenruo wrote:
> That's wonderful.
> Although I guess my patchset should provide the hint of where to modify
> the code, since there are only a limited number of places we modify
> block group item.

I indeed started at your patchset and grepped fro BG_TREE, adding
another branch.

> > We agree on the point that the block group items must be packed. The key
> > approach should move the new BGI to the beginning, ie. key type is
> > smaller than anything that appears in the extent tree. I chose 100 for
> > the prototype, it could change.
> > 
> > To keep changes to minimum, the new BGI uses the same block group item,
> > so the only difference then becomes how we search for the items.
> 
> If we're introducing new block group item, I hope to do a minor change.
> 
> Remove the chunk_objectid member, or even flags. to make it more
> compact. So that you can make the BGI subtree even smaller.

Yeah that can be done.

> But I guess since you don't want to modify the BGI structure, and keep
> the code modification minimal, it may not be a good idea right now.

As long as the changes are bearable, eg. a minor refactoring of block
group access (the cache.key serving a as offset and length) is fine. And
if we can make the b-tree item more then let's do it.

> > Packing of the items is done by swapping the key objectid and offset.
> > 
> > Normal BGI has bg.start == key.objectid and bg.length == key.offset. As
> > the objectid is the thing that scatters the items all over the tree.
> > 
> > So the new BGI has bg.length == key.objectid and bg.start == key.offset.
> > As most of block groups are of same size, or from a small set, they're
> > packed.
> 
> That doesn't look optimized enough.
> 
> bg.length can be at 1G, that means if extents starts below 1G can still
> be before BGIs.

This shold not be a big problem, the items are still grouped togethers.
Mkfs does 8M, we can have 256M or 1G. On average there could be several
packed groups, which I think is fine and the estimated overhead would be
a few more seeks.

> I believe we should have a fixed objectid for this new BGIs, so that
> they are ensured to be at the beginning of extent tree.

That was my idea too, but that also requires to add one more member to
the item to track the length. Currently the key is saves the bytes. With
the proposed changes to drop chunk_objectid, the overall size of BGI
would not increase so this still sounds ok. And all the problems with
searching would go away.

> > The nice thing is that a lot of code can be shared between BGI and new
> > BGI, just needs some care with searches, inserts and search key
> > advances.
> 
> Exactly, but since we're introducing a new key type, I prefer to perfect
> it. Not only change the key, but also the block group item structure to
> make it more compact.
> 
> Although from the design aspect, it looks like BG tree along with new
> BGI would be the best design.
> 
> New BG key goes (bg start, NEW BGI TYPE, used) no data. It would provide
> the most compact on-disk format.

That's very compact. Given the 'bg start' we can't use the same for the
extent tree item.

> > This would be easy with the bg_tree, because we'd know that all items in
> > the tree are just the block group items. Some sort of enumeration could
> > work for bg_key too, but I don't have something solid.
> 
> Why not fixed objectid for BGI and just ignore the bg.len part?

I wanted to avoid storing a useless number, it costs 8 bytes per item,
and the simple swap of objectid/offset was first idea how to avoid it.

> We have chunk<->BGI verification code, no bg.len is not a problem at
> all, we can still make sure chunk<->bg is 1:1 mapped and still verify
> the bg.used.

This is all great, sure, and makes the extensions easy. Let's try to
work out best for each approach we have so far. Having a separate tree
for block groups may open some future options.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-21 15:44             ` David Sterba
@ 2019-10-22  0:49               ` Qu Wenruo
  2019-10-22  6:30                 ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2019-10-22  0:49 UTC (permalink / raw)
  To: dsterba, Qu WenRuo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4389 bytes --]



On 2019/10/21 下午11:44, David Sterba wrote:
> On Sat, Oct 19, 2019 at 08:04:51AM +0800, Qu Wenruo wrote:
>> That's wonderful.
>> Although I guess my patchset should provide the hint of where to modify
>> the code, since there are only a limited number of places we modify
>> block group item.
> 
> I indeed started at your patchset and grepped fro BG_TREE, adding
> another branch.
> 
>>> We agree on the point that the block group items must be packed. The key
>>> approach should move the new BGI to the beginning, ie. key type is
>>> smaller than anything that appears in the extent tree. I chose 100 for
>>> the prototype, it could change.
>>>
>>> To keep changes to minimum, the new BGI uses the same block group item,
>>> so the only difference then becomes how we search for the items.
>>
>> If we're introducing new block group item, I hope to do a minor change.
>>
>> Remove the chunk_objectid member, or even flags. to make it more
>> compact. So that you can make the BGI subtree even smaller.
> 
> Yeah that can be done.
> 
>> But I guess since you don't want to modify the BGI structure, and keep
>> the code modification minimal, it may not be a good idea right now.
> 
> As long as the changes are bearable, eg. a minor refactoring of block
> group access (the cache.key serving a as offset and length) is fine. And
> if we can make the b-tree item more then let's do it.
> 
>>> Packing of the items is done by swapping the key objectid and offset.
>>>
>>> Normal BGI has bg.start == key.objectid and bg.length == key.offset. As
>>> the objectid is the thing that scatters the items all over the tree.
>>>
>>> So the new BGI has bg.length == key.objectid and bg.start == key.offset.
>>> As most of block groups are of same size, or from a small set, they're
>>> packed.
>>
>> That doesn't look optimized enough.
>>
>> bg.length can be at 1G, that means if extents starts below 1G can still
>> be before BGIs.
> 
> This shold not be a big problem, the items are still grouped togethers.
> Mkfs does 8M, we can have 256M or 1G. On average there could be several
> packed groups, which I think is fine and the estimated overhead would be
> a few more seeks.
> 
>> I believe we should have a fixed objectid for this new BGIs, so that
>> they are ensured to be at the beginning of extent tree.
> 
> That was my idea too, but that also requires to add one more member to
> the item to track the length. Currently the key is saves the bytes. With
> the proposed changes to drop chunk_objectid, the overall size of BGI
> would not increase so this still sounds ok. And all the problems with
> searching would go away.
> 
>>> The nice thing is that a lot of code can be shared between BGI and new
>>> BGI, just needs some care with searches, inserts and search key
>>> advances.
>>
>> Exactly, but since we're introducing a new key type, I prefer to perfect
>> it. Not only change the key, but also the block group item structure to
>> make it more compact.
>>
>> Although from the design aspect, it looks like BG tree along with new
>> BGI would be the best design.
>>
>> New BG key goes (bg start, NEW BGI TYPE, used) no data. It would provide
>> the most compact on-disk format.
> 
> That's very compact. Given the 'bg start' we can't use the same for the
> extent tree item.
> 
>>> This would be easy with the bg_tree, because we'd know that all items in
>>> the tree are just the block group items. Some sort of enumeration could
>>> work for bg_key too, but I don't have something solid.
>>
>> Why not fixed objectid for BGI and just ignore the bg.len part?
> 
> I wanted to avoid storing a useless number, it costs 8 bytes per item,
> and the simple swap of objectid/offset was first idea how to avoid it.
> 
>> We have chunk<->BGI verification code, no bg.len is not a problem at
>> all, we can still make sure chunk<->bg is 1:1 mapped and still verify
>> the bg.used.
> 
> This is all great, sure, and makes the extensions easy. Let's try to
> work out best for each approach we have so far. Having a separate tree
> for block groups may open some future options.

Great, I'll explore the most compact key only method with BG_TREE.

And maybe also try the fixed key objectid solution, just dropping the
bg.length, while keep the current BGI structure.

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-22  0:49               ` Qu Wenruo
@ 2019-10-22  6:30                 ` Qu Wenruo
  2019-10-22 12:23                   ` David Sterba
  0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2019-10-22  6:30 UTC (permalink / raw)
  To: dsterba, Qu WenRuo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 5254 bytes --]

BTW, there is one important compatibility problem related to all the BGI
related features.

Although I'm putting the BG_TREE feature as incompatible feature, but in
theory, it should be RO compatible.

As except extent/bg tree, we *should* read the fs without any problem.

But the problem is, current btrfs mount (including btrfs-check) still
need to go through the block group item search, even for permanent RO mount.

This get my rescue mount option patchset to be involved.
If we have such skip_bg feature earlier, we can completely afford to
make all these potential features as RO compatible.


Now my question is,  should we put this feature still as incompatible
feature?

Thanks,
Qu


On 2019/10/22 上午8:49, Qu Wenruo wrote:
> 
> 
> On 2019/10/21 下午11:44, David Sterba wrote:
>> On Sat, Oct 19, 2019 at 08:04:51AM +0800, Qu Wenruo wrote:
>>> That's wonderful.
>>> Although I guess my patchset should provide the hint of where to modify
>>> the code, since there are only a limited number of places we modify
>>> block group item.
>>
>> I indeed started at your patchset and grepped fro BG_TREE, adding
>> another branch.
>>
>>>> We agree on the point that the block group items must be packed. The key
>>>> approach should move the new BGI to the beginning, ie. key type is
>>>> smaller than anything that appears in the extent tree. I chose 100 for
>>>> the prototype, it could change.
>>>>
>>>> To keep changes to minimum, the new BGI uses the same block group item,
>>>> so the only difference then becomes how we search for the items.
>>>
>>> If we're introducing new block group item, I hope to do a minor change.
>>>
>>> Remove the chunk_objectid member, or even flags. to make it more
>>> compact. So that you can make the BGI subtree even smaller.
>>
>> Yeah that can be done.
>>
>>> But I guess since you don't want to modify the BGI structure, and keep
>>> the code modification minimal, it may not be a good idea right now.
>>
>> As long as the changes are bearable, eg. a minor refactoring of block
>> group access (the cache.key serving a as offset and length) is fine. And
>> if we can make the b-tree item more then let's do it.
>>
>>>> Packing of the items is done by swapping the key objectid and offset.
>>>>
>>>> Normal BGI has bg.start == key.objectid and bg.length == key.offset. As
>>>> the objectid is the thing that scatters the items all over the tree.
>>>>
>>>> So the new BGI has bg.length == key.objectid and bg.start == key.offset.
>>>> As most of block groups are of same size, or from a small set, they're
>>>> packed.
>>>
>>> That doesn't look optimized enough.
>>>
>>> bg.length can be at 1G, that means if extents starts below 1G can still
>>> be before BGIs.
>>
>> This shold not be a big problem, the items are still grouped togethers.
>> Mkfs does 8M, we can have 256M or 1G. On average there could be several
>> packed groups, which I think is fine and the estimated overhead would be
>> a few more seeks.
>>
>>> I believe we should have a fixed objectid for this new BGIs, so that
>>> they are ensured to be at the beginning of extent tree.
>>
>> That was my idea too, but that also requires to add one more member to
>> the item to track the length. Currently the key is saves the bytes. With
>> the proposed changes to drop chunk_objectid, the overall size of BGI
>> would not increase so this still sounds ok. And all the problems with
>> searching would go away.
>>
>>>> The nice thing is that a lot of code can be shared between BGI and new
>>>> BGI, just needs some care with searches, inserts and search key
>>>> advances.
>>>
>>> Exactly, but since we're introducing a new key type, I prefer to perfect
>>> it. Not only change the key, but also the block group item structure to
>>> make it more compact.
>>>
>>> Although from the design aspect, it looks like BG tree along with new
>>> BGI would be the best design.
>>>
>>> New BG key goes (bg start, NEW BGI TYPE, used) no data. It would provide
>>> the most compact on-disk format.
>>
>> That's very compact. Given the 'bg start' we can't use the same for the
>> extent tree item.
>>
>>>> This would be easy with the bg_tree, because we'd know that all items in
>>>> the tree are just the block group items. Some sort of enumeration could
>>>> work for bg_key too, but I don't have something solid.
>>>
>>> Why not fixed objectid for BGI and just ignore the bg.len part?
>>
>> I wanted to avoid storing a useless number, it costs 8 bytes per item,
>> and the simple swap of objectid/offset was first idea how to avoid it.
>>
>>> We have chunk<->BGI verification code, no bg.len is not a problem at
>>> all, we can still make sure chunk<->bg is 1:1 mapped and still verify
>>> the bg.used.
>>
>> This is all great, sure, and makes the extensions easy. Let's try to
>> work out best for each approach we have so far. Having a separate tree
>> for block groups may open some future options.
> 
> Great, I'll explore the most compact key only method with BG_TREE.
> 
> And maybe also try the fixed key objectid solution, just dropping the
> bg.length, while keep the current BGI structure.
> 
> Thanks,
> Qu
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-22  6:30                 ` Qu Wenruo
@ 2019-10-22 12:23                   ` David Sterba
  2019-10-22 12:27                     ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: David Sterba @ 2019-10-22 12:23 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu WenRuo, linux-btrfs

On Tue, Oct 22, 2019 at 02:30:22PM +0800, Qu Wenruo wrote:
> BTW, there is one important compatibility problem related to all the BGI
> related features.
> 
> Although I'm putting the BG_TREE feature as incompatible feature, but in
> theory, it should be RO compatible.

It could be RO compatible yes.

> As except extent/bg tree, we *should* read the fs without any problem.
> 
> But the problem is, current btrfs mount (including btrfs-check) still
> need to go through the block group item search, even for permanent RO mount.
> 
> This get my rescue mount option patchset to be involved.
> If we have such skip_bg feature earlier, we can completely afford to
> make all these potential features as RO compatible.
> 
> 
> Now my question is,  should we put this feature still as incompatible
> feature?

In some way it would probably have to be incompat, either full or RO. As
unknown tree items are ignored, if the rest of the filesystem provides
enough information to access the data, then incompat RO sounds like the
best option. And that's probably independent of how exactly the new BGI
is done.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature
  2019-10-22 12:23                   ` David Sterba
@ 2019-10-22 12:27                     ` Qu Wenruo
  0 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2019-10-22 12:27 UTC (permalink / raw)
  To: dsterba, Qu WenRuo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1709 bytes --]



On 2019/10/22 下午8:23, David Sterba wrote:
> On Tue, Oct 22, 2019 at 02:30:22PM +0800, Qu Wenruo wrote:
>> BTW, there is one important compatibility problem related to all the BGI
>> related features.
>>
>> Although I'm putting the BG_TREE feature as incompatible feature, but in
>> theory, it should be RO compatible.
> 
> It could be RO compatible yes.
> 
>> As except extent/bg tree, we *should* read the fs without any problem.
>>
>> But the problem is, current btrfs mount (including btrfs-check) still
>> need to go through the block group item search, even for permanent RO mount.
>>
>> This get my rescue mount option patchset to be involved.
>> If we have such skip_bg feature earlier, we can completely afford to
>> make all these potential features as RO compatible.
>>
>>
>> Now my question is,  should we put this feature still as incompatible
>> feature?
> 
> In some way it would probably have to be incompat, either full or RO. As
> unknown tree items are ignored, if the rest of the filesystem provides
> enough information to access the data, then incompat RO sounds like the
> best option. And that's probably independent of how exactly the new BGI
> is done.
> 
But the problem is, older fs can't mount it at all, no matter RO or RW.

It's now completely dependent on how older kernel handles missing block
group items.

If current kernel has something like skip_bg or fall back to RO +
skip_bg by default, then it's really RO compatible.
But we all know how picky current btrfs is about missing block group
items...

So, I guess we will have a RO compatible feature that can't really be
mounted RO by older kernel at last?

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-10-22 12:28 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-08  4:49 [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature Qu Wenruo
2019-10-08  4:49 ` [PATCH v2 1/7] btrfs-progs: Refactor excluded extent functions to use fs_info Qu Wenruo
2019-10-08  9:22   ` Johannes Thumshirn
2019-10-17  2:16   ` Anand Jain
2019-10-08  4:49 ` [PATCH v2 2/7] btrfs-progs: Refactor btrfs_read_block_groups() Qu Wenruo
2019-10-17  3:23   ` Anand Jain
2019-10-17  4:33     ` Qu Wenruo
2019-10-17  5:08       ` Anand Jain
2019-10-08  4:49 ` [PATCH v2 3/7] btrfs-progs: Enable read-write ability for 'bg_tree' feature Qu Wenruo
2019-10-17  4:56   ` Anand Jain
2019-10-08  4:49 ` [PATCH v2 4/7] btrfs-progs: mkfs: Introduce -O bg-tree Qu Wenruo
2019-10-08  8:16   ` [PATCH v2.1 " Qu Wenruo
2019-10-08  4:49 ` [PATCH v2 5/7] btrfs-progs: dump-tree/dump-super: Introduce support for bg tree Qu Wenruo
2019-10-08  4:49 ` [PATCH v2 6/7] btrfs-progs: check: Introduce support for bg-tree feature Qu Wenruo
2019-10-08  4:49 ` [PATCH v2 7/7] btrfs-progs: btrfstune: Allow to enable bg-tree feature offline Qu Wenruo
2019-10-17  4:17   ` Anand Jain
2019-10-17  4:28     ` Qu Wenruo
2019-10-14 15:17 ` [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature David Sterba
2019-10-15  0:32   ` Qu Wenruo
2019-10-16 11:16     ` David Sterba
2019-10-16 11:19       ` Qu WenRuo
2019-10-18 17:27         ` David Sterba
2019-10-19  0:04           ` Qu Wenruo
2019-10-21 15:44             ` David Sterba
2019-10-22  0:49               ` Qu Wenruo
2019-10-22  6:30                 ` Qu Wenruo
2019-10-22 12:23                   ` David Sterba
2019-10-22 12:27                     ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).