* [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata
@ 2018-09-11  5:38 Qu Wenruo
  2018-09-11  5:38 ` [PATCH v3 1/7] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted Qu Wenruo
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Qu Wenruo @ 2018-09-11  5:38 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from github:
https://github.com/adam900710/linux/tree/qgroup_balance_skip_trees
The base commit is v4.19-rc1 tag.

There are a lot of reports of system hangs during balance on quota enabled
fs.
It's most obvious for large fs.

The hang is caused by tons of unmodified extents marked as qgroup dirty.
Such unmodified/unrelated sources include:
1) Unmodified subtree
2) Subtree drop for reloc tree
(BTW, other sources include unmodified file extent items)

E.g.
OO = Old tree blocks from file tree
NN = New tree blocks from reloc tree

        file tree                              reloc tree
           OO (a)                                  NN (a)
          /  \                                    /  \
    (b) OO    OO (c)                        (b) NN    NN (c)
       / \   / \                               / \   / \
     OO  OO OO  OO                           OO  OO OO  NN
    (d) (e) (f) (g)                         (d) (e) (f) (g)

In the above case, balance will modify the nodeptr in OO(a) to point to
NN(b) and NN(c), and modify NN(a) to point to OO(b) and OO(c).

Before this patch, quota will mark the whole subtree from its parent
down to the leaves as dirty.
So btrfs quota needs to trace all tree blocks from (a) to (g).

However tree blocks (d) (e) (f) are shared between both trees, thus
there is no need to trace those 3 tree blocks.

This patchset will change how this works by only tracing modified tree
blocks in the reloc tree, and their counterparts in the file tree.

With this patchset, we can skip tree blocks OO(d)~OO(f) in the above
example, thus reducing some of the overhead caused by qgroup.

The improvement is mostly related to metadata relocation.

Also, for metadata relocation, we don't really need to trace data
extents as they're not modified at all.
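
To illustrate the idea, below is a self-contained toy model (plain C,
compilable on its own; all names are invented for illustration, this is
not the kernel implementation). It walks the reloc tree from the example
above and only "traces" blocks whose generation is not older than the
last snapshot, skipping the shared OO(d)~OO(f):

#include <stdio.h>

/* Toy tree block: a generation plus up to two children (NULL = none) */
struct toy_block {
	const char *name;
	unsigned long long gen;
	struct toy_block *child[2];
};

/* Depth-first search, skipping blocks older than @last_snapshot */
static void trace_new_blocks(struct toy_block *b, unsigned long long last_snapshot)
{
	int i;

	if (!b || b->gen < last_snapshot)
		return;	/* old block, shared with the file tree: skip */

	printf("qgroup trace %s (gen %llu)\n", b->name, b->gen);
	for (i = 0; i < 2; i++)
		trace_new_blocks(b->child[i], last_snapshot);
}

int main(void)
{
	/* The reloc tree from the diagram above, last_snapshot == 10 */
	struct toy_block d = { "OO(d)", 5, { NULL, NULL } };
	struct toy_block e = { "OO(e)", 5, { NULL, NULL } };
	struct toy_block f = { "OO(f)", 5, { NULL, NULL } };
	struct toy_block g = { "NN(g)", 10, { NULL, NULL } };
	struct toy_block b = { "NN(b)", 10, { &d, &e } };
	struct toy_block c = { "NN(c)", 10, { &f, &g } };
	struct toy_block a = { "NN(a)", 10, { &b, &c } };

	trace_new_blocks(&a, 10);
	return 0;
}

This prints NN(a), NN(b), NN(c) and NN(g) only; in the real code each
traced reloc tree block is paired with its file tree counterpart.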


[[Benchmark]]
Hardware:
	VM 4G vRAM, 8 vCPUs,
	disk is using 'unsafe' cache mode,
	backing device is SAMSUNG 850 evo SSD.
	Host has 16G ram.

Mkfs parameter:
	--nodesize 4K (To bump up tree size)

Initial subvolume contents:
	4G data copied from /usr and /lib.
	(With enough regular small files)

Snapshots:
	16 snapshots of the original subvolume.
	each snapshot has 3 random files modified.

balance parameter:
	-m

So the content should be pretty similar to a real world root fs layout.

                     | v4.19-rc1    | w/ patchset    | diff (*)
---------------------------------------------------------------
relocated extents    | 22874        | 22856          | -0.1%
qgroup dirty extents | 225251       | 140431         | -37.7%
time (sys)           | 40.161s      | 20.574s        | -48.7%
time (real)          | 42.163s      | 25.173s        | -40.3%

*: (val_new - val_old) / val_old * 100%
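E.g. for time (real): (25.173 - 42.163) / 42.163 * 100% = -40.3%.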

And the difference is becoming more and more obvious if more snapshots
are created.

If we increase the number of snapshots to 64 (4 times the number of
references; using 64 snapshots with quota enabled is already not
recommended)

                     | v4.19-rc1    | w/ patchset    | diff (*)
---------------------------------------------------------------
relocated extents    | 22462        | 22467          | +0.0%
qgroup dirty extents | 314380       | 140826         | -55.2%
time (sys)           | 158.033s     | 74.292s        | -53.0%
time (real)           | 197.395s     | 90.529s        | -54.1%

For larger fs the saving should be even more obvious.

Changelog:
v2:
  Rename "tree reloc tree" to "reloc tree".
  Add patch "Don't trace subtree if we're dropping reloc tree" into the
  patchset.
  Fix wrong btrfs_bin_search() call, which leads to unexpected ENOENT
  error for btrfs_qgroup_trace_extent_swap(). Now use dst_path->slots[]
  directly.
  
v3:
  Add new patch to avoid unnecessary data extents trace for metadata
  relocation.
  Better delayed ref time root owner detection to avoid unnecessary tree
  block tracing.
  Add benchmark result for the patchset.

Qu Wenruo (7):
  btrfs: qgroup: Introduce trace event to analyse the number of dirty
    extents accounted
  btrfs: qgroup: Introduce function to trace two swapped extents
  btrfs: qgroup: Introduce function to find all new tree blocks of reloc
    tree
  btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
  btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
  btrfs: delayed-ref: Introduce new parameter for
    btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing
  btrfs: qgroup: Only trace data extents in leaves if we're relocating
    data block group

 fs/btrfs/delayed-ref.c       |   5 +-
 fs/btrfs/delayed-ref.h       |   1 +
 fs/btrfs/extent-tree.c       |  25 ++-
 fs/btrfs/qgroup.c            | 347 +++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.h            |  11 ++
 fs/btrfs/relocation.c        |  15 +-
 include/trace/events/btrfs.h |  21 +++
 7 files changed, 405 insertions(+), 20 deletions(-)

-- 
2.18.0


* [PATCH v3 1/7] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted
  2018-09-11  5:38 [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata Qu Wenruo
@ 2018-09-11  5:38 ` Qu Wenruo
  2018-09-26 14:06   ` David Sterba
  2018-09-11  5:38 ` [PATCH v3 2/7] btrfs: qgroup: Introduce function to trace two swapped extents Qu Wenruo
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2018-09-11  5:38 UTC (permalink / raw)
  To: linux-btrfs

Number of qgroup dirty extents is directly linked to the performance
overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
record how many dirty extents are processed in
btrfs_qgroup_account_extents().

This will be pretty handy for analysing the balance performance
improvement later.
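
Once this event is compiled in, it can be enabled like any other btrfs
tracepoint, e.g. by writing 1 to
/sys/kernel/tracing/events/btrfs/qgroup_num_dirty_extents/enable
(assuming tracefs is mounted at /sys/kernel/tracing); each call to
btrfs_qgroup_account_extents() then logs one line with the transid and
the number of dirty extents processed.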

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c            |  4 ++++
 include/trace/events/btrfs.h | 21 +++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4353bb69bb86..5977eedc00d8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2133,6 +2133,7 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct ulist *new_roots = NULL;
 	struct rb_node *node;
+	u64 num_dirty_extents = 0;
 	u64 qgroup_to_skip;
 	int ret = 0;
 
@@ -2142,6 +2143,7 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 		record = rb_entry(node, struct btrfs_qgroup_extent_record,
 				  node);
 
+		num_dirty_extents++;
 		trace_btrfs_qgroup_account_extents(fs_info, record);
 
 		if (!ret) {
@@ -2187,6 +2189,8 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 		kfree(record);
 
 	}
+	trace_qgroup_num_dirty_extents(fs_info, trans->transid,
+				       num_dirty_extents);
 	return ret;
 }
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b401c4e36394..79253666d1d0 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1575,6 +1575,27 @@ DEFINE_EVENT(btrfs_qgroup_extent, btrfs_qgroup_trace_extent,
 	TP_ARGS(fs_info, rec)
 );
 
+TRACE_EVENT(qgroup_num_dirty_extents,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 transid,
+		 u64 num_dirty_extents),
+
+	TP_ARGS(fs_info, transid, num_dirty_extents),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64, transid			)
+		__field(	u64, num_dirty_extents		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->transid	   = transid;
+		__entry->num_dirty_extents = num_dirty_extents;
+	),
+
+	TP_printk_btrfs("transid=%llu num_dirty_extents=%llu",
+		__entry->transid, __entry->num_dirty_extents)
+);
+
 TRACE_EVENT(btrfs_qgroup_account_extent,
 
 	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 transid, u64 bytenr,
-- 
2.18.0


* [PATCH v3 2/7] btrfs: qgroup: Introduce function to trace two swapped extents
  2018-09-11  5:38 [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata Qu Wenruo
  2018-09-11  5:38 ` [PATCH v3 1/7] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted Qu Wenruo
@ 2018-09-11  5:38 ` Qu Wenruo
  2018-09-11  5:38 ` [PATCH v3 3/7] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree Qu Wenruo
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2018-09-11  5:38 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, qgroup_trace_extent_swap(), which will be used
later for balance qgroup speedup.

The basic idea of balance is swapping tree blocks between the reloc
tree and the real file tree.

The swap will happen at the highest tree block, but there may be a lot
of tree blocks involved.

For example:
 OO = Old tree blocks
 NN = New tree blocks allocated during balance

          File tree (257)                  Reloc tree for 257
L2              OO                                NN
              /    \                            /    \
L1          OO      OO (a)                    OO      NN (a)
           / \     / \                       / \     / \
L0       OO   OO OO   OO                   OO   OO NN   NN
                 (b)  (c)                          (b)  (c)

When calling qgroup_trace_extent_swap(), we will pass:
@src_eb = OO(a)
@dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
@dst_level = 0
@root_level = 1

In that case, qgroup_trace_extent_swap() will search from OO(a) to
reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.

The main work of qgroup_trace_extent_swap() can be split into 3 parts:

1) Tree search from @src_eb
   It acts as a simplified btrfs_search_slot().
   The key for search can be extracted from @dst_path->nodes[dst_level]
   (first key).

2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
   NOTE: In the above case, OO(a) and NN(a) won't be marked qgroup dirty.
   They should have been marked during the previous (@dst_level = 1)
   iteration.

3) Mark file extents in leaves dirty
   We don't have a good way to pick out new file extents only,
   so we still follow the old method of scanning all file extents in
   the leaf.

This function frees us from keeping two paths, thus later we only need
to care about how to iterate all new tree blocks in the reloc tree.
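
Note: if at any level the keys in @src_path and @dst_path do not match,
the two trees have diverged in an unexpected way, and the function bails
out with -ENOENT (the "Content mismatch" check in the code below).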

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 5977eedc00d8..5155fb42ce79 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1713,6 +1713,132 @@ static int adjust_slots_upwards(struct btrfs_path *path, int root_level)
 	return 0;
 }
 
+/*
+ * helper function to trace a subtree tree block swap.
+ *
+ * The swapped tree block is marked by @dst_path->nodes[level].
+ * Thus @dst_path should have been populated already.
+ *
+ * While for the src tree block, we only need the root eb as @src_eb.
+ * The needed path and search will be done by this function.
+ */
+static int qgroup_trace_extent_swap(struct btrfs_trans_handle* trans,
+				    struct extent_buffer *src_eb,
+				    struct btrfs_path *dst_path,
+				    int dst_level, int root_level)
+{
+	struct btrfs_key key;
+	struct btrfs_path *src_path;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	u32 nodesize = fs_info->nodesize;
+	int cur_level = root_level;
+	int ret;
+
+	BUG_ON(dst_level > root_level);
+	/* Level mismatch */
+	if (btrfs_header_level(src_eb) != root_level)
+		return -EINVAL;
+
+	src_path = btrfs_alloc_path();
+	if (!src_path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (dst_level)
+		btrfs_node_key_to_cpu(dst_path->nodes[dst_level], &key, 0);
+	else
+		btrfs_item_key_to_cpu(dst_path->nodes[dst_level], &key, 0);
+
+	/* For src_path */
+	extent_buffer_get(src_eb);
+	src_path->nodes[root_level] = src_eb;
+	src_path->slots[root_level] = dst_path->slots[root_level];
+	src_path->locks[root_level] = 0;
+
+	/* A simplified version of btrfs_search_slot() */
+	while (cur_level >= dst_level) {
+		struct btrfs_key src_key;
+		struct btrfs_key dst_key;
+
+		if (src_path->nodes[cur_level] == NULL) {
+			struct btrfs_key first_key;
+			struct extent_buffer *eb;
+			int parent_slot;
+			u64 child_gen;
+			u64 child_bytenr;
+
+			eb = src_path->nodes[cur_level + 1];
+			parent_slot = src_path->slots[cur_level + 1];
+			child_bytenr = btrfs_node_blockptr(eb, parent_slot);
+			child_gen = btrfs_node_ptr_generation(eb, parent_slot);
+			btrfs_node_key_to_cpu(eb, &first_key, parent_slot);
+
+			eb = read_tree_block(fs_info, child_bytenr, child_gen,
+					     cur_level, &first_key);
+			if (IS_ERR(eb)) {
+				ret = PTR_ERR(eb);
+				goto out;
+			} else if (!extent_buffer_uptodate(eb)) {
+				free_extent_buffer(eb);
+				ret = -EIO;
+				goto out;
+			}
+
+			src_path->nodes[cur_level] = eb;
+
+			btrfs_tree_read_lock(eb);
+			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+			src_path->locks[cur_level] = BTRFS_READ_LOCK_BLOCKING;
+		}
+
+		src_path->slots[cur_level] = dst_path->slots[cur_level];
+		if (cur_level) {
+			btrfs_node_key_to_cpu(dst_path->nodes[cur_level], &dst_key,
+					      dst_path->slots[cur_level]);
+			btrfs_node_key_to_cpu(src_path->nodes[cur_level], &src_key,
+					      src_path->slots[cur_level]);
+		} else {
+			btrfs_item_key_to_cpu(dst_path->nodes[cur_level], &dst_key,
+					      dst_path->slots[cur_level]);
+			btrfs_item_key_to_cpu(src_path->nodes[cur_level], &src_key,
+					      src_path->slots[cur_level]);
+		}
+		/* Content mismatch, something went wrong */
+		if (btrfs_comp_cpu_keys(&dst_key, &src_key)) {
+			ret = -ENOENT;
+			goto out;
+		}
+		cur_level--;
+	}
+
+	/*
+	 * Now both @dst_path and @src_path have been populated, record the tree
+	 * blocks for qgroup accounting.
+	 */
+	ret = btrfs_qgroup_trace_extent(trans,
+			src_path->nodes[dst_level]->start,
+			nodesize, GFP_NOFS);
+	if (ret < 0)
+		goto out;
+	ret = btrfs_qgroup_trace_extent(trans,
+			dst_path->nodes[dst_level]->start,
+			nodesize, GFP_NOFS);
+	if (ret < 0)
+		goto out;
+
+	/* Record leaf file extents */
+	if (dst_level == 0) {
+		ret = btrfs_qgroup_trace_leaf_items(trans, src_path->nodes[0]);
+		if (ret < 0)
+			goto out;
+		ret = btrfs_qgroup_trace_leaf_items(trans, dst_path->nodes[0]);
+	}
+out:
+	btrfs_free_path(src_path);
+	return ret;
+}
+
 int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			       struct extent_buffer *root_eb,
 			       u64 root_gen, int root_level)
-- 
2.18.0


* [PATCH v3 3/7] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree
  2018-09-11  5:38 [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata Qu Wenruo
  2018-09-11  5:38 ` [PATCH v3 1/7] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted Qu Wenruo
  2018-09-11  5:38 ` [PATCH v3 2/7] btrfs: qgroup: Introduce function to trace two swapped extents Qu Wenruo
@ 2018-09-11  5:38 ` Qu Wenruo
  2018-09-26 14:29   ` David Sterba
  2018-09-11  5:38 ` [PATCH v3 4/7] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents Qu Wenruo
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2018-09-11  5:38 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, qgroup_trace_new_subtree_blocks(), to iterate
all new tree blocks in a reloc tree, so that qgroup can skip unrelated
tree blocks during balance, which should hugely speed up balance when
quota is enabled.

The function qgroup_trace_new_subtree_blocks() itself only cares about
new tree blocks in reloc tree.

Its main tasks are:

1) Read out tree blocks according to parent pointers

2) Do a recursive depth-first search
   Call the same function on all its child tree blocks, with the search
   level set to the current level - 1.
   Also skip all children whose generation is smaller than
   @last_snapshot.

3) Call qgroup_trace_extent_swap() to trace tree blocks

So although the parameter list includes arguments related to the source
file tree, they are not used directly, only passed to
qgroup_trace_extent_swap().
Thus, apart from the tree block read code, the core is pretty short and
all about the recursive depth-first search.
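
As a worked example on the cover letter's trees: called on NN(b), the
search traces NN(b) and its counterpart OO(b) and then skips the old
leaves OO(d) and OO(e); called on NN(c), it traces NN(c)/OO(c), skips
OO(f), and traces NN(g) together with its counterpart OO(g).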

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 5155fb42ce79..0702953d70a7 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1839,6 +1839,120 @@ static int qgroup_trace_extent_swap(struct btrfs_trans_handle* trans,
 	return ret;
 }
 
+/*
+ * Helper function to do a recursive, generation aware depth-first search,
+ * to locate all new tree blocks in a subtree of a reloc tree.
+ *
+ * E.g. (OO = Old tree blocks, NN = New tree blocks, whose gen >= last_snapshot)
+ *       Reloc tree
+ * L2         NN (a)
+ *          /    \
+ * L1    OO        NN (b)
+ *      /  \      /  \
+ * L0  OO  OO    OO  NN
+ *               (c) (d)
+ * If we pass:
+ * @dst_path = [ nodes[1] = NN(b), nodes[0] = NULL ],
+ * @cur_level = 1
+ * @root_level = 1
+ *
+ * We will iterate through tree blocks NN(b), NN(d) and inform qgroup to
+ * trace the above tree blocks along with their counterparts in the file
+ * tree. During the search, the old tree block OO(c) will be skipped as
+ * the tree block swap won't affect OO(c).
+ */
+static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
+					   struct extent_buffer *src_eb,
+					   struct btrfs_path *dst_path,
+					   int cur_level, int root_level,
+					   u64 last_snapshot)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct extent_buffer *eb;
+	bool need_cleanup = false;
+	int ret = 0;
+	int i;
+
+	/* Read the tree block if needed */
+	if (dst_path->nodes[cur_level] == NULL) {
+		struct btrfs_key first_key;
+		int parent_slot;
+		u64 child_gen;
+		u64 child_bytenr;
+
+		/*
+		 * We need to get child blockptr/gen from parent before
+		 * we can read it.
+		 */
+		eb = dst_path->nodes[cur_level + 1];
+		parent_slot = dst_path->slots[cur_level + 1];
+		child_bytenr = btrfs_node_blockptr(eb, parent_slot);
+		child_gen = btrfs_node_ptr_generation(eb, parent_slot);
+		btrfs_node_key_to_cpu(eb, &first_key, parent_slot);
+
+		/* This node is old, no need to trace */
+		if (child_gen < last_snapshot)
+			goto out;
+
+		eb = read_tree_block(fs_info, child_bytenr, child_gen,
+				     cur_level, &first_key);
+		if (IS_ERR(eb)) {
+			ret = PTR_ERR(eb);
+			goto out;
+		} else if (!extent_buffer_uptodate(eb)) {
+			free_extent_buffer(eb);
+			ret = -EIO;
+			goto out;
+		}
+
+		dst_path->nodes[cur_level] = eb;
+		dst_path->slots[cur_level] = 0;
+
+		btrfs_tree_read_lock(eb);
+		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+		dst_path->locks[cur_level] = BTRFS_READ_LOCK_BLOCKING;
+		need_cleanup = true;
+	}
+
+	/* Now record this tree block and its counter part for qgroups */
+	ret = qgroup_trace_extent_swap(trans, src_eb, dst_path, cur_level,
+				       root_level);
+	if (ret < 0)
+		goto cleanup;
+
+	eb = dst_path->nodes[cur_level];
+
+	if (cur_level > 0) {
+		/* Iterate all children tree blocks */
+		for (i = 0; i < btrfs_header_nritems(eb); i++) {
+			/* Skip old tree blocks as they won't be swapped */
+			if (btrfs_node_ptr_generation(eb, i) < last_snapshot)
+				continue;
+			dst_path->slots[cur_level] = i;
+
+			/* Recursive call (at most 7 times) */
+			ret = qgroup_trace_new_subtree_blocks(trans, src_eb,
+					dst_path, cur_level - 1, root_level,
+					last_snapshot);
+			if (ret < 0)
+				goto cleanup;
+		}
+	}
+
+cleanup:
+	if (need_cleanup) {
+		/* Clean up */
+		btrfs_tree_unlock_rw(dst_path->nodes[cur_level],
+				     dst_path->locks[cur_level]);
+		free_extent_buffer(dst_path->nodes[cur_level]);
+		dst_path->nodes[cur_level] = NULL;
+		dst_path->slots[cur_level] = 0;
+		dst_path->locks[cur_level] = 0;
+	}
+out:
+	return ret;
+}
+
 int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			       struct extent_buffer *root_eb,
 			       u64 root_gen, int root_level)
-- 
2.18.0


* [PATCH v3 4/7] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
  2018-09-11  5:38 [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata Qu Wenruo
                   ` (2 preceding siblings ...)
  2018-09-11  5:38 ` [PATCH v3 3/7] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree Qu Wenruo
@ 2018-09-11  5:38 ` Qu Wenruo
  2018-09-26 14:35   ` David Sterba
  2018-09-11  5:38 ` [PATCH v3 5/7] btrfs: qgroup: Don't trace subtree if we're dropping reloc tree Qu Wenruo
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2018-09-11  5:38 UTC (permalink / raw)
  To: linux-btrfs

Before this patch, for a balance with quota enabled, btrfs needs to
mark the whole subtree dirty for quota.

E.g.
OO = Old tree blocks (from file tree)
NN = New tree blocks (from reloc tree)

        File tree (src)		          Reloc tree (dst)
            OO (a)                              NN (a)
           /  \                                /  \
     (b) OO    OO (c)                    (b) NN    NN (c)
        /  \  /  \                          /  \  /  \
       OO  OO OO OO (d)                    OO  OO OO NN (d)

For the old balance + quota case, quota will mark the whole src and dst
trees dirty, including all 3 old tree blocks in the reloc tree.

This is doable for a small file tree, or when the new tree blocks are
all located at a lower level.

But for a large file tree, or when the new tree blocks are located at a
higher level, this leads to marking the whole tree dirty, and is
unbelievably slow.

This patch changes how we handle such a quota-enabled balance.

Now we will search from (b) and (c) for any new tree blocks whose
generation is not smaller than @last_snapshot, and only mark them dirty.

In the above case, we only need to trace tree blocks NN(b), NN(c) and
NN(d), and also their counterparts OO(b), OO(c) and OO(d). (NN(a) and
OO(a) will be traced when CoW happens for the nodeptr modification.)

For the above case we only skip 3 tree blocks, but for larger trees we
can skip tons of unmodified tree blocks, and hugely speed up balance.

This patch introduces a new function,
btrfs_qgroup_trace_subtree_swap(), which does the following main work:

1) Read out the real root eb
   And set up a basic dst_path for the later calls
2) Call qgroup_trace_new_subtree_blocks()
   To trace all new tree blocks in the reloc tree and their
   counterparts in the file tree (see the note below).
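
Note that @dst_path is only seeded with the reloc tree root eb at
@level here; qgroup_trace_new_subtree_blocks() (patch 3) reads and
locks the lower levels of the path itself as it descends.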

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c     | 94 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.h     | 10 +++++
 fs/btrfs/relocation.c | 11 ++---
 3 files changed, 107 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 0702953d70a7..a94027b2620e 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1953,6 +1953,100 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
 	return ret;
 }
 
+/*
+ * For balance subtree swap.
+ *
+ * Will go down the tree block pointed by @dst_eb (pointed by @dst_parent and
+ * @dst_slot), and find any tree blocks whose generation is not smaller than
+ * @last_snapshot, then go down @src_eb (pointed by @src_parent and @src_slot)
+ * to find the counterpart of the tree block, then mark both as qgroup dirty,
+ * and skip all tree blocks whose generation is smaller than last_snapshot.
+ *
+ * This would skip tons of tree blocks of original btrfs_qgroup_trace_subtree(),
+ * which is the culprit of super slow balance if the file tree is large.
+ *
+ * @src_parent, @src_slot: pointer to src (file tree) eb.
+ * @dst_parent, @dst_slot: pointer to dst (reloc tree) eb.
+ */
+int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
+				struct extent_buffer *src_parent, int src_slot,
+				struct extent_buffer *dst_parent, int dst_slot,
+				u64 last_snapshot)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_path *dst_path = NULL;
+	struct btrfs_key first_key;
+	struct extent_buffer *src_eb = NULL;
+	struct extent_buffer *dst_eb = NULL;
+	u64 child_gen;
+	u64 child_bytenr;
+	int level;
+	int ret;
+
+	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+		return 0;
+
+	/* Wrong parameter order */
+	BUG_ON(btrfs_node_ptr_generation(src_parent, src_slot) >
+		btrfs_node_ptr_generation(dst_parent, dst_slot));
+
+	/* Read out real @src_eb, pointed by @src_parent and @src_slot */
+	child_bytenr = btrfs_node_blockptr(src_parent, src_slot);
+	child_gen = btrfs_node_ptr_generation(src_parent, src_slot);
+	btrfs_node_key_to_cpu(src_parent, &first_key, src_slot);
+
+	src_eb = read_tree_block(fs_info, child_bytenr, child_gen,
+			btrfs_header_level(src_parent) - 1, &first_key);
+	if (IS_ERR(src_eb)) {
+		ret = PTR_ERR(src_eb);
+		goto out;
+	}
+
+	/* Read out real @dst_eb, pointed by @dst_parent and @dst_slot */
+	child_bytenr = btrfs_node_blockptr(dst_parent, dst_slot);
+	child_gen = btrfs_node_ptr_generation(dst_parent, dst_slot);
+	btrfs_node_key_to_cpu(dst_parent, &first_key, dst_slot);
+
+	dst_eb = read_tree_block(fs_info, child_bytenr, child_gen,
+			btrfs_header_level(dst_parent) - 1, &first_key);
+	if (IS_ERR(dst_eb)) {
+		ret = PTR_ERR(dst_eb);
+		goto out;
+	}
+
+	if (!extent_buffer_uptodate(src_eb) || !extent_buffer_uptodate(dst_eb)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	level = btrfs_header_level(dst_eb);
+	dst_path = btrfs_alloc_path();
+	if (!dst_path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* For dst_path */
+	extent_buffer_get(dst_eb);
+	dst_path->nodes[level] = dst_eb;
+	dst_path->slots[level] = 0;
+	dst_path->locks[level] = 0;
+
+	/* Do the generation aware depth-first search */
+	ret = qgroup_trace_new_subtree_blocks(trans, src_eb, dst_path, level,
+					      level, last_snapshot);
+	if (ret < 0)
+		goto out;
+	ret = 0;
+
+out:
+	free_extent_buffer(src_eb);
+	free_extent_buffer(dst_eb);
+	btrfs_free_path(dst_path);
+	if (ret < 0)
+		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
+	return ret;
+}
+
 int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			       struct extent_buffer *root_eb,
 			       u64 root_gen, int root_level)
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 54b8bb282c0e..9f9943dfd493 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -236,6 +236,16 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
 int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			       struct extent_buffer *root_eb,
 			       u64 root_gen, int root_level);
+
+/*
+ * Inform qgroup to trace subtree swap used in balance.
+ * Unlike btrfs_qgroup_trace_subtree(), this function will only trace
+ * new tree blocks whose generation is equal to (or larger than) @last_snapshot.
+ */
+int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
+				struct extent_buffer *src_parent, int src_slot,
+				struct extent_buffer *dst_parent, int dst_slot,
+				u64 last_snapshot);
 int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 				u64 num_bytes, struct ulist *old_roots,
 				struct ulist *new_roots);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 8783a1776540..07ab61a740ae 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1879,14 +1879,9 @@ int replace_path(struct btrfs_trans_handle *trans,
 		 *    and tree block numbers, if current trans doesn't free
 		 *    data reloc tree inode.
 		 */
-		ret = btrfs_qgroup_trace_subtree(trans, parent,
-				btrfs_header_generation(parent),
-				btrfs_header_level(parent));
-		if (ret < 0)
-			break;
-		ret = btrfs_qgroup_trace_subtree(trans, path->nodes[level],
-				btrfs_header_generation(path->nodes[level]),
-				btrfs_header_level(path->nodes[level]));
+		ret = btrfs_qgroup_trace_subtree_swap(trans, parent, slot,
+				path->nodes[level], path->slots[level],
+				last_snapshot);
 		if (ret < 0)
 			break;
 
-- 
2.18.0


* [PATCH v3 5/7] btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
  2018-09-11  5:38 [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata Qu Wenruo
                   ` (3 preceding siblings ...)
  2018-09-11  5:38 ` [PATCH v3 4/7] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents Qu Wenruo
@ 2018-09-11  5:38 ` Qu Wenruo
  2018-09-26 14:35   ` David Sterba
  2018-09-11  5:38 ` [PATCH v3 6/7] btrfs: delayed-ref: Introduce new parameter for btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing Qu Wenruo
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2018-09-11  5:38 UTC (permalink / raw)
  To: linux-btrfs

The reloc tree doesn't contribute to qgroup numbers, as we have already
accounted for them at balance time (see replace_path()).

Skipping such unneeded subtree tracing should reduce some performance
overhead.
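
E.g. when a merged reloc tree is finally dropped, do_walk_down() used
to trace every subtree it walked down, even though replace_path() had
already accounted the swapped tree blocks during the merge.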

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-tree.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index de6f75f5547b..4588153f414c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8643,7 +8643,13 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 			parent = 0;
 		}
 
-		if (need_account) {
+		/*
+		 * The reloc tree doesn't contribute to qgroup numbers, and
+		 * we have already accounted for them at merge time
+		 * (replace_path), thus we can skip the expensive subtree trace here.
+		 */
+		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
+		    need_account) {
 			ret = btrfs_qgroup_trace_subtree(trans, next,
 							 generation, level - 1);
 			if (ret) {
-- 
2.18.0


* [PATCH v3 6/7] btrfs: delayed-ref: Introduce new parameter for btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing
  2018-09-11  5:38 [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata Qu Wenruo
                   ` (4 preceding siblings ...)
  2018-09-11  5:38 ` [PATCH v3 5/7] btrfs: qgroup: Don't trace subtree if we're dropping reloc tree Qu Wenruo
@ 2018-09-11  5:38 ` Qu Wenruo
  2018-09-26 14:40   ` David Sterba
  2018-09-11  5:38 ` [PATCH v3 7/7] btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group Qu Wenruo
  2018-09-26 14:06 ` [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata David Sterba
  7 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2018-09-11  5:38 UTC (permalink / raw)
  To: linux-btrfs

For btrfs_add_delayed_tree_ref(), its @ref_root parameter can be
different from the real root.
This is pretty common for the reloc tree, in which case @ref_root is
passed as the original tree owner (the source file tree).

However, btrfs_add_delayed_tree_ref() uses @ref_root to determine
whether we should do the heavy qgroup tracing for it, so for balance
this could cause a lot of unnecessary extent tracing.

This patch introduces a new parameter, @root, for the qgroup code to
determine whether we really need to do the heavy-lifting tracing work.

It reduces the number of qgroup dirty extents from 'number of extents
being relocated' to 0 for the transactions before the real tree swap.
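
E.g. when operating on a reloc tree, root->root_key.objectid is
BTRFS_TREE_RELOC_OBJECTID, which is not an fstree, so no qgroup extent
record needs to be created, while the old code only looked at @ref_root
(the source subvolume id, e.g. 257), which is an fstree, and thus
traced the extent unnecessarily.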

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/delayed-ref.c |  5 +++--
 fs/btrfs/delayed-ref.h |  1 +
 fs/btrfs/extent-tree.c | 17 ++++++++++-------
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 62ff545ba1f7..7f681dd36e7f 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -710,8 +710,9 @@ static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
  * transaction commits.
  */
 int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
+			       struct btrfs_root *root,
 			       u64 bytenr, u64 num_bytes, u64 parent,
-			       u64 ref_root,  int level, int action,
+			       u64 ref_root, int level, int action,
 			       struct btrfs_delayed_extent_op *extent_op,
 			       int *old_ref_mod, int *new_ref_mod)
 {
@@ -737,7 +738,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 	}
 
 	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-	    is_fstree(ref_root)) {
+	    is_fstree(root->root_key.objectid)) {
 		record = kmalloc(sizeof(*record), GFP_NOFS);
 		if (!record) {
 			kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index d9f2a4ebd5db..a51560a3cb66 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -235,6 +235,7 @@ static inline void btrfs_put_delayed_ref_head(struct btrfs_delayed_ref_head *hea
 }
 
 int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
+			       struct btrfs_root *root,
 			       u64 bytenr, u64 num_bytes, u64 parent,
 			       u64 ref_root, int level, int action,
 			       struct btrfs_delayed_extent_op *extent_op,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4588153f414c..7a3c650b11f8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2040,9 +2040,10 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 			   owner, offset, BTRFS_ADD_DELAYED_REF);
 
 	if (owner < BTRFS_FIRST_FREE_OBJECTID) {
-		ret = btrfs_add_delayed_tree_ref(trans, bytenr,
+		ret = btrfs_add_delayed_tree_ref(trans, root, bytenr,
 						 num_bytes, parent,
-						 root_objectid, (int)owner,
+						 root_objectid,
+						 (int)owner,
 						 BTRFS_ADD_DELAYED_REF, NULL,
 						 &old_ref_mod, &new_ref_mod);
 	} else {
@@ -6993,7 +6994,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 				   root->root_key.objectid,
 				   btrfs_header_level(buf), 0,
 				   BTRFS_DROP_DELAYED_REF);
-		ret = btrfs_add_delayed_tree_ref(trans, buf->start,
+		ret = btrfs_add_delayed_tree_ref(trans, root, buf->start,
 						 buf->len, parent,
 						 root->root_key.objectid,
 						 btrfs_header_level(buf),
@@ -7072,9 +7073,10 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
 		old_ref_mod = new_ref_mod = 0;
 		ret = 0;
 	} else if (owner < BTRFS_FIRST_FREE_OBJECTID) {
-		ret = btrfs_add_delayed_tree_ref(trans, bytenr,
+		ret = btrfs_add_delayed_tree_ref(trans, root, bytenr,
 						 num_bytes, parent,
-						 root_objectid, (int)owner,
+						 root_objectid,
+						 (int)owner,
 						 BTRFS_DROP_DELAYED_REF, NULL,
 						 &old_ref_mod, &new_ref_mod);
 	} else {
@@ -8293,9 +8295,10 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 		btrfs_ref_tree_mod(root, ins.objectid, ins.offset, parent,
 				   root_objectid, level, 0,
 				   BTRFS_ADD_DELAYED_EXTENT);
-		ret = btrfs_add_delayed_tree_ref(trans, ins.objectid,
+		ret = btrfs_add_delayed_tree_ref(trans, root, ins.objectid,
 						 ins.offset, parent,
-						 root_objectid, level,
+						 root_objectid,
+						 level,
 						 BTRFS_ADD_DELAYED_EXTENT,
 						 extent_op, NULL, NULL);
 		if (ret)
-- 
2.18.0


* [PATCH v3 7/7] btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group
  2018-09-11  5:38 [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata Qu Wenruo
                   ` (5 preceding siblings ...)
  2018-09-11  5:38 ` [PATCH v3 6/7] btrfs: delayed-ref: Introduce new parameter for btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing Qu Wenruo
@ 2018-09-11  5:38 ` Qu Wenruo
  2018-09-26 14:06 ` [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata David Sterba
  7 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2018-09-11  5:38 UTC (permalink / raw)
  To: linux-btrfs

In qgroup_trace_extent_swap(), if we find a leaf that needs to be
traced, btrfs will also iterate over all its file extents and trace
them.

This is OK if we're relocating data block groups, but if we're
relocating metadata block groups, the balance code itself has ensured
that the subtrees of the file tree and the reloc tree contain the same
contents.

That is to say, if we're relocating metadata block groups, all file
extents in the reloc and file trees should match, so there is no need
to trace them.
This should reduce the total number of dirty extents processed during
metadata block group balance.
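
E.g. with the "balance -m" run from the cover letter benchmark, every
relocated block group is metadata, so with this patch no leaf file
extent items are traced at all during the subtree swaps.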

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c     | 21 +++++++++++++++------
 fs/btrfs/qgroup.h     |  1 +
 fs/btrfs/relocation.c | 10 +++++-----
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a94027b2620e..f58350a37c2d 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1725,7 +1725,8 @@ static int adjust_slots_upwards(struct btrfs_path *path, int root_level)
 static int qgroup_trace_extent_swap(struct btrfs_trans_handle* trans,
 				    struct extent_buffer *src_eb,
 				    struct btrfs_path *dst_path,
-				    int dst_level, int root_level)
+				    int dst_level, int root_level,
+				    bool trace_leaf)
 {
 	struct btrfs_key key;
 	struct btrfs_path *src_path;
@@ -1828,7 +1829,7 @@ static int qgroup_trace_extent_swap(struct btrfs_trans_handle* trans,
 		goto out;
 
 	/* Record leaf file extents */
-	if (dst_level == 0) {
+	if (dst_level == 0 && trace_leaf) {
 		ret = btrfs_qgroup_trace_leaf_items(trans, src_path->nodes[0]);
 		if (ret < 0)
 			goto out;
@@ -1865,7 +1866,7 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
 					   struct extent_buffer *src_eb,
 					   struct btrfs_path *dst_path,
 					   int cur_level, int root_level,
-					   u64 last_snapshot)
+					   u64 last_snapshot, bool trace_leaf)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct extent_buffer *eb;
@@ -1916,7 +1917,7 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
 
 	/* Now record this tree block and its counter part for qgroups */
 	ret = qgroup_trace_extent_swap(trans, src_eb, dst_path, cur_level,
-				       root_level);
+				       root_level, trace_leaf);
 	if (ret < 0)
 		goto cleanup;
 
@@ -1933,7 +1934,7 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
 			/* Recursive call (at most 7 times) */
 			ret = qgroup_trace_new_subtree_blocks(trans, src_eb,
 					dst_path, cur_level - 1, root_level,
-					last_snapshot);
+					last_snapshot, trace_leaf);
 			if (ret < 0)
 				goto cleanup;
 		}
@@ -1969,6 +1970,7 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
  * @dst_parent, @dst_slot: pointer to dst (reloc tree) eb.
  */
 int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
+				struct btrfs_block_group_cache *bg_cache,
 				struct extent_buffer *src_parent, int src_slot,
 				struct extent_buffer *dst_parent, int dst_slot,
 				u64 last_snapshot)
@@ -1978,6 +1980,7 @@ int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
 	struct btrfs_key first_key;
 	struct extent_buffer *src_eb = NULL;
 	struct extent_buffer *dst_eb = NULL;
+	bool trace_leaf = false;
 	u64 child_gen;
 	u64 child_bytenr;
 	int level;
@@ -1990,6 +1993,12 @@ int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
 	BUG_ON(btrfs_node_ptr_generation(src_parent, src_slot) >
 		btrfs_node_ptr_generation(dst_parent, dst_slot));
 
+	/*
+	 * Only trace leaves if we're relocating data block groups; this
+	 * avoids tons of data extent tracing for meta/sys bg relocation.
+	 */
+	if (bg_cache->flags & BTRFS_BLOCK_GROUP_DATA)
+		trace_leaf = true;
 	/* Read out real @src_eb, pointed by @src_parent and @src_slot */
 	child_bytenr = btrfs_node_blockptr(src_parent, src_slot);
 	child_gen = btrfs_node_ptr_generation(src_parent, src_slot);
@@ -2033,7 +2042,7 @@ int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
 
 	/* Do the generation aware depth-first search */
 	ret = qgroup_trace_new_subtree_blocks(trans, src_eb, dst_path, level,
-					      level, last_snapshot);
+					      level, last_snapshot, trace_leaf);
 	if (ret < 0)
 		goto out;
 	ret = 0;
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 9f9943dfd493..5289cf0e5e6b 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -243,6 +243,7 @@ int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
  * new tree blocks whose generation is equal to (or larger than) @last_snapshot.
  */
 int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
+				struct btrfs_block_group_cache *bg_cache,
 				struct extent_buffer *src_parent, int src_slot,
 				struct extent_buffer *dst_parent, int dst_slot,
 				u64 last_snapshot);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 07ab61a740ae..44efde9886fc 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1735,7 +1735,7 @@ int memcmp_node_keys(struct extent_buffer *eb, int slot,
  * errors, a negative error number is returned.
  */
 static noinline_for_stack
-int replace_path(struct btrfs_trans_handle *trans,
+int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 		 struct btrfs_root *dest, struct btrfs_root *src,
 		 struct btrfs_path *path, struct btrfs_key *next_key,
 		 int lowest_level, int max_level)
@@ -1879,9 +1879,9 @@ int replace_path(struct btrfs_trans_handle *trans,
 		 *    and tree block numbers, if current trans doesn't free
 		 *    data reloc tree inode.
 		 */
-		ret = btrfs_qgroup_trace_subtree_swap(trans, parent, slot,
-				path->nodes[level], path->slots[level],
-				last_snapshot);
+		ret = btrfs_qgroup_trace_subtree_swap(trans, rc->block_group,
+				parent, slot, path->nodes[level],
+				path->slots[level], last_snapshot);
 		if (ret < 0)
 			break;
 
@@ -2200,7 +2200,7 @@ static noinline_for_stack int merge_reloc_root(struct reloc_control *rc,
 		    btrfs_comp_cpu_keys(&next_key, &key) >= 0) {
 			ret = 0;
 		} else {
-			ret = replace_path(trans, root, reloc_root, path,
+			ret = replace_path(trans, rc, root, reloc_root, path,
 					   &next_key, level, max_level);
 		}
 		if (ret < 0) {
-- 
2.18.0


* Re: [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata
  2018-09-11  5:38 [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata Qu Wenruo
                   ` (6 preceding siblings ...)
  2018-09-11  5:38 ` [PATCH v3 7/7] btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group Qu Wenruo
@ 2018-09-26 14:06 ` David Sterba
  2018-09-26 14:17   ` Qu Wenruo
  7 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2018-09-26 14:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Sep 11, 2018 at 01:38:11PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/qgroup_balance_skip_trees
> The base commit is v4.19-rc1 tag.

I want to merge this patchset to 4.20, it's been in for-next for some
time and it addresses a problem that gets reported very often.

I'm still doing some tests, there are minor issues that are not
blocking, more code review is always welcome.

> There are a lot of reports of system hangs during balance on quota enabled
> fs.
> It's most obvious for large fs.
> 
> The hang is caused by tons of unmodified extents marked as qgroup dirty.
> Such unmodified/unrelated sources include:
> 1) Unmodified subtree
> 2) Subtree drop for reloc tree
> (BTW, other sources include unmodified file extent items)
> 
> E.g.
> OO = Old tree blocks from file tree
> NN = New tree blocks from reloc tree
> 
>         file tree                              reloc tree
>            OO (a)                                  NN (a)
>           /  \                                    /  \
>     (b) OO    OO (c)                        (b) NN    NN (c)
>        / \   / \                               / \   / \
>      OO  OO OO  OO                           OO  OO OO  NN
>     (d) (e) (f) (g)                         (d) (e) (f) (g)
> 
> In the above case, balance will modify the nodeptr in OO(a) to point to
> NN(b) and NN(c), and modify NN(a) to point to OO(b) and OO(c).
> 
> Before this patch, quota will mark the whole subtree from its parent
> down to the leaves as dirty.
> So btrfs quota needs to trace all tree blocks from (a) to (g).

I find the use of 'trace' a bit confusing as there are the trace events,
but as it's already used in the qgroup code I think it can stay as is.

> However tree blocks (d) (e) (f) are shared between both trees, thus
> there is no need to trace those 3 tree blocks.
> 
> This patchset will change how this works by only tracing modified tree
> blocks in the reloc tree, and their counterparts in the file tree.
> 
> With this patchset, we can skip tree blocks OO(d)~OO(f) in the above
> example, thus reducing some of the overhead caused by qgroup.
> 
> The improvement is mostly related to metadata relocation.
> 
> Also, for metadata relocation, we don't really need to trace data
> extents as they're not modified at all.
> 
> 
> [[Benchmark]]
[...]

Thanks for the benchmarks, it would be good to add the information to
the patch that's bringing the performance improvement.

> Hardware:
> 	VM 4G vRAM, 8 vCPUs,
> 	disk is using 'unsafe' cache mode,
> 	backing device is SAMSUNG 850 evo SSD.
> 	Host has 16G ram.
> 
> Mkfs parameter:
> 	--nodesize 4K (To bump up tree size)
> 
> Initial subvolume contents:
> 	4G data copied from /usr and /lib.
> 	(With enough regular small files)
> 
> Snapshots:
> 	16 snapshots of the original subvolume.
> 	each snapshot has 3 random files modified.
> 
> balance parameter:
> 	-m
> 
> So the content should be pretty similar to a real world root fs layout.
> 
>                      | v4.19-rc1    | w/ patchset    | diff (*)
> ---------------------------------------------------------------
> relocated extents    | 22874        | 22856          | -0.1%
> qgroup dirty extents | 225251       | 140431         | -37.7%
> time (sys)           | 40.161s      | 20.574s        | -48.7%
> time (real)          | 42.163s      | 25.173s        | -40.3%
> 
> *: (val_new - val_old) / val_old * 100%
> 
> And the difference is becoming more and more obvious if more snapshots
> are created.
> 
> If we increase the number of snapshots to 64 (4 times the number of
> references; using 64 snapshots with quota enabled is already not
> recommended)
> 
>                      | v4.19-rc1    | w/ patchset    | diff (*)
> ---------------------------------------------------------------
> relocated extents    | 22462        | 22467          | +0.0%
> qgroup dirty extents | 314380       | 140826         | -55.2%
> time (sys)           | 158.033s     | 74.292s        | -53.0%
> time (real)           | 197.395s     | 90.529s        | -54.1%
> 
> For larger fs the saving should be even more obvious.
> 
> Changelog:
> v2:
>   Rename "tree reloc tree" to "reloc tree".
>   Add patch "Don't trace subtree if we're dropping reloc tree" into the
>   patchset.
>   Fix wrong btrfs_bin_search() call, which leads to unexpected ENOENT
>   error for btrfs_qgroup_trace_extent_swap(). Now use dst_path->slots[]
>   directly.
>   
> v3:
>   Add new patch to avoid unnecessary data extents trace for metadata
>   relocation.
>   Better delayed ref time root owner detection to avoid unnecessary tree
>   block tracing.
>   Add benchmark result for the patchset.
> 
> Qu Wenruo (7):
>   btrfs: qgroup: Introduce trace event to analyse the number of dirty
>     extents accounted
>   btrfs: qgroup: Introduce function to trace two swapped extents
>   btrfs: qgroup: Introduce function to find all new tree blocks of reloc
>     tree
>   btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
>   btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
>   btrfs: delayed-ref: Introduce new parameter for
>     btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing
>   btrfs: qgroup: Only trace data extents in leaves if we're relocating
>     data block group
> 
>  fs/btrfs/delayed-ref.c       |   5 +-
>  fs/btrfs/delayed-ref.h       |   1 +
>  fs/btrfs/extent-tree.c       |  25 ++-
>  fs/btrfs/qgroup.c            | 347 +++++++++++++++++++++++++++++++++++
>  fs/btrfs/qgroup.h            |  11 ++
>  fs/btrfs/relocation.c        |  15 +-
>  include/trace/events/btrfs.h |  21 +++
>  7 files changed, 405 insertions(+), 20 deletions(-)
> 
> -- 
> 2.18.0


* Re: [PATCH v3 1/7] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted
  2018-09-11  5:38 ` [PATCH v3 1/7] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted Qu Wenruo
@ 2018-09-26 14:06   ` David Sterba
  0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2018-09-26 14:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Sep 11, 2018 at 01:38:12PM +0800, Qu Wenruo wrote:
> Number of qgroup dirty extents is directly linked to the performance
> overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
> record how many dirty extents are processed in
> btrfs_qgroup_account_extents().
> 
> This will be pretty handy for analysing the balance performance
> improvement later.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: David Sterba <dsterba@suse.com>


* Re: [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata
  2018-09-26 14:06 ` [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata David Sterba
@ 2018-09-26 14:17   ` Qu Wenruo
  2018-09-26 14:25     ` Qu Wenruo
  2018-09-26 14:26     ` David Sterba
  0 siblings, 2 replies; 20+ messages in thread
From: Qu Wenruo @ 2018-09-26 14:17 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs





On 2018/9/26 at 10:06 PM, David Sterba wrote:
> On Tue, Sep 11, 2018 at 01:38:11PM +0800, Qu Wenruo wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/linux/tree/qgroup_balance_skip_trees
>> The base commit is v4.19-rc1 tag.
> 
> I want to merge this patchset to 4.20, it's been in for-next for some
> time and it addresses a problem that gets reported very often.
> 
> I'm still doing some tests, there are minor issues that are not
> blocking, more code review is always welcome.
> 
>> There are a lot of reports of system hangs during balance on quota enabled
>> fs.
>> It's most obvious for large fs.
>>
>> The hang is caused by tons of unmodified extents marked as qgroup dirty.
>> Such unmodified/unrelated sources include:
>> 1) Unmodified subtree
>> 2) Subtree drop for reloc tree
>> (BTW, other sources include unmodified file extent items)
>>
>> E.g.
>> OO = Old tree blocks from file tree
>> NN = New tree blocks from reloc tree
>>
>>         file tree                              reloc tree
>>            OO (a)                                  NN (a)
>>           /  \                                    /  \
>>     (b) OO    OO (c)                        (b) NN    NN (c)
>>        / \   / \                               / \   / \
>>      OO  OO OO  OO                           OO  OO OO  NN
>>     (d) (e) (f) (g)                         (d) (e) (f) (g)
>>
>> In the above case, balance will modify the nodeptr in OO(a) to point to
>> NN(b) and NN(c), and modify NN(a) to point to OO(b) and OO(c).
>>
>> Before this patch, quota will mark the whole subtree from its parent
>> down to the leaves as dirty.
>> So btrfs quota needs to trace all tree blocks from (a) to (g).
> 
> I find the use of 'trace' a bit confusing as there are the trace events,
> but as it's already used in the qgroup code I think it can stay as is.

I'm pretty open to renaming the word "trace".

Any recommendation for a new name?

> 
>> However tree blocks (d) (e) (f) are shared between both trees, thus
>> there is no need to trace those 3 tree blocks.
>>
>> This patchset will change how this works by only tracing modified tree
>> blocks in the reloc tree, and their counterparts in the file tree.
>>
>> With this patchset, we can skip tree blocks OO(d)~OO(f) in the above
>> example, thus reducing some of the overhead caused by qgroup.
>>
>> The improvement is mostly related to metadata relocation.
>>
>> Also, for metadata relocation, we don't really need to trace data
>> extents as they're not modified at all.
>>
>>
>> [[Benchmark]]
> [...]
> 
> Thanks for the benchmarks, it would be good to add the information to
> the patch that's bringing the performance improvement.

Could do that.

However, the current improvement (from around 10 * nr_relocated down to
6~7 * nr_relocated) is the combined result of the last 4 patches.

I'm not sure whether any single patch brings an obvious improvement on
its own.

> 
>> Hardware:
>> 	VM 4G vRAM, 8 vCPUs,
>> 	disk is using 'unsafe' cache mode,
>> 	backing device is SAMSUNG 850 evo SSD.
>> 	Host has 16G ram.
>>
>> Mkfs parameter:
>> 	--nodesize 4K (To bump up tree size)
>>
>> Initial subvolume contents:
>> 	4G data copied from /usr and /lib.
>> 	(With enough regular small files)
>>
>> Snapshots:
>> 	16 snapshots of the original subvolume.
>> 	each snapshot has 3 random files modified.
>>
>> balance parameter:
>> 	-m
>>
>> So the content should be pretty similar to a real world root fs layout.
>>
>>                      | v4.19-rc1    | w/ patchset    | diff (*)
>> ---------------------------------------------------------------
>> relocated extents    | 22874        | 22856          | -0.1%
>> qgroup dirty extents | 225251       | 140431         | -37.7%
>> time (sys)           | 40.161s      | 20.574s        | -48.7%
>> time (real)          | 42.163s      | 25.173s        | -40.3%
>>
>> *: (val_new - val_old) / val_old * 100%
>>
>> And the difference is becoming more and more obvious if more snapshots
>> are created.
>>
>> If we increase the number of snapshots to 64 (4 times the number of
>> references; using 64 snapshots with quota enabled is already not
>> recommended)
>>
>>                      | v4.19-rc1    | w/ patchset    | diff (*)
>> ---------------------------------------------------------------
>> relocated extents    | 22462        | 22467          | +0.0%
>> qgroup dirty extents | 314380       | 140826         | -55.2%
>> time (sys)           | 158.033s     | 74.292s        | -53.0%
>> time (real)           | 197.395s     | 90.529s        | -54.1%
>>
>> For larger fs the saving should be even more obvious.
>>
>> Changelog:
>> v2:
>>   Rename "tree reloc tree" to "reloc tree".
>>   Add patch "Don't trace subtree if we're dropping reloc tree" into the
>>   patchset.
>>   Fix wrong btrfs_bin_search() call, which leads to unexpected ENOENT
>>   error for btrfs_qgroup_trace_extent_swap(). Now use dst_path->slots[]
>>   directly.
>>   
>> v3:
>>   Add new patch to avoid unnecessary data extents trace for metadata
>>   relocation.
>>   Better delayed ref time root owner detection to avoid unnecessary tree
>>   block tracing.
>>   Add benchmark result for the patchset.
>>
>> Qu Wenruo (7):
>>   btrfs: qgroup: Introduce trace event to analyse the number of dirty
>>     extents accounted
>>   btrfs: qgroup: Introduce function to trace two swapped extents
>>   btrfs: qgroup: Introduce function to find all new tree blocks of reloc
>>     tree
>>   btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
>>   btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
>>   btrfs: delayed-ref: Introduce new parameter for
>>     btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing
>>   btrfs: qgroup: Only trace data extents in leaves if we're relocating
>>     data block group
>>
>>  fs/btrfs/delayed-ref.c       |   5 +-
>>  fs/btrfs/delayed-ref.h       |   1 +
>>  fs/btrfs/extent-tree.c       |  25 ++-
>>  fs/btrfs/qgroup.c            | 347 +++++++++++++++++++++++++++++++++++
>>  fs/btrfs/qgroup.h            |  11 ++
>>  fs/btrfs/relocation.c        |  15 +-
>>  include/trace/events/btrfs.h |  21 +++
>>  7 files changed, 405 insertions(+), 20 deletions(-)
>>
>> -- 
>> 2.18.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata
  2018-09-26 14:17   ` Qu Wenruo
@ 2018-09-26 14:25     ` Qu Wenruo
  2018-09-26 14:26     ` David Sterba
  1 sibling, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2018-09-26 14:25 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 7446 bytes --]



On 2018/9/26 10:17 PM, Qu Wenruo wrote:
> 
> 
> On 2018/9/26 10:06 PM, David Sterba wrote:
>> On Tue, Sep 11, 2018 at 01:38:11PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux/tree/qgroup_balance_skip_trees
>>> The base commit is v4.19-rc1 tag.
>>
>> I want to merge this patchset into 4.20; it's been in for-next for some
>> time and it addresses a problem that gets reported very often.
>>
>> I'm still doing some tests; there are minor issues, but they are not
>> blocking. More code review is always welcome.
>>
>>> There are a lot of reports of system hang for balance on quota enabled
>>> fs.
>>> It's most obvious for large fs.
>>>
>>> The hang is caused by tons of unmodified extents marked as qgroup dirty.
>>> Such unmodified/unrelated sources include:
>>> 1) Unmodified subtree
>>> 2) Subtree drop for reloc tree
>>> (BTW, other sources include unmodified file extent items)
>>>
>>> E.g.
>>> OO = Old tree blocks from file tree
>>> NN = New tree blocks from reloc tree
>>>
>>>         file tree                              reloc tree
>>>            OO (a)                                  NN (a)
>>>           /  \                                    /  \
>>>     (b) OO    OO (c)                        (b) NN    NN (c)
>>>        / \   / \                               / \   / \
>>>      OO  OO OO  OO                           OO  OO OO  NN
>>>     (d) (e) (f) (g)                         (d) (e) (f) (g)
>>>
>>> In the above case, balance will modify the node pointer in OO(a) to
>>> point to NN(b) and NN(c), and modify NN(a) to point to OO(b) and OO(c).
>>>
>>> Before this patch, quota will mark the whole subtree from its parent
>>> down to the leaves as dirty.
>>> So btrfs quota needs to trace all tree blocks from (a) to (g).
>>
>> I find the use of 'trace' a bit confusing as there are the trace events,
>> but as it's already used in the qgroup code I think it can stay as is.
> 
> I'm pretty open to renaming the word "trace".
> 
> Any recommendation for a better name?
> 
>>
>>> However tree blocks (d) (e) (f) are shared between both trees, thus
>>> there is no need to trace those 3 tree blocks.
>>>
>>> This patchset will change how this works by only tracing modified tree
>>> blocks in the reloc tree, and their counterparts in the file tree.
>>>
>>> With this patch, we could skip tree blocks OO(d)~OO(f) in the above
>>> example, thus reducing some overhead caused by qgroup.
>>>
>>> The improvement is mostly related to metadata relocation.
>>>
>>> Also, for metadata relocation, we don't really need to trace data
>>> extents as they're not modified at all.
>>>
>>>
>>> [[Benchmark]]
>> [...]
>>
>> Thanks for the benchmarks; it would be good to add the information to
>> the patch that's bringing the performance improvement.
> 
> Could do that.
> 
> However, the current improvement (from around 10 * nr_relocated to 6~7 *
> nr_relocated) is the combined result of the last 4 patches.
> 
> I'm not sure any single patch could bring an obvious improvement on its
> own.
> 
>>
>>> Hardware:
>>> 	VM 4G vRAM, 8 vCPUs,
>>> 	disk is using 'unsafe' cache mode,
>>> 	backing device is SAMSUNG 850 evo SSD.
>>> 	Host has 16G ram.
>>>
>>> Mkfs parameter:
>>> 	--nodesize 4K (To bump up tree size)
>>>
>>> Initial subvolume contents:
>>> 	4G data copied from /usr and /lib.
>>> 	(With enough regular small files)
>>>
>>> Snapshots:
>>> 	16 snapshots of the original subvolume.
>>> 	each snapshot has 3 random files modified.
>>>
>>> balance parameter:
>>> 	-m
>>>
>>> So the content should be pretty similar to a real world root fs layout.
>>>
>>>                      | v4.19-rc1    | w/ patchset    | diff (*)
>>> ---------------------------------------------------------------
>>> relocated extents    | 22874        | 22856          | -0.1%
>>> qgroup dirty extents | 225251       | 140431         | -37.7%
>>> time (sys)           | 40.161s      | 20.574s        | -48.7%
>>> time (real)          | 42.163s      | 25.173s        | -40.3%
>>>
>>> *: (val_new - val_old) / val_old * 100%
>>>
>>> And the difference is becoming more and more obvious if more snapshots
>>> are created.
>>>
>>> If we increase the number of snapshots to 64 (4 times the number of
>>> references, and 64 snapshots is already not recommended to use with
>>> quota)
>>>
>>>                      | v4.19-rc1    | w/ patchset    | diff (*)
>>> ---------------------------------------------------------------
>>> relocated extents    | 22462        | 22467          | +0.0%
>>> qgroup dirty extents | 314380       | 140826         | -55.2%
>>> time (sys)           | 158.033s     | 74.292s        | -53.0%
>>> time (real)           | 197.395s     | 90.529s        | -54.1%

BTW, this result is still far from perfect.
(And in real-world cases, due to extra disk IO and tree read seeks, the
improvement could be less obvious.)

The theoretical minimum of qgroup dirty extents should be only 2 *
nr_relocated, as each swapped tree block only needs to dirty itself and
its counterpart (maybe 3, at most 4 due to other overhead, but
definitely not the current 6~7).
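
To make that concrete, here is a toy user-space model of the
generation-aware walk (just a sketch to illustrate the counting, not
the kernel code; all names in it are made up):

#include <stdint.h>
#include <stdio.h>

struct node {
	uint64_t gen;		/* generation of the tree block */
	int nr;			/* number of children */
	struct node *child[2];
};

/*
 * Visit a block, then recurse only into children whose generation is
 * >= last_snapshot, pruning old shared subtrees the same way
 * qgroup_trace_new_subtree_blocks() is meant to.
 */
static int count_traced(const struct node *n, uint64_t last_snapshot)
{
	int traced = 1;
	int i;

	for (i = 0; i < n->nr; i++) {
		if (n->child[i]->gen < last_snapshot)
			continue;	/* OO subtree: skip it */
		traced += count_traced(n->child[i], last_snapshot);
	}
	return traced;
}

int main(void)
{
	/* The example tree from the cover letter: gen 5 = OO, 10 = NN */
	struct node c = { .gen = 5 }, d = { .gen = 10 };
	struct node e = { .gen = 5 }, f = { .gen = 5 };
	struct node b = { .gen = 10, .nr = 2, .child = { &c, &d } };
	struct node old = { .gen = 5, .nr = 2, .child = { &e, &f } };
	struct node a = { .gen = 10, .nr = 2, .child = { &old, &b } };

	/* Prints 3 (a, b, d); doubling that for the file tree
	 * counterparts is where the 2 * nr_relocated bound comes from */
	printf("traced %d new blocks\n", count_traced(&a, 10));
	return 0;
}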

Furthermore, the really big part is the backref walk code.
Currently we're using btrfs_find_all_roots(), which doesn't do any
caching for backrefs.

We have a cached backref implementation in
relocation.c::build_backref_tree().
Jeff is working on a global version of it with an even better roots
result cache, which should make the total time linear in the number of
qgroup dirty extents.
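
The shape of such a cache, as a rough user-space sketch (not the kernel
code; the real thing also has to invalidate entries when references
change, this only shows the memoization):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_SLOTS 1024

/* One entry memoizes "which roots reference this bytenr" */
struct roots_entry {
	bool valid;
	uint64_t bytenr;
	uint64_t roots;		/* stand-in for the ulist of roots */
};

static struct roots_entry cache[CACHE_SLOTS];

/* Stub standing in for the expensive uncached backref walk */
static uint64_t slow_backref_walk(uint64_t bytenr)
{
	return bytenr ^ 0xabcd;	/* pretend result */
}

static uint64_t find_all_roots_cached(uint64_t bytenr)
{
	struct roots_entry *e = &cache[bytenr % CACHE_SLOTS];

	if (e->valid && e->bytenr == bytenr)
		return e->roots;	/* hit: no walk at all */

	e->roots = slow_backref_walk(bytenr);
	e->bytenr = bytenr;
	e->valid = true;
	return e->roots;
}

int main(void)
{
	/* The second lookup of the same bytenr never walks again */
	printf("%llu\n", (unsigned long long)find_all_roots_cached(4096));
	printf("%llu\n", (unsigned long long)find_all_roots_cached(4096));
	return 0;
}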

But currently I'm working on reducing nr_relocated to the theoretical
minimum first (and enhancing the comments for any potential reviewer).

Thanks,
Qu

>>>
>>> For larger fs the saving should be even more obvious.
>>>
>>> Changelog:
>>> v2:
>>>   Rename "tree reloc tree" to "reloc tree".
>>>   Add patch "Don't trace subtree if we're dropping reloc tree" into the
>>>   patchset.
>>>   Fix wrong btrfs_bin_search() call, which leads to unexpected ENOENT
>>>   error for btrfs_qgroup_trace_extent_swap(). Now use dst_path->slots[]
>>>   directly.
>>>   
>>> v3:
>>>   Add new patch to avoid unnecessary data extents trace for metadata
>>>   relocation.
>>>   Better delayed ref time root owner detection to avoid unnecessary tree
>>>   block tracing.
>>>   Add benchmark result for the patchset.
>>>
>>> Qu Wenruo (7):
>>>   btrfs: qgroup: Introduce trace event to analyse the number of dirty
>>>     extents accounted
>>>   btrfs: qgroup: Introduce function to trace two swaped extents
>>>   btrfs: qgroup: Introduce function to find all new tree blocks of reloc
>>>     tree
>>>   btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
>>>   btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
>>>   btrfs: delayed-ref: Introduce new parameter for
>>>     btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing
>>>   btrfs: qgroup: Only trace data extents in leaves if we're relocating
>>>     data block group
>>>
>>>  fs/btrfs/delayed-ref.c       |   5 +-
>>>  fs/btrfs/delayed-ref.h       |   1 +
>>>  fs/btrfs/extent-tree.c       |  25 ++-
>>>  fs/btrfs/qgroup.c            | 347 +++++++++++++++++++++++++++++++++++
>>>  fs/btrfs/qgroup.h            |  11 ++
>>>  fs/btrfs/relocation.c        |  15 +-
>>>  include/trace/events/btrfs.h |  21 +++
>>>  7 files changed, 405 insertions(+), 20 deletions(-)
>>>
>>> -- 
>>> 2.18.0
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata
  2018-09-26 14:17   ` Qu Wenruo
  2018-09-26 14:25     ` Qu Wenruo
@ 2018-09-26 14:26     ` David Sterba
  1 sibling, 0 replies; 20+ messages in thread
From: David Sterba @ 2018-09-26 14:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Wed, Sep 26, 2018 at 10:17:13PM +0800, Qu Wenruo wrote:
> >> Before this patch, quota will mark the whole subtree from its parent
> >> down to the leaves as dirty.
> >> So btrfs quota needs to trace all tree blocks from (a) to (g).
> > 
> > I find the use of 'trace' a bit confusing as there are the trace events,
> > but as it's already used in the qgroup code I think it can stay as is.
> 
> I'm pretty open to renaming the word "trace".
> 
> Any recommendation for a better name?

No, we can rename the functions later if needed.

> >> However tree blocks (d) (e) (f) are shared between both trees, thus
> >> there is no need to trace those 3 tree blocks.
> >>
> >> This patchset will change how this works by only tracing modified tree
> >> blocks in the reloc tree, and their counterparts in the file tree.
> >>
> >> With this patch, we could skip tree blocks OO(d)~OO(f) in the above
> >> example, thus reducing some overhead caused by qgroup.
> >>
> >> The improvement is mostly related to metadata relocation.
> >>
> >> Also, for metadata relocation, we don't really need to trace data
> >> extents as they're not modified at all.
> >>
> >>
> >> [[Benchmark]]
> > [...]
> > 
> > Thanks for the benchmarks; it would be good to add the information to
> > the patch that's bringing the performance improvement.
> 
> Could do that.
> 
> However, the current improvement (from around 10 * nr_relocated to 6~7 *
> nr_relocated) is the combined result of the last 4 patches.
> 
> I'm not sure any single patch could bring an obvious improvement on its
> own.

Patch 5 would be a good candidate, noting that further patches bring
more performance improvements.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/7] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree
  2018-09-11  5:38 ` [PATCH v3 3/7] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree Qu Wenruo
@ 2018-09-26 14:29   ` David Sterba
  2018-09-26 14:40     ` Qu Wenruo
  0 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2018-09-26 14:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Sep 11, 2018 at 01:38:14PM +0800, Qu Wenruo wrote:
> Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate
> all new tree blocks in a reloc tree.
> So that qgroup could skip unrelated tree blocks during balance, which
> should hugely speed up balance when quota is enabled.
> 
> The function qgroup_trace_new_subtree_blocks() itself only cares about
> new tree blocks in reloc tree.
> 
> Its main tasks are:
> 
> 1) Read out tree blocks according to parent pointers
> 
> 2) Do recursive depth-first search
>    Will call the same function on all its children tree blocks, with
>    search level set to current level -1.
>    And will also skip all children whose generation is smaller than
>    @last_snapshot.
> 
> 3) Call qgroup_trace_extent_swap() to trace tree blocks
> 
> So although we have a parameter list related to the source file tree,
> it's not used here, only passed through to qgroup_trace_extent_swap().
> Thus, apart from the tree read code, the core should be pretty short
> and all about the recursive depth-first search.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/qgroup.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 114 insertions(+)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 5155fb42ce79..0702953d70a7 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1839,6 +1839,120 @@ static int qgroup_trace_extent_swap(struct btrfs_trans_handle* trans,
>  	return ret;
>  }
>  
> +/*
> + * Helper function to do recursive generation aware depth-first search, to
> + * locate all new tree blocks in a subtree of tree reloc tree.
> + *
> + * E.g. (OO = Old tree blocks, NN = New tree blocks, whose gen == last_snapshot)
> + *       Tree reloc tree
> + * L2         NN (a)
> + *          /    \
> + * L1    OO        NN (b)
> + *      /  \      /  \
> + * L0  OO  OO    OO  NN
> + *               (c) (d)
> + * If we pass:
> + * @dst_path = [ nodes[1] = NN(b), nodes[0] = NULL ],
> + * @cur_level = 1
> + * @root_level = 1
> + *
> + * We will iterate through tree blocks NN(b), NN(d) and inform qgroup to
> + * trace the above tree blocks along with their counterparts in the file
> + * tree. During the search, the old tree block OO(c) will be skipped as
> + * the tree block swap won't affect OO(c).
> + */
> +static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
> +					   struct extent_buffer *src_eb,
> +					   struct btrfs_path *dst_path,
> +					   int cur_level, int root_level,
> +					   u64 last_snapshot)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct extent_buffer *eb;
> +	bool need_cleanup = false;
> +	int ret = 0;
> +	int i;

This function could be called recursively based on the value of
cur_level, so please add some sanity checks so this does not accidentally
escape. It needs to be a plain if/return, not just an assert or bug-on.
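
Something like the following would do (untested sketch, exact message
and error value up to you):

	if (cur_level < 0 || cur_level >= BTRFS_MAX_LEVEL ||
	    root_level < 0 || root_level >= BTRFS_MAX_LEVEL) {
		btrfs_err_rl(fs_info,
			"%s: bad levels, cur_level=%d root_level=%d",
			__func__, cur_level, root_level);
		return -EUCLEAN;
	}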

> +
> +	/* Read the tree block if needed */
> +	if (dst_path->nodes[cur_level] == NULL) {
> +		struct btrfs_key first_key;
> +		int parent_slot;
> +		u64 child_gen;
> +		u64 child_bytenr;
> +
> +		/*
> +		 * We need to get child blockptr/gen from parent before
> +		 * we can read it.
> +		  */
> +		eb = dst_path->nodes[cur_level + 1];
> +		parent_slot = dst_path->slots[cur_level + 1];
> +		child_bytenr = btrfs_node_blockptr(eb, parent_slot);
> +		child_gen = btrfs_node_ptr_generation(eb, parent_slot);
> +		btrfs_node_key_to_cpu(eb, &first_key, parent_slot);
> +
> +		/* This node is old, no need to trace */
> +		if (child_gen < last_snapshot)
> +			goto out;
> +
> +		eb = read_tree_block(fs_info, child_bytenr, child_gen,
> +				     cur_level, &first_key);
> +		if (IS_ERR(eb)) {
> +			ret = PTR_ERR(eb);
> +			goto out;
> +		} else if (!extent_buffer_uptodate(eb)) {
> +			free_extent_buffer(eb);
> +			ret = -EIO;
> +			goto out;
> +		}
> +
> +		dst_path->nodes[cur_level] = eb;
> +		dst_path->slots[cur_level] = 0;
> +
> +		btrfs_tree_read_lock(eb);
> +		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
> +		dst_path->locks[cur_level] = BTRFS_READ_LOCK_BLOCKING;
> +		need_cleanup = true;
> +	}
> +
> +	/* Now record this tree block and its counter part for qgroups */
> +	ret = qgroup_trace_extent_swap(trans, src_eb, dst_path, cur_level,
> +				       root_level);
> +	if (ret < 0)
> +		goto cleanup;
> +
> +	eb = dst_path->nodes[cur_level];
> +
> +	if (cur_level > 0) {
> +		/* Iterate all children tree blocks */
> +		for (i = 0; i < btrfs_header_nritems(eb); i++) {
> +			/* Skip old tree blocks as they won't be swapped */
> +			if (btrfs_node_ptr_generation(eb, i) < last_snapshot)
> +				continue;
> +			dst_path->slots[cur_level] = i;
> +
> +			/* Recursive call (at most 7 times) */
> +			ret = qgroup_trace_new_subtree_blocks(trans, src_eb,
> +					dst_path, cur_level - 1, root_level,
> +					last_snapshot);
> +			if (ret < 0)
> +				goto cleanup;
> +		}
> +	}
> +
> +cleanup:
> +	if (need_cleanup) {
> +		/* Clean up */
> +		btrfs_tree_unlock_rw(dst_path->nodes[cur_level],
> +				     dst_path->locks[cur_level]);
> +		free_extent_buffer(dst_path->nodes[cur_level]);
> +		dst_path->nodes[cur_level] = NULL;
> +		dst_path->slots[cur_level] = 0;
> +		dst_path->locks[cur_level] = 0;
> +	}
> +out:
> +	return ret;
> +}
> +
>  int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
>  			       struct extent_buffer *root_eb,
>  			       u64 root_gen, int root_level)
> -- 
> 2.18.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/7] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
  2018-09-11  5:38 ` [PATCH v3 4/7] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents Qu Wenruo
@ 2018-09-26 14:35   ` David Sterba
  2018-09-26 14:43     ` Qu Wenruo
  0 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2018-09-26 14:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Sep 11, 2018 at 01:38:15PM +0800, Qu Wenruo wrote:
> Before this patch, for quota enabled balance, btrfs needs to mark the
> whole subtree dirty for quota.
> 
> E.g.
> OO = Old tree blocks (from file tree)
> NN = New tree blocks (from reloc tree)
> 
>         File tree (src)		          Reloc tree (dst)
>             OO (a)                              NN (a)
>            /  \                                /  \
>      (b) OO    OO (c)                    (b) NN    NN (c)
>         /  \  /  \                          /  \  /  \
>        OO  OO OO OO (d)                    OO  OO OO NN (d)
> 
> For old balance + quota case, quota will mark the whole src and dst tree
> dirty, including all the 3 old tree blocks in reloc tree.
> 
> It's doable for a small file tree, or when the new tree blocks are all
> located at lower levels.
> 
> But for a large file tree, or when the new tree blocks are located at
> higher levels, this will lead to marking the whole tree dirty, and be
> unbelievably slow.
> 
> This patch will change how we handle such balance with quota enabled
> case.
> 
> Now we will search from (b) and (c) for any new tree blocks whose generation
> is equal to @last_snapshot, and only mark them dirty.
> 
> In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
> (NN(a) will be traced when CoW happens for nodeptr modification).
> And also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when
> CoW happens for nodeptr modification)
> 
> For above case, we could skip 3 tree blocks, but for larger tree, we can
> skip tons of unmodified tree blocks, and hugely speed up balance.
> 
> This patch will introduce a new function,
> btrfs_qgroup_trace_subtree_swap(), which will do the following main
> work:
> 
> 1) Read out real root eb
>    And setup basic dst_path for later calls
> 2) Call qgroup_trace_new_subtree_blocks()
>    To trace all new tree blocks in the reloc tree and their
>    counterparts in the file tree.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/qgroup.c     | 94 +++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/qgroup.h     | 10 +++++
>  fs/btrfs/relocation.c | 11 ++---
>  3 files changed, 107 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 0702953d70a7..a94027b2620e 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1953,6 +1953,100 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
>  	return ret;
>  }
>  
> +/*
> + * For balance subtree swap.
> + *
> + * Will go down the tree block pointed by @dst_eb (pointed by @dst_parent and
> + * @dst_slot), and find any tree blocks whose generation is at least
> + * @last_snapshot, and then go down @src_eb (pointed by @src_parent and
> + * @src_slot) to find the counterpart of the tree block, then mark both
> + * tree blocks as qgroup dirty, and skip all tree blocks whose generation
> + * is smaller than last_snapshot.
> + *
> + * This would skip tons of tree blocks of original btrfs_qgroup_trace_subtree(),
> + * which is the culprit of super slow balance if the file tree is large.
> + *
> + * @src_parent, @src_slot: pointer to src (file tree) eb.
> + * @dst_parent, @dst_slot: pointer to dst (tree reloc tree) eb.
> + */
> +int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
> +				struct extent_buffer *src_parent, int src_slot,
> +				struct extent_buffer *dst_parent, int dst_slot,
> +				u64 last_snapshot)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_path *dst_path = NULL;
> +	struct btrfs_key first_key;
> +	struct extent_buffer *src_eb = NULL;
> +	struct extent_buffer *dst_eb = NULL;
> +	u64 child_gen;
> +	u64 child_bytenr;
> +	int level;
> +	int ret;
> +
> +	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +		return 0;
> +
> +	/* Wrong parameter order */
> +	BUG_ON(btrfs_node_ptr_generation(src_parent, src_slot) >
> +		btrfs_node_ptr_generation(dst_parent, dst_slot));

If this is to catch development-time errors, then an ASSERT is probably
better, but it looks like this needs to be an ordinary runtime sanity
check too.
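
I.e. something like this instead of the BUG_ON (untested sketch):

	if (btrfs_node_ptr_generation(src_parent, src_slot) >
	    btrfs_node_ptr_generation(dst_parent, dst_slot)) {
		WARN_ON_ONCE(1);	/* wrong parameter order */
		return -EUCLEAN;
	}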

> +
> +	/* Read out real @src_eb, pointed by @src_parent and @src_slot */
> +	child_bytenr = btrfs_node_blockptr(src_parent, src_slot);
> +	child_gen = btrfs_node_ptr_generation(src_parent, src_slot);
> +	btrfs_node_key_to_cpu(src_parent, &first_key, src_slot);
> +
> +	src_eb = read_tree_block(fs_info, child_bytenr, child_gen,
> +			btrfs_header_level(src_parent) - 1, &first_key);
> +	if (IS_ERR(src_eb)) {
> +		ret = PTR_ERR(src_eb);
> +		goto out;
> +	}
> +
> +	/* Read out real @dst_eb, pointed by @dst_parent and @dst_slot */
> +	child_bytenr = btrfs_node_blockptr(dst_parent, dst_slot);
> +	child_gen = btrfs_node_ptr_generation(dst_parent, dst_slot);
> +	btrfs_node_key_to_cpu(dst_parent, &first_key, dst_slot);
> +
> +	dst_eb = read_tree_block(fs_info, child_bytenr, child_gen,
> +			btrfs_header_level(dst_parent) - 1, &first_key);
> +	if (IS_ERR(dst_eb)) {
> +		ret = PTR_ERR(dst_eb);
> +		goto out;
> +	}
> +
> +	if (!extent_buffer_uptodate(src_eb) || !extent_buffer_uptodate(dst_eb)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	level = btrfs_header_level(dst_eb);
> +	dst_path = btrfs_alloc_path();
> +	if (!dst_path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	/* For dst_path */
> +	extent_buffer_get(dst_eb);
> +	dst_path->nodes[level] = dst_eb;
> +	dst_path->slots[level] = 0;
> +	dst_path->locks[level] = 0;
> +
> +	/* Do the generation-aware depth-first search */
> +	ret = qgroup_trace_new_subtree_blocks(trans, src_eb, dst_path, level,
> +					      level, last_snapshot);
> +	if (ret < 0)
> +		goto out;
> +	ret = 0;
> +
> +out:
> +	free_extent_buffer(src_eb);
> +	free_extent_buffer(dst_eb);
> +	btrfs_free_path(dst_path);
> +	if (ret < 0)
> +		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
> +	return ret;
> +}
> +
>  int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
>  			       struct extent_buffer *root_eb,
>  			       u64 root_gen, int root_level)
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 54b8bb282c0e..9f9943dfd493 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -236,6 +236,16 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
>  int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
>  			       struct extent_buffer *root_eb,
>  			       u64 root_gen, int root_level);
> +
> +/*
> + * Inform qgroup to trace subtree swap used in balance.
> + * Unlike btrfs_qgroup_trace_subtree(), this function will only trace
> + * new tree blocks whose generation is equal to (or larger than) @last_snapshot.
> + */
> +int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
> +				struct extent_buffer *src_parent, int src_slot,
> +				struct extent_buffer *dst_parent, int dst_slot,
> +				u64 last_snapshot);

I've noticed that qgroup.h contains the function comments, but it's
preferred to have them in the .c file, next to the code. Please move it
and feel free to send a patch moving the rest.
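
I.e. the header keeps only the bare declaration and the comment moves
next to the definition, e.g.:

	/* qgroup.h */
	int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
				struct extent_buffer *src_parent, int src_slot,
				struct extent_buffer *dst_parent, int dst_slot,
				u64 last_snapshot);

	/* qgroup.c */
	/*
	 * Inform qgroup to trace subtree swap used in balance.
	 * Unlike btrfs_qgroup_trace_subtree(), this function will only
	 * trace new tree blocks whose generation is equal to (or larger
	 * than) @last_snapshot.
	 */
	int btrfs_qgroup_trace_subtree_swap(...)
	{
		...
	}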

>  int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>  				u64 num_bytes, struct ulist *old_roots,
>  				struct ulist *new_roots);
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 8783a1776540..07ab61a740ae 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -1879,14 +1879,9 @@ int replace_path(struct btrfs_trans_handle *trans,
>  		 *    and tree block numbers, if current trans doesn't free
>  		 *    data reloc tree inode.
>  		 */
> -		ret = btrfs_qgroup_trace_subtree(trans, parent,
> -				btrfs_header_generation(parent),
> -				btrfs_header_level(parent));
> -		if (ret < 0)
> -			break;
> -		ret = btrfs_qgroup_trace_subtree(trans, path->nodes[level],
> -				btrfs_header_generation(path->nodes[level]),
> -				btrfs_header_level(path->nodes[level]));
> +		ret = btrfs_qgroup_trace_subtree_swap(trans, parent, slot,
> +				path->nodes[level], path->slots[level],
> +				last_snapshot);
>  		if (ret < 0)
>  			break;
>  
> -- 
> 2.18.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 5/7] btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
  2018-09-11  5:38 ` [PATCH v3 5/7] btrfs: qgroup: Don't trace subtree if we're dropping reloc tree Qu Wenruo
@ 2018-09-26 14:35   ` David Sterba
  0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2018-09-26 14:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Sep 11, 2018 at 01:38:16PM +0800, Qu Wenruo wrote:
> Reloc tree doesn't contribute to qgroup numbers, as we have
> accounted them at balance time (check replace_path()).
> 
> Skipping such unneeded subtree tracing should reduce some performance
> overhead.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent-tree.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index de6f75f5547b..4588153f414c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8643,7 +8643,13 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
>  			parent = 0;
>  		}
>  
> -		if (need_account) {
> +		/*
> +		 * Tree reloc tree doesn't contribute to qgroup numbers, and

As you're going to send another iteration, please update the 'tree reloc
tree' references here and in other patches too.

> +		 * we have already accounted them at merge time (replace_path),
> +		 * thus we could skip expensive subtree trace here.
> +		 */
> +		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
> +		    need_account) {
>  			ret = btrfs_qgroup_trace_subtree(trans, next,
>  							 generation, level - 1);
>  			if (ret) {
> -- 
> 2.18.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 6/7] btrfs: delayed-ref: Introduce new parameter for btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing
  2018-09-11  5:38 ` [PATCH v3 6/7] btrfs: delayed-ref: Introduce new parameter for btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing Qu Wenruo
@ 2018-09-26 14:40   ` David Sterba
  2018-09-27  5:31     ` Qu Wenruo
  0 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2018-09-26 14:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Sep 11, 2018 at 01:38:17PM +0800, Qu Wenruo wrote:
> For btrfs_add_delayed_tree_ref(), its ref_root parameter can be
> different from its real root.
> This is pretty common for the reloc tree; in that case @ref_root is
> passed as the original tree owner (the source file tree).
> 
> However btrfs_add_delayed_tree_ref() uses @ref_root to determine whether
> we should do heavy qgroup tracing for it, so for balance this could
> cause a lot of unnecessary extent tracing.
> 
> This patch will introduce a parameter, @root for qgroup code to
> determine whether we really need to do the heavy-lifting tracing work.
> 
> This patch would reduce the qgroup dirty extent number from 'number of
> extents being relocated' to 0 for the transaction before real tree swap.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/delayed-ref.c |  5 +++--
>  fs/btrfs/delayed-ref.h |  1 +
>  fs/btrfs/extent-tree.c | 17 ++++++++++-------
>  3 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 62ff545ba1f7..7f681dd36e7f 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -710,8 +710,9 @@ static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
>   * transaction commits.
>   */
>  int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
> +			       struct btrfs_root *root,
>  			       u64 bytenr, u64 num_bytes, u64 parent,
> -			       u64 ref_root,  int level, int action,
> +			       u64 ref_root, int level, int action,

Unrelated change

>  			       struct btrfs_delayed_extent_op *extent_op,
>  			       int *old_ref_mod, int *new_ref_mod)
>  {
> @@ -737,7 +738,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>  	}
>  
>  	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
> -	    is_fstree(ref_root)) {
> +	    is_fstree(root->root_key.objectid)) {
>  		record = kmalloc(sizeof(*record), GFP_NOFS);
>  		if (!record) {
>  			kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index d9f2a4ebd5db..a51560a3cb66 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -235,6 +235,7 @@ static inline void btrfs_put_delayed_ref_head(struct btrfs_delayed_ref_head *hea
>  }
>  
>  int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
> +			       struct btrfs_root *root,
>  			       u64 bytenr, u64 num_bytes, u64 parent,
>  			       u64 ref_root, int level, int action,
>  			       struct btrfs_delayed_extent_op *extent_op,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4588153f414c..7a3c650b11f8 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2040,9 +2040,10 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>  			   owner, offset, BTRFS_ADD_DELAYED_REF);
>  
>  	if (owner < BTRFS_FIRST_FREE_OBJECTID) {
> -		ret = btrfs_add_delayed_tree_ref(trans, bytenr,
> +		ret = btrfs_add_delayed_tree_ref(trans, root, bytenr,
>  						 num_bytes, parent,
> -						 root_objectid, (int)owner,
> +						 root_objectid,
> +						 (int)owner,

Here as well and a few more below. The (int) typecast is actually a bug
as it trims the owner to just 32bits, so if you want to update the
formatting, please send a separate patch.

>  						 BTRFS_ADD_DELAYED_REF, NULL,
>  						 &old_ref_mod, &new_ref_mod);
>  	} else {
> @@ -6993,7 +6994,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
>  				   root->root_key.objectid,
>  				   btrfs_header_level(buf), 0,
>  				   BTRFS_DROP_DELAYED_REF);
> -		ret = btrfs_add_delayed_tree_ref(trans, buf->start,
> +		ret = btrfs_add_delayed_tree_ref(trans, root, buf->start,
>  						 buf->len, parent,
>  						 root->root_key.objectid,
>  						 btrfs_header_level(buf),
> @@ -7072,9 +7073,10 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
>  		old_ref_mod = new_ref_mod = 0;
>  		ret = 0;
>  	} else if (owner < BTRFS_FIRST_FREE_OBJECTID) {
> -		ret = btrfs_add_delayed_tree_ref(trans, bytenr,
> +		ret = btrfs_add_delayed_tree_ref(trans, root, bytenr,
>  						 num_bytes, parent,
> -						 root_objectid, (int)owner,
> +						 root_objectid,
> +						 (int)owner,
>  						 BTRFS_DROP_DELAYED_REF, NULL,
>  						 &old_ref_mod, &new_ref_mod);
>  	} else {
> @@ -8293,9 +8295,10 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
>  		btrfs_ref_tree_mod(root, ins.objectid, ins.offset, parent,
>  				   root_objectid, level, 0,
>  				   BTRFS_ADD_DELAYED_EXTENT);
> -		ret = btrfs_add_delayed_tree_ref(trans, ins.objectid,
> +		ret = btrfs_add_delayed_tree_ref(trans, root, ins.objectid,
>  						 ins.offset, parent,
> -						 root_objectid, level,
> +						 root_objectid,
> +						 level,
>  						 BTRFS_ADD_DELAYED_EXTENT,
>  						 extent_op, NULL, NULL);
>  		if (ret)
> -- 
> 2.18.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/7] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree
  2018-09-26 14:29   ` David Sterba
@ 2018-09-26 14:40     ` Qu Wenruo
  0 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2018-09-26 14:40 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 6017 bytes --]



On 2018/9/26 10:29 PM, David Sterba wrote:
> On Tue, Sep 11, 2018 at 01:38:14PM +0800, Qu Wenruo wrote:
>> Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate
>> all new tree blocks in a reloc tree.
>> So that qgroup could skip unrelated tree blocks during balance, which
>> should hugely speed up balance when quota is enabled.
>>
>> The function qgroup_trace_new_subtree_blocks() itself only cares about
>> new tree blocks in reloc tree.
>>
>> Its main tasks are:
>>
>> 1) Read out tree blocks according to parent pointers
>>
>> 2) Do recursive depth-first search
>>    Will call the same function on all its children tree blocks, with
>>    search level set to current level -1.
>>    And will also skip all children whose generation is smaller than
>>    @last_snapshot.
>>
>> 3) Call qgroup_trace_extent_swap() to trace tree blocks
>>
>> So although we have a parameter list related to the source file tree,
>> it's not used here, only passed through to qgroup_trace_extent_swap().
>> Thus, apart from the tree read code, the core should be pretty short
>> and all about the recursive depth-first search.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/qgroup.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 114 insertions(+)
>>
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index 5155fb42ce79..0702953d70a7 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -1839,6 +1839,120 @@ static int qgroup_trace_extent_swap(struct btrfs_trans_handle* trans,
>>  	return ret;
>>  }
>>  
>> +/*
>> + * Helper function to do recursive generation aware depth-first search, to
>> + * locate all new tree blocks in a subtree of tree reloc tree.
>> + *
>> + * E.g. (OO = Old tree blocks, NN = New tree blocks, whose gen == last_snapshot)
>> + *       Tree reloc tree
>> + * L2         NN (a)
>> + *          /    \
>> + * L1    OO        NN (b)
>> + *      /  \      /  \
>> + * L0  OO  OO    OO  NN
>> + *               (c) (d)
>> + * If we pass:
>> + * @dst_path = [ nodes[1] = NN(b), nodes[0] = NULL ],
>> + * @cur_level = 1
>> + * @root_level = 1
>> + *
>> + * We will iterate through tree blocks NN(b), NN(d) and inform qgroup to
>> + * trace the above tree blocks along with their counterparts in the file
>> + * tree. During the search, the old tree block OO(c) will be skipped as
>> + * the tree block swap won't affect OO(c).
>> + */
>> +static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
>> +					   struct extent_buffer *src_eb,
>> +					   struct btrfs_path *dst_path,
>> +					   int cur_level, int root_level,
>> +					   u64 last_snapshot)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct extent_buffer *eb;
>> +	bool need_cleanup = false;
>> +	int ret = 0;
>> +	int i;
> 
> This function could be called recursively based on the value of
> cur_level, so please add some sanity checks so this does not accidentally
> escape. It needs to be a plain if/return, not just an assert or bug-on.

OK, I'll add a cur_level check here, and also enhance the tree-checker
to do a level check for nodes/leaves, so we can catch level problems
even earlier and ensure every level passed into this function is valid.

Thanks,
Qu

> 
>> +
>> +	/* Read the tree block if needed */
>> +	if (dst_path->nodes[cur_level] == NULL) {
>> +		struct btrfs_key first_key;
>> +		int parent_slot;
>> +		u64 child_gen;
>> +		u64 child_bytenr;
>> +
>> +		/*
>> +		 * We need to get child blockptr/gen from parent before
>> +		 * we can read it.
>> +		  */
>> +		eb = dst_path->nodes[cur_level + 1];
>> +		parent_slot = dst_path->slots[cur_level + 1];
>> +		child_bytenr = btrfs_node_blockptr(eb, parent_slot);
>> +		child_gen = btrfs_node_ptr_generation(eb, parent_slot);
>> +		btrfs_node_key_to_cpu(eb, &first_key, parent_slot);
>> +
>> +		/* This node is old, no need to trace */
>> +		if (child_gen < last_snapshot)
>> +			goto out;
>> +
>> +		eb = read_tree_block(fs_info, child_bytenr, child_gen,
>> +				     cur_level, &first_key);
>> +		if (IS_ERR(eb)) {
>> +			ret = PTR_ERR(eb);
>> +			goto out;
>> +		} else if (!extent_buffer_uptodate(eb)) {
>> +			free_extent_buffer(eb);
>> +			ret = -EIO;
>> +			goto out;
>> +		}
>> +
>> +		dst_path->nodes[cur_level] = eb;
>> +		dst_path->slots[cur_level] = 0;
>> +
>> +		btrfs_tree_read_lock(eb);
>> +		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
>> +		dst_path->locks[cur_level] = BTRFS_READ_LOCK_BLOCKING;
>> +		need_cleanup = true;
>> +	}
>> +
>> +	/* Now record this tree block and its counter part for qgroups */
>> +	ret = qgroup_trace_extent_swap(trans, src_eb, dst_path, cur_level,
>> +				       root_level);
>> +	if (ret < 0)
>> +		goto cleanup;
>> +
>> +	eb = dst_path->nodes[cur_level];
>> +
>> +	if (cur_level > 0) {
>> +		/* Iterate all children tree blocks */
>> +		for (i = 0; i < btrfs_header_nritems(eb); i++) {
>> +			/* Skip old tree blocks as they won't be swapped */
>> +			if (btrfs_node_ptr_generation(eb, i) < last_snapshot)
>> +				continue;
>> +			dst_path->slots[cur_level] = i;
>> +
>> +			/* Recursive call (at most 7 times) */
>> +			ret = qgroup_trace_new_subtree_blocks(trans, src_eb,
>> +					dst_path, cur_level - 1, root_level,
>> +					last_snapshot);
>> +			if (ret < 0)
>> +				goto cleanup;
>> +		}
>> +	}
>> +
>> +cleanup:
>> +	if (need_cleanup) {
>> +		/* Clean up */
>> +		btrfs_tree_unlock_rw(dst_path->nodes[cur_level],
>> +				     dst_path->locks[cur_level]);
>> +		free_extent_buffer(dst_path->nodes[cur_level]);
>> +		dst_path->nodes[cur_level] = NULL;
>> +		dst_path->slots[cur_level] = 0;
>> +		dst_path->locks[cur_level] = 0;
>> +	}
>> +out:
>> +	return ret;
>> +}
>> +
>>  int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
>>  			       struct extent_buffer *root_eb,
>>  			       u64 root_gen, int root_level)
>> -- 
>> 2.18.0


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/7] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
  2018-09-26 14:35   ` David Sterba
@ 2018-09-26 14:43     ` Qu Wenruo
  0 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2018-09-26 14:43 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 8852 bytes --]



On 2018/9/26 10:35 PM, David Sterba wrote:
> On Tue, Sep 11, 2018 at 01:38:15PM +0800, Qu Wenruo wrote:
>> Before this patch, for quota enabled balance, btrfs needs to mark the
>> whole subtree dirty for quota.
>>
>> E.g.
>> OO = Old tree blocks (from file tree)
>> NN = New tree blocks (from reloc tree)
>>
>>         File tree (src)		          Reloc tree (dst)
>>             OO (a)                              NN (a)
>>            /  \                                /  \
>>      (b) OO    OO (c)                    (b) NN    NN (c)
>>         /  \  /  \                          /  \  /  \
>>        OO  OO OO OO (d)                    OO  OO OO NN (d)
>>
>> For old balance + quota case, quota will mark the whole src and dst tree
>> dirty, including all the 3 old tree blocks in reloc tree.
>>
>> It's doable for a small file tree, or when the new tree blocks are all
>> located at lower levels.
>>
>> But for a large file tree, or when the new tree blocks are located at
>> higher levels, this will lead to marking the whole tree dirty, and be
>> unbelievably slow.
>>
>> This patch will change how we handle such balance with quota enabled
>> case.
>>
>> Now we will search from (b) and (c) for any new tree blocks whose generation
>> is equal to @last_snapshot, and only mark them dirty.
>>
>> In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
>> (NN(a) will be traced when CoW happens for nodeptr modification).
>> And also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when
>> CoW happens for nodeptr modification)
>>
>> For above case, we could skip 3 tree blocks, but for larger tree, we can
>> skip tons of unmodified tree blocks, and hugely speed up balance.
>>
>> This patch will introduce a new function,
>> btrfs_qgroup_trace_subtree_swap(), which will do the following main
>> work:
>>
>> 1) Read out real root eb
>>    And setup basic dst_path for later calls
>> 2) Call qgroup_trace_new_subtree_blocks()
>>    To trace all new tree blocks in the reloc tree and their
>>    counterparts in the file tree.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/qgroup.c     | 94 +++++++++++++++++++++++++++++++++++++++++++
>>  fs/btrfs/qgroup.h     | 10 +++++
>>  fs/btrfs/relocation.c | 11 ++---
>>  3 files changed, 107 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index 0702953d70a7..a94027b2620e 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -1953,6 +1953,100 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
>>  	return ret;
>>  }
>>  
>> +/*
>> + * For balance subtree swap.
>> + *
>> + * Will go down the tree block pointed by @dst_eb (pointed by @dst_parent and
>> + * @dst_slot), and find any tree blocks whose generation is at least
>> + * @last_snapshot, and then go down @src_eb (pointed by @src_parent and
>> + * @src_slot) to find the counterpart of the tree block, then mark both
>> + * tree blocks as qgroup dirty, and skip all tree blocks whose generation
>> + * is smaller than last_snapshot.
>> + *
>> + * This would skip tons of tree blocks of original btrfs_qgroup_trace_subtree(),
>> + * which is the culprit of super slow balance if the file tree is large.
>> + *
>> + * @src_parent, @src_slot: pointer to src (file tree) eb.
>> + * @dst_parent, @dst_slot: pointer to dst (tree reloc tree) eb.
>> + */
>> +int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
>> +				struct extent_buffer *src_parent, int src_slot,
>> +				struct extent_buffer *dst_parent, int dst_slot,
>> +				u64 last_snapshot)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_path *dst_path = NULL;
>> +	struct btrfs_key first_key;
>> +	struct extent_buffer *src_eb = NULL;
>> +	struct extent_buffer *dst_eb = NULL;
>> +	u64 child_gen;
>> +	u64 child_bytenr;
>> +	int level;
>> +	int ret;
>> +
>> +	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
>> +		return 0;
>> +
>> +	/* Wrong parameter order */
>> +	BUG_ON(btrfs_node_ptr_generation(src_parent, src_slot) >
>> +		btrfs_node_ptr_generation(dst_parent, dst_slot));
> 
> If this is to catch development-time errors, then an ASSERT is probably
> better, but it looks like this needs to be an ordinary runtime sanity
> check too.

The runtime check is already done before entering the real relocation
code.

Relocation checks the tree generation before really doing the swap.

So it is a development-time error.

> 
>> +
>> +	/* Read out real @src_eb, pointed by @src_parent and @src_slot */
>> +	child_bytenr = btrfs_node_blockptr(src_parent, src_slot);
>> +	child_gen = btrfs_node_ptr_generation(src_parent, src_slot);
>> +	btrfs_node_key_to_cpu(src_parent, &first_key, src_slot);
>> +
>> +	src_eb = read_tree_block(fs_info, child_bytenr, child_gen,
>> +			btrfs_header_level(src_parent) - 1, &first_key);
>> +	if (IS_ERR(src_eb)) {
>> +		ret = PTR_ERR(src_eb);
>> +		goto out;
>> +	}
>> +
>> +	/* Read out real @dst_eb, pointed by @dst_parent and @dst_slot */
>> +	child_bytenr = btrfs_node_blockptr(dst_parent, dst_slot);
>> +	child_gen = btrfs_node_ptr_generation(dst_parent, dst_slot);
>> +	btrfs_node_key_to_cpu(dst_parent, &first_key, dst_slot);
>> +
>> +	dst_eb = read_tree_block(fs_info, child_bytenr, child_gen,
>> +			btrfs_header_level(dst_parent) - 1, &first_key);
>> +	if (IS_ERR(dst_eb)) {
>> +		ret = PTR_ERR(dst_eb);
>> +		goto out;
>> +	}
>> +
>> +	if (!extent_buffer_uptodate(src_eb) || !extent_buffer_uptodate(dst_eb)) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>> +	level = btrfs_header_level(dst_eb);
>> +	dst_path = btrfs_alloc_path();
>> +	if (!dst_path) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +	/* For dst_path */
>> +	extent_buffer_get(dst_eb);
>> +	dst_path->nodes[level] = dst_eb;
>> +	dst_path->slots[level] = 0;
>> +	dst_path->locks[level] = 0;
>> +
>> +	/* Do the generation-aware depth-first search */
>> +	ret = qgroup_trace_new_subtree_blocks(trans, src_eb, dst_path, level,
>> +					      level, last_snapshot);
>> +	if (ret < 0)
>> +		goto out;
>> +	ret = 0;
>> +
>> +out:
>> +	free_extent_buffer(src_eb);
>> +	free_extent_buffer(dst_eb);
>> +	btrfs_free_path(dst_path);
>> +	if (ret < 0)
>> +		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
>> +	return ret;
>> +}
>> +
>>  int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
>>  			       struct extent_buffer *root_eb,
>>  			       u64 root_gen, int root_level)
>> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
>> index 54b8bb282c0e..9f9943dfd493 100644
>> --- a/fs/btrfs/qgroup.h
>> +++ b/fs/btrfs/qgroup.h
>> @@ -236,6 +236,16 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
>>  int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
>>  			       struct extent_buffer *root_eb,
>>  			       u64 root_gen, int root_level);
>> +
>> +/*
>> + * Inform qgroup to trace subtree swap used in balance.
>> + * Unlike btrfs_qgroup_trace_subtree(), this function will only trace
>> + * new tree blocks whose generation is equal to (or larger than) @last_snapshot.
>> + */
>> +int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
>> +				struct extent_buffer *src_parent, int src_slot,
>> +				struct extent_buffer *dst_parent, int dst_slot,
>> +				u64 last_snapshot);
> 
> I've noticed that qgroup.h contains the function comments, but it's
> preferred to have them in the .c file, next to the code. Please move it
> and feel free to send a patch moving the rest.

A new code style rule learned.

Will update the patchset to reflect this comment.

Thanks,
Qu

> 
>>  int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>>  				u64 num_bytes, struct ulist *old_roots,
>>  				struct ulist *new_roots);
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 8783a1776540..07ab61a740ae 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -1879,14 +1879,9 @@ int replace_path(struct btrfs_trans_handle *trans,
>>  		 *    and tree block numbers, if current trans doesn't free
>>  		 *    data reloc tree inode.
>>  		 */
>> -		ret = btrfs_qgroup_trace_subtree(trans, parent,
>> -				btrfs_header_generation(parent),
>> -				btrfs_header_level(parent));
>> -		if (ret < 0)
>> -			break;
>> -		ret = btrfs_qgroup_trace_subtree(trans, path->nodes[level],
>> -				btrfs_header_generation(path->nodes[level]),
>> -				btrfs_header_level(path->nodes[level]));
>> +		ret = btrfs_qgroup_trace_subtree_swap(trans, parent, slot,
>> +				path->nodes[level], path->slots[level],
>> +				last_snapshot);
>>  		if (ret < 0)
>>  			break;
>>  
>> -- 
>> 2.18.0


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 6/7] btrfs: delayed-ref: Introduce new parameter for btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing
  2018-09-26 14:40   ` David Sterba
@ 2018-09-27  5:31     ` Qu Wenruo
  0 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2018-09-27  5:31 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 5845 bytes --]



On 2018/9/26 10:40 PM, David Sterba wrote:
> On Tue, Sep 11, 2018 at 01:38:17PM +0800, Qu Wenruo wrote:
>> For btrfs_add_delayed_tree_ref(), its ref_root parameter can be
>> different from its real root.
>> This is pretty common for the reloc tree; in that case @ref_root is
>> passed as the original tree owner (the source file tree).
>>
>> However btrfs_add_delayed_tree_ref() uses @ref_root to determine whether
>> we should do heavy qgroup tracing for it, so for balance this could
>> cause a lot of unnecessary extent tracing.
>>
>> This patch will introduce a parameter, @root for qgroup code to
>> determine whether we really need to do the heavy-lifting tracing work.
>>
>> This patch would reduce the qgroup dirty extent number from 'number of
>> extents being relocated' to 0 for the transaction before real tree swap.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/delayed-ref.c |  5 +++--
>>  fs/btrfs/delayed-ref.h |  1 +
>>  fs/btrfs/extent-tree.c | 17 ++++++++++-------
>>  3 files changed, 14 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index 62ff545ba1f7..7f681dd36e7f 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -710,8 +710,9 @@ static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
>>   * transaction commits.
>>   */
>>  int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>> +			       struct btrfs_root *root,
>>  			       u64 bytenr, u64 num_bytes, u64 parent,
>> -			       u64 ref_root,  int level, int action,
>> +			       u64 ref_root, int level, int action,
> 
> Unrelated change

Will avoid it in the next update.

> 
>>  			       struct btrfs_delayed_extent_op *extent_op,
>>  			       int *old_ref_mod, int *new_ref_mod)
>>  {
>> @@ -737,7 +738,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>>  	}
>>  
>>  	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
>> -	    is_fstree(ref_root)) {
>> +	    is_fstree(root->root_key.objectid)) {
>>  		record = kmalloc(sizeof(*record), GFP_NOFS);
>>  		if (!record) {
>>  			kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
>> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
>> index d9f2a4ebd5db..a51560a3cb66 100644
>> --- a/fs/btrfs/delayed-ref.h
>> +++ b/fs/btrfs/delayed-ref.h
>> @@ -235,6 +235,7 @@ static inline void btrfs_put_delayed_ref_head(struct btrfs_delayed_ref_head *hea
>>  }
>>  
>>  int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>> +			       struct btrfs_root *root,
>>  			       u64 bytenr, u64 num_bytes, u64 parent,
>>  			       u64 ref_root, int level, int action,
>>  			       struct btrfs_delayed_extent_op *extent_op,
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 4588153f414c..7a3c650b11f8 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -2040,9 +2040,10 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>>  			   owner, offset, BTRFS_ADD_DELAYED_REF);
>>  
>>  	if (owner < BTRFS_FIRST_FREE_OBJECTID) {
>> -		ret = btrfs_add_delayed_tree_ref(trans, bytenr,
>> +		ret = btrfs_add_delayed_tree_ref(trans, root, bytenr,
>>  						 num_bytes, parent,
>> -						 root_objectid, (int)owner,
>> +						 root_objectid,
>> +						 (int)owner,
> 
> Here as well and a few more below. The (int) typecast is actually a bug
> as it trims the owner to just 32bits, so if you want to update the
> formatting, please send a separate patch.

Will avoid the unrelated change in the next update.

But for the "bug" part, it's a mess.

Since owner is smaller than BTRFS_FIRST_FREE_OBJECTID, truncating to
32bits is in fact completely OK.
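
(A quick user-space check of that claim; any value below
BTRFS_FIRST_FREE_OBJECTID (256), and in particular any tree level 0~7,
survives the u64 -> int -> u64 round trip:)

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t owner;

	/* "owner" here is really the tree level, far below 256 */
	for (owner = 0; owner < 256; owner++)
		assert((uint64_t)(int)owner == owner);
	return 0;
}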

However, the declaration of btrfs_add_delayed_tree_ref() says that
parameter is in fact @level.
On the other hand, ref->level is passed to __btrfs_inc_extent_ref() as
@owner (and as u64) again, so it's not causing problems (or we're
already badly screwed up).

The naming and the forced conversion are really a mess here though.

I'll just keep that mess as-is and not touch it in this patchset.

Thanks,
Qu

> 
>>  						 BTRFS_ADD_DELAYED_REF, NULL,
>>  						 &old_ref_mod, &new_ref_mod);
>>  	} else {
>> @@ -6993,7 +6994,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
>>  				   root->root_key.objectid,
>>  				   btrfs_header_level(buf), 0,
>>  				   BTRFS_DROP_DELAYED_REF);
>> -		ret = btrfs_add_delayed_tree_ref(trans, buf->start,
>> +		ret = btrfs_add_delayed_tree_ref(trans, root, buf->start,
>>  						 buf->len, parent,
>>  						 root->root_key.objectid,
>>  						 btrfs_header_level(buf),
>> @@ -7072,9 +7073,10 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
>>  		old_ref_mod = new_ref_mod = 0;
>>  		ret = 0;
>>  	} else if (owner < BTRFS_FIRST_FREE_OBJECTID) {
>> -		ret = btrfs_add_delayed_tree_ref(trans, bytenr,
>> +		ret = btrfs_add_delayed_tree_ref(trans, root, bytenr,
>>  						 num_bytes, parent,
>> -						 root_objectid, (int)owner,
>> +						 root_objectid,
>> +						 (int)owner,
>>  						 BTRFS_DROP_DELAYED_REF, NULL,
>>  						 &old_ref_mod, &new_ref_mod);
>>  	} else {
>> @@ -8293,9 +8295,10 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
>>  		btrfs_ref_tree_mod(root, ins.objectid, ins.offset, parent,
>>  				   root_objectid, level, 0,
>>  				   BTRFS_ADD_DELAYED_EXTENT);
>> -		ret = btrfs_add_delayed_tree_ref(trans, ins.objectid,
>> +		ret = btrfs_add_delayed_tree_ref(trans, root, ins.objectid,
>>  						 ins.offset, parent,
>> -						 root_objectid, level,
>> +						 root_objectid,
>> +						 level,
>>  						 BTRFS_ADD_DELAYED_EXTENT,
>>  						 extent_op, NULL, NULL);
>>  		if (ret)
>> -- 
>> 2.18.0


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-09-27  5:31 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-11  5:38 [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata Qu Wenruo
2018-09-11  5:38 ` [PATCH v3 1/7] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted Qu Wenruo
2018-09-26 14:06   ` David Sterba
2018-09-11  5:38 ` [PATCH v3 2/7] btrfs: qgroup: Introduce function to trace two swaped extents Qu Wenruo
2018-09-11  5:38 ` [PATCH v3 3/7] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree Qu Wenruo
2018-09-26 14:29   ` David Sterba
2018-09-26 14:40     ` Qu Wenruo
2018-09-11  5:38 ` [PATCH v3 4/7] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents Qu Wenruo
2018-09-26 14:35   ` David Sterba
2018-09-26 14:43     ` Qu Wenruo
2018-09-11  5:38 ` [PATCH v3 5/7] btrfs: qgroup: Don't trace subtree if we're dropping reloc tree Qu Wenruo
2018-09-26 14:35   ` David Sterba
2018-09-11  5:38 ` [PATCH v3 6/7] btrfs: delayed-ref: Introduce new parameter for btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing Qu Wenruo
2018-09-26 14:40   ` David Sterba
2018-09-27  5:31     ` Qu Wenruo
2018-09-11  5:38 ` [PATCH v3 7/7] btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group Qu Wenruo
2018-09-26 14:06 ` [PATCH v3 0/7] btrfs: qgroup: Reduce dirty extents for metadata David Sterba
2018-09-26 14:17   ` Qu Wenruo
2018-09-26 14:25     ` Qu Wenruo
2018-09-26 14:26     ` David Sterba
