* [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance
@ 2018-09-07  9:32 Qu Wenruo
  2018-09-07  9:32 ` [PATCH v2 1/5] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted Qu Wenruo
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Qu Wenruo @ 2018-09-07  9:32 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from github:
https://github.com/adam900710/linux/tree/qgroup_balance_skip_trees
The base commit is v4.19-rc1 tag.

There are a lot of reports of system hangs during balance on quota
enabled filesystems.
It's most obvious on large filesystems.

The hang is caused by tons of unmodified extents marked as qgroup dirty.
Such unmodified/unrelated sources include:
1) Unmodified subtree
2) Subtree drop for reloc tree
(BTW, other sources include unmodified file extent items)

E.g.
OO = Old tree blocks from file tree
NN = New tree blocks from reloc tree

        file tree                              reloc tree
           OO (a)                                  NN (a)
          /  \                                    /  \
    (b) OO    OO (c)                        (b) NN    NN (c)
       / \   / \                               / \   / \
     OO  OO OO  OO                           OO  OO OO  NN
    (d) (e) (f) (g)                         (d) (e) (f) (g)

In the above case, balance will modify the nodeptrs in OO(a) to point to
NN(b) and NN(c), and modify NN(a) to point to OO(b) and OO(c).

Before this patchset, quota will mark the whole subtree, from its parent
down to the leaves, as dirty.
So btrfs quota needs to trace all tree blocks from (a) to (g).

However, tree blocks (d), (e) and (f) are shared between both trees, so
there is no need to trace those 3 tree blocks.

This patchset changes how this works by only tracing modified tree
blocks in the reloc tree, and their counterparts in the file tree.

The nodeptr swap will happen for tree blocks (b) and (c) in both trees.

For tree block (b), in the reloc tree we find that all its children's
generations are smaller than last_snapshot, so there is no need to
trace them; we only need to trace NN(b) and its counterpart OO(b).

For tree block (c), in the reloc tree we find that its child NN(g) needs
tracing, and that tree block NN(g) itself has no children that need
tracing.

So for the subtree starting at tree block NN(c), we need to trace NN(c)
and NN(g), along with their counterparts OO(c) and OO(g).

With this patchset, we can skip tree blocks OO(d)~OO(f) in the above
example, thus reducing some of the overhead caused by qgroup.
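
The skip decision boils down to a single generation check against
last_snapshot. A minimal sketch of that check (the helper name is
illustrative, not part of the patchset):

  /*
   * Children written before the last snapshot are shared between the
   * file tree and the reloc tree and are never swapped, so they need
   * no qgroup tracing.
   */
  static bool child_needs_trace(struct extent_buffer *parent, int slot,
                                u64 last_snapshot)
  {
          return btrfs_node_ptr_generation(parent, slot) >= last_snapshot;
  }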

The improvement is mostly related to metadata relocation.
If some high-level tree blocks get relocated while their children are
still unmodified, we can save a lot of time.

Even in the worst case, it should be no worse than the original full
subtree marking method.

A real-world benchmark is under way.

Changelog:
v2:
  Rename "tree reloc tree" to "reloc tree".
  Add patch "Don't trace subtree if we're dropping reloc tree" to the
  patchset.
  Fix a wrong btrfs_bin_search() call, which led to an unexpected ENOENT
  error in btrfs_qgroup_trace_extent_swap(). Now use dst_path->slots[]
  directly.

Qu Wenruo (5):
  btrfs: qgroup: Introduce trace event to analyse the number of dirty
    extents accounted
  btrfs: qgroup: Introduce function to trace two swapped extents
  btrfs: qgroup: Introduce function to find all new tree blocks of reloc
    tree
  btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
  btrfs: qgroup: Don't trace subtree if we're dropping reloc tree

 fs/btrfs/extent-tree.c       |   8 +-
 fs/btrfs/qgroup.c            | 338 +++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.h            |  10 ++
 fs/btrfs/relocation.c        |  11 +-
 include/trace/events/btrfs.h |  21 +++
 5 files changed, 379 insertions(+), 9 deletions(-)

-- 
2.18.0


* [PATCH v2 1/5] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted
  2018-09-07  9:32 [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
@ 2018-09-07  9:32 ` Qu Wenruo
  2018-09-07  9:32 ` [PATCH v2 2/5] btrfs: qgroup: Introduce function to trace two swapped extents Qu Wenruo
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2018-09-07  9:32 UTC (permalink / raw)
  To: linux-btrfs

The number of qgroup dirty extents is directly linked to the performance
overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
record how many dirty extents are processed in
btrfs_qgroup_account_extents().

This will be pretty handy for analysing later balance performance
improvements.
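
Once applied, the event can be enabled through the standard tracefs
interface, e.g. (assuming tracefs is mounted at the usual debugfs
location):

  # echo 1 > /sys/kernel/debug/tracing/events/btrfs/qgroup_num_dirty_extents/enable
  # cat /sys/kernel/debug/tracing/trace_pipe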

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c            |  4 ++++
 include/trace/events/btrfs.h | 21 +++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4353bb69bb86..5977eedc00d8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2133,6 +2133,7 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct ulist *new_roots = NULL;
 	struct rb_node *node;
+	u64 num_dirty_extents = 0;
 	u64 qgroup_to_skip;
 	int ret = 0;
 
@@ -2142,6 +2143,7 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 		record = rb_entry(node, struct btrfs_qgroup_extent_record,
 				  node);
 
+		num_dirty_extents++;
 		trace_btrfs_qgroup_account_extents(fs_info, record);
 
 		if (!ret) {
@@ -2187,6 +2189,8 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 		kfree(record);
 
 	}
+	trace_qgroup_num_dirty_extents(fs_info, trans->transid,
+				       num_dirty_extents);
 	return ret;
 }
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b401c4e36394..79253666d1d0 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1575,6 +1575,27 @@ DEFINE_EVENT(btrfs_qgroup_extent, btrfs_qgroup_trace_extent,
 	TP_ARGS(fs_info, rec)
 );
 
+TRACE_EVENT(qgroup_num_dirty_extents,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 transid,
+		 u64 num_dirty_extents),
+
+	TP_ARGS(fs_info, transid, num_dirty_extents),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64, transid			)
+		__field(	u64, num_dirty_extents		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->transid	   = transid;
+		__entry->num_dirty_extents = num_dirty_extents;
+	),
+
+	TP_printk_btrfs("transid=%llu num_dirty_extents=%llu",
+		__entry->transid, __entry->num_dirty_extents)
+);
+
 TRACE_EVENT(btrfs_qgroup_account_extent,
 
 	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 transid, u64 bytenr,
-- 
2.18.0


* [PATCH v2 2/5] btrfs: qgroup: Introduce function to trace two swapped extents
  2018-09-07  9:32 [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
  2018-09-07  9:32 ` [PATCH v2 1/5] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted Qu Wenruo
@ 2018-09-07  9:32 ` Qu Wenruo
  2018-09-07  9:32 ` [PATCH v2 3/5] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree Qu Wenruo
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2018-09-07  9:32 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, qgroup_trace_extent_swap(), which will be used
later for the balance qgroup speedup.

The basic idea of balance is swapping tree blocks between the reloc tree
and the real file tree.

The swap will happen at the highest tree block, but there may be a lot
of tree blocks involved.

For example:
 OO = Old tree blocks
 NN = New tree blocks allocated during balance

          File tree (257)                  Reloc tree for 257
L2              OO                                NN
              /    \                            /    \
L1          OO      OO (a)                    OO      NN (a)
           / \     / \                       / \     / \
L0       OO   OO OO   OO                   OO   OO NN   NN
                 (b)  (c)                          (b)  (c)

When calling qgroup_trace_extent_swap(), we will pass:
@src_eb = OO(a)
@dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
@dst_level = 0
@root_level = 1

In that case, qgroup_trace_extent_swap() will search from OO(a) to
reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.

The main work of qgroup_trace_extent_swap() can be split into 3 parts:

1) Tree search from @src_eb
   It acts as a simplified btrfs_search_slot().
   The key for search can be extracted from @dst_path->nodes[dst_level]
   (first key).

2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
   NOTE: In the above case, OO(a) and NN(a) won't be marked qgroup dirty.
   They should be marked during the previous (@dst_level = 1) iteration.

3) Mark file extents in leaves dirty
   We don't have a good way to pick out new file extents only.
   So we still follow the old method of scanning all file extents in
   the leaf.

This function frees us from keeping two paths, thus later we only need
to care about how to iterate over all new tree blocks in the reloc tree.
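
For the example above, a caller holding @src_eb = OO(a) and a populated
@dst_path would trace the level 0 pair with something like the following
(illustrative fragment only; error handling omitted):

  /* dst_path->nodes[1] = NN(a), dst_path->nodes[0] = NN(c) */
  ret = qgroup_trace_extent_swap(trans, src_eb, dst_path,
                                 0 /* dst_level */, 1 /* root_level */);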

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 5977eedc00d8..5155fb42ce79 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1713,6 +1713,132 @@ static int adjust_slots_upwards(struct btrfs_path *path, int root_level)
 	return 0;
 }
 
+/*
+ * Helper function to trace a subtree tree block swap.
+ *
+ * The swapped tree block is marked by @dst_path->nodes[level].
+ * Thus @dst_path should have been populated already.
+ *
+ * While for the src tree block, we only need the root eb as @src_eb.
+ * The needed path and search will be done by this function.
+ */
+static int qgroup_trace_extent_swap(struct btrfs_trans_handle* trans,
+				    struct extent_buffer *src_eb,
+				    struct btrfs_path *dst_path,
+				    int dst_level, int root_level)
+{
+	struct btrfs_key key;
+	struct btrfs_path *src_path;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	u32 nodesize = fs_info->nodesize;
+	int cur_level = root_level;
+	int ret;
+
+	BUG_ON(dst_level > root_level);
+	/* Level mismatch */
+	if (btrfs_header_level(src_eb) != root_level)
+		return -EINVAL;
+
+	src_path = btrfs_alloc_path();
+	if (!src_path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (dst_level)
+		btrfs_node_key_to_cpu(dst_path->nodes[dst_level], &key, 0);
+	else
+		btrfs_item_key_to_cpu(dst_path->nodes[dst_level], &key, 0);
+
+	/* For src_path */
+	extent_buffer_get(src_eb);
+	src_path->nodes[root_level] = src_eb;
+	src_path->slots[root_level] = dst_path->slots[root_level];
+	src_path->locks[root_level] = 0;
+
+	/* A simplified version of btrfs_search_slot() */
+	while (cur_level >= dst_level) {
+		struct btrfs_key src_key;
+		struct btrfs_key dst_key;
+
+		if (src_path->nodes[cur_level] == NULL) {
+			struct btrfs_key first_key;
+			struct extent_buffer *eb;
+			int parent_slot;
+			u64 child_gen;
+			u64 child_bytenr;
+
+			eb = src_path->nodes[cur_level + 1];
+			parent_slot = src_path->slots[cur_level + 1];
+			child_bytenr = btrfs_node_blockptr(eb, parent_slot);
+			child_gen = btrfs_node_ptr_generation(eb, parent_slot);
+			btrfs_node_key_to_cpu(eb, &first_key, parent_slot);
+
+			eb = read_tree_block(fs_info, child_bytenr, child_gen,
+					     cur_level, &first_key);
+			if (IS_ERR(eb)) {
+				ret = PTR_ERR(eb);
+				goto out;
+			} else if (!extent_buffer_uptodate(eb)) {
+				free_extent_buffer(eb);
+				ret = -EIO;
+				goto out;
+			}
+
+			src_path->nodes[cur_level] = eb;
+
+			btrfs_tree_read_lock(eb);
+			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+			src_path->locks[cur_level] = BTRFS_READ_LOCK_BLOCKING;
+		}
+
+		src_path->slots[cur_level] = dst_path->slots[cur_level];
+		if (cur_level) {
+			btrfs_node_key_to_cpu(dst_path->nodes[cur_level], &dst_key,
+					      dst_path->slots[cur_level]);
+			btrfs_node_key_to_cpu(src_path->nodes[cur_level], &src_key,
+					      src_path->slots[cur_level]);
+		} else {
+			btrfs_item_key_to_cpu(dst_path->nodes[cur_level], &dst_key,
+					      dst_path->slots[cur_level]);
+			btrfs_item_key_to_cpu(src_path->nodes[cur_level], &src_key,
+					      src_path->slots[cur_level]);
+		}
+		/* Content mismatch, something went wrong */
+		if (btrfs_comp_cpu_keys(&dst_key, &src_key)) {
+			ret = -ENOENT;
+			goto out;
+		}
+		cur_level--;
+	}
+
+	/*
+	 * Now both @dst_path and @src_path have been populated, record the tree
+	 * blocks for qgroup accounting.
+	 */
+	ret = btrfs_qgroup_trace_extent(trans,
+			src_path->nodes[dst_level]->start,
+			nodesize, GFP_NOFS);
+	if (ret < 0)
+		goto out;
+	ret = btrfs_qgroup_trace_extent(trans,
+			dst_path->nodes[dst_level]->start,
+			nodesize, GFP_NOFS);
+	if (ret < 0)
+		goto out;
+
+	/* Record leaf file extents */
+	if (dst_level == 0) {
+		ret = btrfs_qgroup_trace_leaf_items(trans, src_path->nodes[0]);
+		if (ret < 0)
+			goto out;
+		ret = btrfs_qgroup_trace_leaf_items(trans, dst_path->nodes[0]);
+	}
+out:
+	btrfs_free_path(src_path);
+	return ret;
+}
+
 int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			       struct extent_buffer *root_eb,
 			       u64 root_gen, int root_level)
-- 
2.18.0


* [PATCH v2 3/5] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree
  2018-09-07  9:32 [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
  2018-09-07  9:32 ` [PATCH v2 1/5] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted Qu Wenruo
  2018-09-07  9:32 ` [PATCH v2 2/5] btrfs: qgroup: Introduce function to trace two swapped extents Qu Wenruo
@ 2018-09-07  9:32 ` Qu Wenruo
  2018-09-07  9:32 ` [PATCH v2 4/5] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents Qu Wenruo
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2018-09-07  9:32 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, qgroup_trace_new_subtree_blocks(), to iterate
over all new tree blocks in a reloc tree, so that qgroup can skip
unrelated tree blocks during balance, which should hugely speed up
balance when quota is enabled.

The function qgroup_trace_new_subtree_blocks() itself only cares about
new tree blocks in the reloc tree.

Its main work is:

1) Read out tree blocks according to parent pointers

2) Do a recursive depth-first search
   It calls the same function on all its child tree blocks, with the
   search level set to the current level - 1.
   It also skips all children whose generation is smaller than
   @last_snapshot.

3) Call qgroup_trace_extent_swap() to trace tree blocks

So although the parameter list includes parameters related to the source
file tree, they are not used directly, only passed down to
qgroup_trace_extent_swap(). Thus, apart from the tree read code, the
core is pretty short and all about the recursive depth-first search.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 5155fb42ce79..0702953d70a7 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1839,6 +1839,120 @@ static int qgroup_trace_extent_swap(struct btrfs_trans_handle* trans,
 	return ret;
 }
 
+/*
+ * Helper function to do a recursive generation-aware depth-first search, to
+ * locate all new tree blocks in a subtree of a reloc tree.
+ *
+ * E.g. (OO = Old tree blocks, NN = New tree blocks, whose gen >= last_snapshot)
+ *       Reloc tree
+ * L2         NN (a)
+ *          /    \
+ * L1    OO        NN (b)
+ *      /  \      /  \
+ * L0  OO  OO    OO  NN
+ *               (c) (d)
+ * If we pass:
+ * @dst_path = [ nodes[1] = NN(b), nodes[0] = NULL ],
+ * @cur_level = 1
+ * @root_level = 1
+ *
+ * We will iterate through tree blocks NN(b), NN(d) and inform qgroup to
+ * trace the above tree blocks along with their counterparts in the file tree.
+ * During the search, old tree block OO(c) will be skipped, as the tree
+ * block swap won't affect OO(c).
+ */
+static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
+					   struct extent_buffer *src_eb,
+					   struct btrfs_path *dst_path,
+					   int cur_level, int root_level,
+					   u64 last_snapshot)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct extent_buffer *eb;
+	bool need_cleanup = false;
+	int ret = 0;
+	int i;
+
+	/* Read the tree block if needed */
+	if (dst_path->nodes[cur_level] == NULL) {
+		struct btrfs_key first_key;
+		int parent_slot;
+		u64 child_gen;
+		u64 child_bytenr;
+
+		/*
+		 * We need to get child blockptr/gen from parent before
+		 * we can read it.
+		 */
+		eb = dst_path->nodes[cur_level + 1];
+		parent_slot = dst_path->slots[cur_level + 1];
+		child_bytenr = btrfs_node_blockptr(eb, parent_slot);
+		child_gen = btrfs_node_ptr_generation(eb, parent_slot);
+		btrfs_node_key_to_cpu(eb, &first_key, parent_slot);
+
+		/* This node is old, no need to trace */
+		if (child_gen < last_snapshot)
+			goto out;
+
+		eb = read_tree_block(fs_info, child_bytenr, child_gen,
+				     cur_level, &first_key);
+		if (IS_ERR(eb)) {
+			ret = PTR_ERR(eb);
+			goto out;
+		} else if (!extent_buffer_uptodate(eb)) {
+			free_extent_buffer(eb);
+			ret = -EIO;
+			goto out;
+		}
+
+		dst_path->nodes[cur_level] = eb;
+		dst_path->slots[cur_level] = 0;
+
+		btrfs_tree_read_lock(eb);
+		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+		dst_path->locks[cur_level] = BTRFS_READ_LOCK_BLOCKING;
+		need_cleanup = true;
+	}
+
+	/* Now record this tree block and its counter part for qgroups */
+	ret = qgroup_trace_extent_swap(trans, src_eb, dst_path, cur_level,
+				       root_level);
+	if (ret < 0)
+		goto cleanup;
+
+	eb = dst_path->nodes[cur_level];
+
+	if (cur_level > 0) {
+		/* Iterate all children tree blocks */
+		for (i = 0; i < btrfs_header_nritems(eb); i++) {
+			/* Skip old tree blocks as they won't be swapped */
+			if (btrfs_node_ptr_generation(eb, i) < last_snapshot)
+				continue;
+			dst_path->slots[cur_level] = i;
+
+			/* Recursive call (at most 7 times) */
+			ret = qgroup_trace_new_subtree_blocks(trans, src_eb,
+					dst_path, cur_level - 1, root_level,
+					last_snapshot);
+			if (ret < 0)
+				goto cleanup;
+		}
+	}
+
+cleanup:
+	if (need_cleanup) {
+		/* Clean up */
+		btrfs_tree_unlock_rw(dst_path->nodes[cur_level],
+				     dst_path->locks[cur_level]);
+		free_extent_buffer(dst_path->nodes[cur_level]);
+		dst_path->nodes[cur_level] = NULL;
+		dst_path->slots[cur_level] = 0;
+		dst_path->locks[cur_level] = 0;
+	}
+out:
+	return ret;
+}
+
 int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			       struct extent_buffer *root_eb,
 			       u64 root_gen, int root_level)
-- 
2.18.0


* [PATCH v2 4/5] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
  2018-09-07  9:32 [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
                   ` (2 preceding siblings ...)
  2018-09-07  9:32 ` [PATCH v2 3/5] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree Qu Wenruo
@ 2018-09-07  9:32 ` Qu Wenruo
  2018-09-07  9:32 ` [PATCH v2 5/5] btrfs: qgroup: Don't trace subtree if we're dropping reloc tree Qu Wenruo
  2018-09-11  2:43 ` [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
  5 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2018-09-07  9:32 UTC (permalink / raw)
  To: linux-btrfs

Before this patch, for a balance with quota enabled, btrfs needs to mark
the whole subtree dirty for quota.

E.g.
OO = Old tree blocks (from file tree)
NN = New tree blocks (from reloc tree)

        File tree (src)		          Reloc tree (dst)
            OO (a)                              NN (a)
           /  \                                /  \
     (b) OO    OO (c)                    (b) NN    NN (c)
        /  \  /  \                          /  \  /  \
       OO  OO OO OO (d)                    OO  OO OO NN (d)

For the old balance + quota case, quota will mark the whole src and dst
trees dirty, including all 3 old tree blocks in the reloc tree.

It's acceptable for a small file tree, or when the new tree blocks are
all located at a lower level.

But for a large file tree, or when the new tree blocks are located at a
higher level, this leads to marking the whole tree dirty, which is
unbelievably slow.

This patch changes how we handle such a balance with quota enabled.

Now we will search from (b) and (c) for any new tree blocks whose
generation is not smaller than @last_snapshot, and only mark them dirty.

In the above case, we only need to trace tree blocks NN(b), NN(c) and
NN(d) (NN(a) will be traced when CoW happens for the nodeptr
modification), and also tree blocks OO(b), OO(c) and OO(d) (OO(a) will
be traced when CoW happens for the nodeptr modification).

For the above case we only skip 3 tree blocks, but for a larger tree we
can skip tons of unmodified tree blocks and hugely speed up balance.

This patch introduces a new function,
btrfs_qgroup_trace_subtree_swap(), which does the following main work:

1) Read out the real root eb
   And set up a basic dst_path for later calls
2) Call qgroup_trace_new_subtree_blocks()
   To trace all new tree blocks in the reloc tree and their
   counterparts in the file tree.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c     | 94 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.h     | 10 +++++
 fs/btrfs/relocation.c | 11 ++---
 3 files changed, 107 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 0702953d70a7..a94027b2620e 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1953,6 +1953,100 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
 	return ret;
 }
 
+/*
+ * For balance subtree swap.
+ *
+ * Will go down the subtree rooted at @dst_eb (pointed to by @dst_parent and
+ * @dst_slot), find any tree blocks whose generation is not smaller than
+ * @last_snapshot, then go down @src_eb (pointed to by @src_parent and
+ * @src_slot) to find the counterpart of each block, and mark both tree
+ * blocks as qgroup dirty, skipping blocks whose generation is smaller.
+ *
+ * This skips tons of tree blocks compared to btrfs_qgroup_trace_subtree(),
+ * which is the culprit of super slow balance when the file tree is large.
+ *
+ * @src_parent, @src_slot: pointer to src (file tree) eb.
+ * @dst_parent, @dst_slot: pointer to dst (reloc tree) eb.
+ */
+int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
+				struct extent_buffer *src_parent, int src_slot,
+				struct extent_buffer *dst_parent, int dst_slot,
+				u64 last_snapshot)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_path *dst_path = NULL;
+	struct btrfs_key first_key;
+	struct extent_buffer *src_eb = NULL;
+	struct extent_buffer *dst_eb = NULL;
+	u64 child_gen;
+	u64 child_bytenr;
+	int level;
+	int ret;
+
+	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+		return 0;
+
+	/* Wrong parameter order */
+	BUG_ON(btrfs_node_ptr_generation(src_parent, src_slot) >
+		btrfs_node_ptr_generation(dst_parent, dst_slot));
+
+	/* Read out real @src_eb, pointed by @src_parent and @src_slot */
+	child_bytenr = btrfs_node_blockptr(src_parent, src_slot);
+	child_gen = btrfs_node_ptr_generation(src_parent, src_slot);
+	btrfs_node_key_to_cpu(src_parent, &first_key, src_slot);
+
+	src_eb = read_tree_block(fs_info, child_bytenr, child_gen,
+			btrfs_header_level(src_parent) - 1, &first_key);
+	if (IS_ERR(src_eb)) {
+		ret = PTR_ERR(src_eb);
+		goto out;
+	}
+
+	/* Read out real @dst_eb, pointed by @dst_parent and @dst_slot */
+	child_bytenr = btrfs_node_blockptr(dst_parent, dst_slot);
+	child_gen = btrfs_node_ptr_generation(dst_parent, dst_slot);
+	btrfs_node_key_to_cpu(dst_parent, &first_key, dst_slot);
+
+	dst_eb = read_tree_block(fs_info, child_bytenr, child_gen,
+			btrfs_header_level(dst_parent) - 1, &first_key);
+	if (IS_ERR(dst_eb)) {
+		ret = PTR_ERR(dst_eb);
+		goto out;
+	}
+
+	if (!extent_buffer_uptodate(src_eb) || !extent_buffer_uptodate(dst_eb)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	level = btrfs_header_level(dst_eb);
+	dst_path = btrfs_alloc_path();
+	if (!dst_path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* For dst_path */
+	extent_buffer_get(dst_eb);
+	dst_path->nodes[level] = dst_eb;
+	dst_path->slots[level] = 0;
+	dst_path->locks[level] = 0;
+
+	/* Do the generation-aware depth-first search */
+	ret = qgroup_trace_new_subtree_blocks(trans, src_eb, dst_path, level,
+					      level, last_snapshot);
+	if (ret < 0)
+		goto out;
+	ret = 0;
+
+out:
+	free_extent_buffer(src_eb);
+	free_extent_buffer(dst_eb);
+	btrfs_free_path(dst_path);
+	if (ret < 0)
+		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
+	return ret;
+}
+
 int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			       struct extent_buffer *root_eb,
 			       u64 root_gen, int root_level)
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 54b8bb282c0e..9f9943dfd493 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -236,6 +236,16 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
 int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			       struct extent_buffer *root_eb,
 			       u64 root_gen, int root_level);
+
+/*
+ * Inform qgroup to trace the subtree swap used in balance.
+ * Unlike btrfs_qgroup_trace_subtree(), this function will only trace
+ * new tree blocks whose generation is equal to (or larger than) @last_snapshot.
+ */
+int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
+				struct extent_buffer *src_parent, int src_slot,
+				struct extent_buffer *dst_parent, int dst_slot,
+				u64 last_snapshot);
 int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 				u64 num_bytes, struct ulist *old_roots,
 				struct ulist *new_roots);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 8783a1776540..07ab61a740ae 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1879,14 +1879,9 @@ int replace_path(struct btrfs_trans_handle *trans,
 		 *    and tree block numbers, if current trans doesn't free
 		 *    data reloc tree inode.
 		 */
-		ret = btrfs_qgroup_trace_subtree(trans, parent,
-				btrfs_header_generation(parent),
-				btrfs_header_level(parent));
-		if (ret < 0)
-			break;
-		ret = btrfs_qgroup_trace_subtree(trans, path->nodes[level],
-				btrfs_header_generation(path->nodes[level]),
-				btrfs_header_level(path->nodes[level]));
+		ret = btrfs_qgroup_trace_subtree_swap(trans, parent, slot,
+				path->nodes[level], path->slots[level],
+				last_snapshot);
 		if (ret < 0)
 			break;
 
-- 
2.18.0


* [PATCH v2 5/5] btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
  2018-09-07  9:32 [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
                   ` (3 preceding siblings ...)
  2018-09-07  9:32 ` [PATCH v2 4/5] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents Qu Wenruo
@ 2018-09-07  9:32 ` Qu Wenruo
  2018-09-11  2:43 ` [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
  5 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2018-09-07  9:32 UTC (permalink / raw)
  To: linux-btrfs

The reloc tree doesn't contribute to qgroup numbers, as we have already
accounted for it at balance time (see replace_path()).

Skipping such unneeded subtree tracing should reduce some performance
overhead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-tree.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index de6f75f5547b..4588153f414c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8643,7 +8643,13 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 			parent = 0;
 		}
 
-		if (need_account) {
+		/*
+		 * Reloc tree doesn't contribute to qgroup numbers, and
+		 * we have already accounted it at merge time (replace_path),
+		 * thus we can skip the expensive subtree trace here.
+		 */
+		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
+		    need_account) {
 			ret = btrfs_qgroup_trace_subtree(trans, next,
 							 generation, level - 1);
 			if (ret) {
-- 
2.18.0


* Re: [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance
  2018-09-07  9:32 [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
                   ` (4 preceding siblings ...)
  2018-09-07  9:32 ` [PATCH v2 5/5] btrfs: qgroup: Don't trace subtree if we're dropping reloc tree Qu Wenruo
@ 2018-09-11  2:43 ` Qu Wenruo
  2018-09-11  8:46   ` David Sterba
  5 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2018-09-11  2:43 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs


On 2018/9/7 5:32 PM, Qu Wenruo wrote:
> [...]
> A real-world benchmark is under way.

Did a small-scale test (with the latest submitted patch "btrfs:
delayed-ref: Introduce new parameter for btrfs_add_delayed_tree_ref() to
reduce unnecessary qgroup tracing").

4K nodesize fs (to bump tree sizes), with around 4G of data copied from
/usr and /lib (so the number of files should be large enough).

The VM uses unsafe cache mode for its qcow2 file, and the backing device
is a Samsung 850 EVO SATA SSD. (The host has enough RAM, so most IO
should be as fast as RAM.)

Then, for a metadata-only balance:

                     | Before          | After       | Diff
--------------------------------------------------------------------------
relocated extents    | 21112           | 22916       | +8.5%
qgroup dirty extents | 213831          | 140731      | -30.0%
time (sys)           | 7.828s          | 5.818s      | -25.7%
time (real)          | 10.004s         | 7.768s      | -22.3%

I'll report back with an even larger fs and more subvolumes/snapshots.

Thanks,
Qu


* Re: [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance
  2018-09-11  2:43 ` [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
@ 2018-09-11  8:46   ` David Sterba
  0 siblings, 0 replies; 8+ messages in thread
From: David Sterba @ 2018-09-11  8:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On Tue, Sep 11, 2018 at 10:43:32AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/9/7 5:32 PM, Qu Wenruo wrote:
> > [...]
> 
> Did a small-scale test (with the latest submitted patch "btrfs:
> delayed-ref: Introduce new parameter for btrfs_add_delayed_tree_ref() to
> reduce unnecessary qgroup tracing").
> 
> 4K nodesize fs (to bump tree sizes), with around 4G of data copied from
> /usr and /lib (so the number of files should be large enough).
> 
> The VM uses unsafe cache mode for its qcow2 file, and the backing device
> is a Samsung 850 EVO SATA SSD. (The host has enough RAM, so most IO
> should be as fast as RAM.)
> 
> Then, for a metadata-only balance:
> 
>                      | Before          | After       | Diff
> --------------------------------------------------------------------------
> relocated extents    | 21112           | 22916       | +8.5%
> qgroup dirty extents | 213831          | 140731      | -30.0%
> time (sys)           | 7.828s          | 5.818s      | -25.7%
> time (real)          | 10.004s         | 7.768s      | -22.3%

Thanks, this looks good. I'd speculate that systems where the IO is not
memory-backed will also benefit from the reduced count of dirty
extents. But the memory-backed IO results look good on their own.

I'll add the patches to for-next; feel free to send more testing results
or updates. Thanks.


Thread overview: 8+ messages
2018-09-07  9:32 [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
2018-09-07  9:32 ` [PATCH v2 1/5] btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted Qu Wenruo
2018-09-07  9:32 ` [PATCH v2 2/5] btrfs: qgroup: Introduce function to trace two swapped extents Qu Wenruo
2018-09-07  9:32 ` [PATCH v2 3/5] btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree Qu Wenruo
2018-09-07  9:32 ` [PATCH v2 4/5] btrfs: qgroup: Use generation aware subtree swap to mark dirty extents Qu Wenruo
2018-09-07  9:32 ` [PATCH v2 5/5] btrfs: qgroup: Don't trace subtree if we're dropping reloc tree Qu Wenruo
2018-09-11  2:43 ` [PATCH v2 0/5] btrfs: qgroup: Skip unrelated tree blocks for balance Qu Wenruo
2018-09-11  8:46   ` David Sterba
