* [PATCH RFC 0/3] btrfs: qgroup: Pause and resume qgroup for subvolume removal
@ 2019-07-31 10:23 Qu Wenruo
  2019-07-31 10:23 ` [PATCH RFC 1/3] btrfs: Refactor btrfs_clean_one_deleted_snapshot() to use fs_info instead of root Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Qu Wenruo @ 2019-07-31 10:23 UTC (permalink / raw)
  To: linux-btrfs

Btrfs qgroup has a big performance overhead for certain snapshot drops.

The current btrfs_drop_snapshot() tries its best to find the highest
shared node, and just drops one ref of that common node.
This behavior is good for minimizing extent tree modifications, but a
disaster for qgroup.

Example:
      Root 300  Root 301
        A         B
        | \     / |
        |    X    |
        | /     \ |
        C         D
      /   \     /   \
     E     F    G    H
    / \   / \  / \   / \
   I   J K  L  M  N O   P

In the above case, if we're dropping root 301, btrfs_drop_snapshot() will
only drop one ref each for tree blocks C and D.

But for qgroup, tree blocks E~P also have their owner changed, from
both 300 and 301 to 300 only.

Currently we use btrfs_qgroup_trace_subtree() to manually re-dirty tree
blocks E~P. And since such an ownership change happens in one transaction
for each ref drop, we can't spread the overhead across different
transactions.

This can flood qgroup extent records, hugely damaging performance or
even causing OOM due to too many qgroup extent records.


Unfortunately, unlike the previous qgroup + balance optimization, we
can't really do much to help here.
So to make the problem less severe, let's just pause qgroup for the
snapshot drop, resume it afterwards, and queue a rescan automatically.

For an independent subvolume or a small snapshot, we don't really need
such a pause/rescan workaround, as regular per-tree-block dropping is
pretty qgroup friendly.

Qu Wenruo (3):
  btrfs: Refactor btrfs_clean_one_deleted_snapshot() to use fs_info
    instead of root
  btrfs: qgroup: Introduce quota pause infrastructure
  btrfs: Pause and resume qgroup for snapshot drop

 fs/btrfs/ctree.h       |  3 +-
 fs/btrfs/disk-io.c     |  9 ++---
 fs/btrfs/extent-tree.c | 13 ++++++-
 fs/btrfs/qgroup.c      | 79 ++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/qgroup.h      |  3 ++
 fs/btrfs/relocation.c  |  4 +--
 fs/btrfs/transaction.c |  8 ++---
 fs/btrfs/transaction.h |  2 +-
 8 files changed, 106 insertions(+), 15 deletions(-)

-- 
2.22.0



* [PATCH RFC 1/3] btrfs: Refactor btrfs_clean_one_deleted_snapshot() to use fs_info instead of root
  2019-07-31 10:23 [PATCH RFC 0/3] btrfs: qgroup: Pause and resume qgroup for subvolume removal Qu Wenruo
@ 2019-07-31 10:23 ` Qu Wenruo
  2019-07-31 10:23 ` [PATCH RFC 2/3] btrfs: qgroup: Introduce quota pause infrastructure Qu Wenruo
  2019-07-31 10:23 ` [PATCH RFC 3/3] btrfs: Pause and resume qgroup for snapshot drop Qu Wenruo
  2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2019-07-31 10:23 UTC (permalink / raw)
  To: linux-btrfs

btrfs_clean_one_deleted_snapshot() grabs the root it needs from fs_info,
so there is no need to pass a @root parameter.

Also refactor its sole caller, cleaner_kthread(), to use fs_info instead
of root.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c     | 7 +++----
 fs/btrfs/transaction.c | 4 ++--
 fs/btrfs/transaction.h | 2 +-
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index deb74a8c191a..9c11ac5edef5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1689,8 +1689,7 @@ static void end_workqueue_fn(struct btrfs_work *work)
 
 static int cleaner_kthread(void *arg)
 {
-	struct btrfs_root *root = arg;
-	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_fs_info *fs_info = arg;
 	int again;
 
 	while (1) {
@@ -1723,7 +1722,7 @@ static int cleaner_kthread(void *arg)
 
 		btrfs_run_delayed_iputs(fs_info);
 
-		again = btrfs_clean_one_deleted_snapshot(root);
+		again = btrfs_clean_one_deleted_snapshot(fs_info);
 		mutex_unlock(&fs_info->cleaner_mutex);
 
 		/*
@@ -3124,7 +3123,7 @@ int open_ctree(struct super_block *sb,
 		goto fail_sysfs;
 	}
 
-	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
+	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, fs_info,
 					       "btrfs-cleaner");
 	if (IS_ERR(fs_info->cleaner_kthread))
 		goto fail_sysfs;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3f6811cdf803..48d3b5123129 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2296,10 +2296,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
  * because btrfs_commit_super will poke cleaner thread and it will process it a
  * few seconds later.
  */
-int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root)
+int btrfs_clean_one_deleted_snapshot(struct btrfs_fs_info *fs_info)
 {
+	struct btrfs_root *root;
 	int ret;
-	struct btrfs_fs_info *fs_info = root->fs_info;
 
 	spin_lock(&fs_info->trans_lock);
 	if (list_empty(&fs_info->dead_roots)) {
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 78c446c222b7..e72e6cf919e2 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -190,7 +190,7 @@ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid);
 
 void btrfs_add_dead_root(struct btrfs_root *root);
 int btrfs_defrag_root(struct btrfs_root *root);
-int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root);
+int btrfs_clean_one_deleted_snapshot(struct btrfs_fs_info *fs_info);
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans);
 int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
 				   int wait_for_unblock);
-- 
2.22.0



* [PATCH RFC 2/3] btrfs: qgroup: Introduce quota pause infrastructure
  2019-07-31 10:23 [PATCH RFC 0/3] btrfs: qgroup: Pause and resume qgroup for subvolume removal Qu Wenruo
  2019-07-31 10:23 ` [PATCH RFC 1/3] btrfs: Refactor btrfs_clean_one_deleted_snapshot() to use fs_info instead of root Qu Wenruo
@ 2019-07-31 10:23 ` Qu Wenruo
  2019-07-31 10:23 ` [PATCH RFC 3/3] btrfs: Pause and resume qgroup for snapshot drop Qu Wenruo
  2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2019-07-31 10:23 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces the following functions to provide a way to
pause/resume quota:
- btrfs_quota_pause()
- btrfs_quota_is_paused()
- btrfs_quota_resume()

btrfs_quota_pause() pauses qgroup and marks the qgroup status as
inconsistent, while still keeping the on-disk status as qgroup enabled.

In paused status, newer delayed refs will not be accounted for qgroup,
so there is no qgroup overhead at transaction commit time.

While paused, the btrfs qgroup status item and qgroup root are kept
as-is; the status item still has STATUS_FLAG_ON, so the next mount will
still enable qgroup (although with the STATUS_FLAG_INCONSISTENT flag).

The main purpose is to allow btrfs to skip certain qgroup-heavy
workloads and trigger a rescan later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c | 79 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/qgroup.h |  3 ++
 2 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 3e6ffbbd8b0a..be51ccb8a1b4 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -22,6 +22,8 @@
 #include "extent_io.h"
 #include "qgroup.h"
 
+/* In-memory only flag for paused qgroup */
+#define BTRFS_QGROUP_STATUS_FLAG_PAUSED		(1 << 31)
 
 /* TODO XXX FIXME
  *  - subvol delete -> delete when ref goes to 0? delete limits also?
@@ -795,12 +797,16 @@ static int update_qgroup_status_item(struct btrfs_trans_handle *trans)
 	struct btrfs_key key;
 	struct extent_buffer *l;
 	struct btrfs_qgroup_status_item *ptr;
+	u64 on_disk_flag_mask;
 	int ret;
 	int slot;
 
 	key.objectid = 0;
 	key.type = BTRFS_QGROUP_STATUS_KEY;
 	key.offset = 0;
+	on_disk_flag_mask = BTRFS_QGROUP_STATUS_FLAG_ON |
+			    BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT |
+			    BTRFS_QGROUP_STATUS_FLAG_RESCAN;
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -816,7 +822,8 @@ static int update_qgroup_status_item(struct btrfs_trans_handle *trans)
 	l = path->nodes[0];
 	slot = path->slots[0];
 	ptr = btrfs_item_ptr(l, slot, struct btrfs_qgroup_status_item);
-	btrfs_set_qgroup_status_flags(l, ptr, fs_info->qgroup_flags);
+	btrfs_set_qgroup_status_flags(l, ptr, fs_info->qgroup_flags &
+					      on_disk_flag_mask);
 	btrfs_set_qgroup_status_generation(l, ptr, trans->transid);
 	btrfs_set_qgroup_status_rescan(l, ptr,
 				fs_info->qgroup_rescan_progress.objectid);
@@ -1115,6 +1122,72 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
+/*
+ * Pause qgroup temporarily.
+ *
+ * Mark qgroup inconsistent and update the qgroup tree to reflect it.
+ * Then clear the BTRFS_FS_QUOTA_ENABLED flag of fs_info and clean up any
+ * existing qgroup structures.
+ * But still leave the on-disk qgroup status as quota enabled, so when a
+ * user mounts such a qgroup-paused fs, it still shows qgroup enabled.
+ */
+int btrfs_quota_pause(struct btrfs_trans_handle *trans)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_delayed_ref_root *delayed_refs;
+	int ret = 0;
+
+	delayed_refs = &trans->transaction->delayed_refs;
+	if (test_and_clear_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
+		if (!fs_info->quota_root)
+			return 0;
+
+		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT |
+					 BTRFS_QGROUP_STATUS_FLAG_PAUSED;
+		ret = update_qgroup_status_item(trans);
+
+		/* Clean up dirty qgroups */
+		spin_lock(&fs_info->qgroup_lock);
+		while (!list_empty(&fs_info->dirty_qgroups))
+			list_del_init(fs_info->dirty_qgroups.next);
+		spin_unlock(&fs_info->qgroup_lock);
+
+		/* Cleanup qgroup traced extents by running accounting now */
+		spin_lock(&delayed_refs->lock);
+		btrfs_qgroup_account_extents(trans);
+		spin_unlock(&delayed_refs->lock);
+
+		if (ret)
+			return ret;
+	}
+	return ret;
+}
+
+bool btrfs_quota_is_paused(struct btrfs_fs_info *fs_info)
+{
+	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
+	    fs_info->quota_root && fs_info->qgroup_flags &
+	    BTRFS_QGROUP_STATUS_FLAG_PAUSED)
+		return true;
+	return false;
+}
+
+/*
+ * Resume previously paused quota.
+ *
+ * This function will re-set the BTRFS_FS_QUOTA_ENABLED bit and queue
+ * the qgroup rescan work.
+ */
+int btrfs_quota_resume(struct btrfs_fs_info *fs_info)
+{
+	if (!fs_info->quota_root)
+		return 0;
+	if (!(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_PAUSED))
+		return 0;
+	set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
+	return btrfs_qgroup_rescan(fs_info);
+}
+
 static void qgroup_dirty(struct btrfs_fs_info *fs_info,
 			 struct btrfs_qgroup *qgroup)
 {
@@ -2497,7 +2570,9 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 	u64 num_dirty_extents = 0;
 	u64 qgroup_to_skip;
 	int ret = 0;
+	bool need_search_roots;
 
+	need_search_roots = test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
 	delayed_refs = &trans->transaction->delayed_refs;
 	qgroup_to_skip = delayed_refs->qgroup_to_skip;
 	while ((node = rb_first(&delayed_refs->dirty_extent_root))) {
@@ -2507,7 +2582,7 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 		num_dirty_extents++;
 		trace_btrfs_qgroup_account_extents(fs_info, record);
 
-		if (!ret) {
+		if (!ret && need_search_roots) {
 			/*
 			 * Old roots should be searched when inserting qgroup
 			 * extent record
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 46ba7bd2961c..334941146a2e 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -234,6 +234,9 @@ struct btrfs_qgroup {
 
 int btrfs_quota_enable(struct btrfs_fs_info *fs_info);
 int btrfs_quota_disable(struct btrfs_fs_info *fs_info);
+int btrfs_quota_pause(struct btrfs_trans_handle *trans);
+bool btrfs_quota_is_paused(struct btrfs_fs_info *fs_info);
+int btrfs_quota_resume(struct btrfs_fs_info *fs_info);
 int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
 void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
 int btrfs_qgroup_wait_for_completion(struct btrfs_fs_info *fs_info,
-- 
2.22.0



* [PATCH RFC 3/3] btrfs: Pause and resume qgroup for snapshot drop
  2019-07-31 10:23 [PATCH RFC 0/3] btrfs: qgroup: Pause and resume qgroup for subvolume removal Qu Wenruo
  2019-07-31 10:23 ` [PATCH RFC 1/3] btrfs: Refactor btrfs_clean_one_deleted_snapshot() to use fs_info instead of root Qu Wenruo
  2019-07-31 10:23 ` [PATCH RFC 2/3] btrfs: qgroup: Introduce quota pause infrastructure Qu Wenruo
@ 2019-07-31 10:23 ` Qu Wenruo
  2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2019-07-31 10:23 UTC (permalink / raw)
  To: linux-btrfs

Btrfs qgroup has a big performance overhead for certain snapshot drops.

The current btrfs_drop_snapshot() tries its best to find the highest
shared node, and just drops one ref of that common node.
This behavior is good for minimizing extent tree modifications, but a
disaster for qgroup.

Example:
      Root 300  Root 301
        A         B
        | \     / |
        |    X    |
        | /     \ |
        C         D
      /   \     /   \
     E     F    G    H
    / \   / \  / \   / \
   I   J K  L  M  N O   P

In the above case, if we're dropping root 301, btrfs_drop_snapshot() will
only drop one ref each for tree blocks C and D.

But for qgroup, tree blocks E~P also have their owner changed, from
both 300 and 301 to 300 only.

Currently we use btrfs_qgroup_trace_subtree() to manually re-dirty tree
blocks E~P. And since such an ownership change happens in one transaction
for each ref drop, we can't spread the overhead across different
transactions.

This can flood qgroup extent records, hugely damaging performance or
even causing OOM due to too many qgroup extent records.

This patch tries to solve the problem in a different way: since we can't
split the overhead into different transactions, instead of doing such
heavy work we just pause qgroup at the very beginning of
btrfs_drop_snapshot().

Then the later owner changes won't trigger qgroup subtree tracing, and
after all subvolumes get removed, we just trigger a qgroup rescan to
rebuild the qgroup accounting.

Also, to cooperate with the previous qgroup + balance optimization, we
don't need to pause qgroup for balance; thus introduce a new parameter
for btrfs_drop_snapshot() so that callers can indicate balance is
already handled by that other optimization.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h       |  3 ++-
 fs/btrfs/disk-io.c     |  2 ++
 fs/btrfs/extent-tree.c | 13 ++++++++++++-
 fs/btrfs/relocation.c  |  4 ++--
 fs/btrfs/transaction.c |  4 ++--
 5 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0a61dff27f57..ef6db258e75d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3016,7 +3016,8 @@ static inline int btrfs_next_item(struct btrfs_root *root, struct btrfs_path *p)
 int btrfs_leaf_free_space(struct extent_buffer *leaf);
 int __must_check btrfs_drop_snapshot(struct btrfs_root *root,
 				     struct btrfs_block_rsv *block_rsv,
-				     int update_ref, int for_reloc);
+				     int update_ref, int for_reloc,
+				     int pause_qgroup);
 int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
 			struct btrfs_root *root,
 			struct extent_buffer *node,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9c11ac5edef5..106b2bfff794 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1747,6 +1747,8 @@ static int cleaner_kthread(void *arg)
 		if (kthread_should_stop())
 			return 0;
 		if (!again) {
+			if (btrfs_quota_is_paused(fs_info))
+				btrfs_quota_resume(fs_info);
 			set_current_state(TASK_INTERRUPTIBLE);
 			schedule();
 			__set_current_state(TASK_RUNNING);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5faf057f6f37..a0b5de3216c8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9236,7 +9236,7 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
  */
 int btrfs_drop_snapshot(struct btrfs_root *root,
 			 struct btrfs_block_rsv *block_rsv, int update_ref,
-			 int for_reloc)
+			 int for_reloc, int pause_qgroup)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_path *path;
@@ -9343,6 +9343,17 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 		}
 	}
 
+	/*
+	 * If this subvolume is shared and its tree is tall, pause qgroup
+	 * to avoid a flood of qgroup extent records caused by the sudden
+	 * subtree owner change in one transaction.
+	 */
+	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
+	    pause_qgroup && (root->root_key.offset ||
+	     btrfs_root_last_snapshot(&root->root_item)) &&
+	    btrfs_header_level(root->node))
+		btrfs_quota_pause(trans);
+
 	wc->restarted = test_bit(BTRFS_ROOT_DEAD_TREE, &root->state);
 	wc->level = level;
 	wc->shared_level = -1;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 22a3c69864fa..ceacf1155495 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2190,14 +2190,14 @@ static int clean_dirty_subvols(struct reloc_control *rc)
 			root->reloc_root = NULL;
 			if (reloc_root) {
 
-				ret2 = btrfs_drop_snapshot(reloc_root, NULL, 0, 1);
+				ret2 = btrfs_drop_snapshot(reloc_root, NULL, 0, 1, 0);
 				if (ret2 < 0 && !ret)
 					ret = ret2;
 			}
 			btrfs_put_fs_root(root);
 		} else {
 			/* Orphan reloc tree, just clean it up */
-			ret2 = btrfs_drop_snapshot(root, NULL, 0, 1);
+			ret2 = btrfs_drop_snapshot(root, NULL, 0, 1, 0);
 			if (ret2 < 0 && !ret)
 				ret = ret2;
 		}
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 48d3b5123129..7fe318318e28 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2317,9 +2317,9 @@ int btrfs_clean_one_deleted_snapshot(struct btrfs_fs_info *fs_info)
 
 	if (btrfs_header_backref_rev(root->node) <
 			BTRFS_MIXED_BACKREF_REV)
-		ret = btrfs_drop_snapshot(root, NULL, 0, 0);
+		ret = btrfs_drop_snapshot(root, NULL, 0, 0, 1);
 	else
-		ret = btrfs_drop_snapshot(root, NULL, 1, 0);
+		ret = btrfs_drop_snapshot(root, NULL, 1, 0, 1);
 
 	return (ret < 0) ? 0 : 1;
 }
-- 
2.22.0


