* [PATCH v3 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153
@ 2020-07-08  6:24 Qu Wenruo
  2020-07-08  6:24 ` [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-07-08  6:24 UTC (permalink / raw)
  To: linux-btrfs

Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
btrfs/153 always fails with early EDQUOT.

This is caused by the fact that:
- We always reserve data space even for NODATACOW buffered writes
  This is mostly to improve performance, and is not practical to revert.

- Btrfs qgroup data and metadata reserved space share the same limit
  So EDQUOT is not guaranteed to come just from the data reservation;
  the metadata reservation can also hit EDQUOT, which means we can't use
  the same solution as that commit.

This patchset will solve it by doing extra qgroup space flushing when
EDQUOT is hit.
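
Roughly, each reservation path becomes a reserve, flush, retry sequence
(a simplified sketch of what patch 2 implements, error handling
trimmed):

	ret = qgroup_reserve_data(...);
	if (ret == -EDQUOT) {
		/* Flush dirty inodes, wait for ordered extents, commit */
		ret = try_flush_qgroup(root);
		if (!ret)
			ret = qgroup_reserve_data(...);
	}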

This is a little like what we do in the ticketed space reservation
system, but since there are very limited ways for qgroups to reclaim
space, it's currently still handled in the qgroup realm, not reusing
the ticketing system yet.

With this, the patchset solves the btrfs/153 problem while still
keeping btrfs qgroup space usage under the limit.

The only cost is that, when we're near the qgroup limit, we will cause
more dirty inode flushes and transaction commits, much like what we do
when metadata space is nearly exhausted.
So the cost should still be acceptable.

Changelog:
v2:
- Use existing ulist infrastructure
  Thanks to advice from Josef, we can just iterate the ulist nodes
  inside the failure range to remove the offending nodes.
  And since an already dirtied range won't show up in the current extent
  changeset, we won't clear existing ranges.
  This saves around 100 new lines.

- Remove the use of "Revert" in the 3rd cleanup patch
  We're not reverting to fix some regression, but just removing a
  deprecated mechanism which was designed to partially solve the
  problem.

v3:
- Fix a variable shadowing bug
  In the 2nd patch, there was a variable shadowing bug which overrode
  the @start and @len parameters.

- Update the commit message for the first patch
  To make the cause more obvious

- Use cleaner if condition before try_flush_qgroup()

Qu Wenruo (3):
  btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range
    it just set without releasing other existing ranges
  btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
  btrfs: qgroup: remove the ASYNC_COMMIT mechanism in favor of qgroup 
    reserve retry-after-EDQUOT

 fs/btrfs/ctree.h       |   6 +-
 fs/btrfs/disk-io.c     |   2 +-
 fs/btrfs/qgroup.c      | 234 ++++++++++++++++++++++++++++++-----------
 fs/btrfs/transaction.c |   1 -
 fs/btrfs/transaction.h |  14 ---
 5 files changed, 172 insertions(+), 85 deletions(-)

-- 
2.27.0



* [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges
  2020-07-08  6:24 [PATCH v3 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 Qu Wenruo
@ 2020-07-08  6:24 ` Qu Wenruo
  2020-07-08 14:09   ` Josef Bacik
  2020-07-09 16:02   ` David Sterba
  2020-07-08  6:24 ` [PATCH v3 2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT Qu Wenruo
  2020-07-08  6:24 ` [PATCH v3 3/3] btrfs: qgroup: remove the ASYNC_COMMIT mechanism in favor of qgroup reserve retry-after-EDQUOT Qu Wenruo
  2 siblings, 2 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-07-08  6:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

[PROBLEM]
Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
reserved space of the changeset.

For example:
	ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);

If the last btrfs_qgroup_reserve_data() failed, it will release all [0,
3M) range.

This behavior is kinda OK for now, as when we hit -EDQUOT, we normally
go error handling and need to release all reserved ranges anyway.

But this also means the following call is not possible:
	ret = btrfs_qgroup_reserve_data();
	if (ret == -EDQUOT) {
		/* Do something to free some qgroup space */
		ret = btrfs_qgroup_reserve_data();
	}

As if the first btrfs_qgroup_reserve_data() fails, it will free all
reserved qgroup space.

[CAUSE]
This is because we release all reserved ranges when
btrfs_qgroup_reserve_data() fails.

[FIX]
This patch will implement a new function, qgroup_revert(), to iterate
through the ulist nodes, to find any nodes in the failure range, and
remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and decrease
the extent_changeset::bytes_changed, so that we can revert to previous
status.

This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
happens.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c | 90 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 75 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index ee0ad33b659c..84a452dea3f9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3447,6 +3447,71 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
 	}
 }
 
+static int qgroup_revert(struct btrfs_inode *inode,
+			struct extent_changeset *reserved, u64 start,
+			u64 len)
+{
+	struct rb_node *n = reserved->range_changed.root.rb_node;
+	struct ulist_node *entry = NULL;
+	int ret = 0;
+
+	while (n) {
+		entry = rb_entry(n, struct ulist_node, rb_node);
+		if (entry->val < start)
+			n = n->rb_right;
+		else if (entry->val > start)
+			n = n->rb_left;
+		else
+			break;
+	}
+	/* Empty changeset */
+	if (!entry)
+		goto out;
+
+	if (entry->val > start && rb_prev(&entry->rb_node))
+		entry = rb_entry(rb_prev(&entry->rb_node), struct ulist_node,
+				 rb_node);
+
+	n = &entry->rb_node;
+	while (n) {
+		struct rb_node *tmp = rb_next(n);
+		u64 entry_start;
+		u64 entry_end;
+		u64 entry_len;
+		int clear_ret;
+
+		entry = rb_entry(n, struct ulist_node, rb_node);
+		entry_start = entry->val;
+		entry_end = entry->aux;
+		entry_len = entry_end - entry_start + 1;
+
+		if (entry_start >= start + len)
+			break;
+		if (entry_start + entry_len <= start)
+			goto next;
+		/*
+		 * Now the entry is in [start, start + len), revert the
+		 * EXTENT_QGROUP_RESERVED bit.
+		 */
+		clear_ret = clear_extent_bits(&inode->io_tree, entry_start,
+					      entry_end, EXTENT_QGROUP_RESERVED);
+		if (!ret && clear_ret < 0)
+			ret = clear_ret;
+
+		ulist_del(&reserved->range_changed, entry->val, entry->aux);
+		if (likely(reserved->bytes_changed >= entry_len)) {
+			reserved->bytes_changed -= entry_len;
+		} else {
+			WARN_ON(1);
+			reserved->bytes_changed = 0;
+		}
+next:
+		n = tmp;
+	}
+out:
+	return ret;
+}
+
 /*
  * Reserve qgroup space for range [start, start + len).
  *
@@ -3457,18 +3522,14 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
 * Return <0 for error (including -EDQUOT)
  *
  * NOTE: this function may sleep for memory allocation.
- *       if btrfs_qgroup_reserve_data() is called multiple times with
- *       same @reserved, caller must ensure when error happens it's OK
- *       to free *ALL* reserved space.
  */
 int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
 			struct extent_changeset **reserved_ret, u64 start,
 			u64 len)
 {
 	struct btrfs_root *root = inode->root;
-	struct ulist_node *unode;
-	struct ulist_iterator uiter;
 	struct extent_changeset *reserved;
+	bool new_reserved = false;
 	u64 orig_reserved;
 	u64 to_reserve;
 	int ret;
@@ -3481,6 +3542,7 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
 	if (WARN_ON(!reserved_ret))
 		return -EINVAL;
 	if (!*reserved_ret) {
+		new_reserved = true;
 		*reserved_ret = extent_changeset_alloc();
 		if (!*reserved_ret)
 			return -ENOMEM;
@@ -3496,7 +3558,7 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
 	trace_btrfs_qgroup_reserve_data(&inode->vfs_inode, start, len,
 					to_reserve, QGROUP_RESERVE);
 	if (ret < 0)
-		goto cleanup;
+		goto out;
 	ret = qgroup_reserve(root, to_reserve, true, BTRFS_QGROUP_RSV_DATA);
 	if (ret < 0)
 		goto cleanup;
@@ -3504,15 +3566,13 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
 	return ret;
 
 cleanup:
-	/* cleanup *ALL* already reserved ranges */
-	ULIST_ITER_INIT(&uiter);
-	while ((unode = ulist_next(&reserved->range_changed, &uiter)))
-		clear_extent_bit(&inode->io_tree, unode->val,
-				 unode->aux, EXTENT_QGROUP_RESERVED, 0, 0, NULL);
-	/* Also free data bytes of already reserved one */
-	btrfs_qgroup_free_refroot(root->fs_info, root->root_key.objectid,
-				  orig_reserved, BTRFS_QGROUP_RSV_DATA);
-	extent_changeset_release(reserved);
+	qgroup_revert(inode, reserved, start, len);
+out:
+	if (new_reserved) {
+		extent_changeset_release(reserved);
+		kfree(reserved);
+		*reserved_ret = NULL;
+	}
 	return ret;
 }
 
-- 
2.27.0



* [PATCH v3 2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
  2020-07-08  6:24 [PATCH v3 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 Qu Wenruo
  2020-07-08  6:24 ` [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges Qu Wenruo
@ 2020-07-08  6:24 ` Qu Wenruo
  2020-07-08 14:10   ` Josef Bacik
  2020-07-09 16:32   ` David Sterba
  2020-07-08  6:24 ` [PATCH v3 3/3] btrfs: qgroup: remove the ASYNC_COMMIT mechanism in favor of qgroup reserve retry-after-EDQUOT Qu Wenruo
  2 siblings, 2 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-07-08  6:24 UTC (permalink / raw)
  To: linux-btrfs

[PROBLEM]
There are known problems related to how btrfs handles qgroup reserved
space.
One of the most obvious cases is the test case btrfs/153, which does
fallocate, then writes into the preallocated range.

  btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
      --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
      +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
      @@ -1,2 +1,5 @@
       QA output created by 153
      +pwrite: Disk quota exceeded
      +/mnt/scratch/testfile2: Disk quota exceeded
      +/mnt/scratch/testfile2: Disk quota exceeded
       Silence is golden
      ...
      (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)

[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.

Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.

For a preallocated extent, we have already reserved qgroup data space
at fallocate time, and since we also reserve qgroup data space at
buffered write time, writing into the preallocated range needs twice
the space.

This leads to the -EDQUOT in buffered write routine.
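
As an illustration (numbers made up): with a 1GiB qgroup limit, an
fallocate of 600MiB reserves 600MiB of qgroup data space, and a later
buffered write into that range reserves another 600MiB at write time.
The combined 1.2GiB reservation exceeds the limit and triggers -EDQUOT,
even though the file will only ever occupy 600MiB on disk.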

And we can't follow the same solution: unlike the data/metadata space
check, qgroup reserved space is shared between data and metadata.
The EDQUOT can happen at the metadata reservation, so doing the
NODATACOW check after a qgroup reservation failure is not a solution.

[FIX]
To solve the problem, we don't return -EDQUOT directly; instead, every
time we get an -EDQUOT, we try to flush qgroup space by:
- Flush all inodes of the root
  NODATACOW writes will free their qgroup reservation at run_delalloc_range().
  However, we don't have the infrastructure to flush only NODATACOW
  inodes, so here we flush all inodes anyway.

- Wait ordered extents
  This would convert the preallocated metadata space into per-trans
  metadata, which can be freed in later transaction commit.

- Commit transaction
  This will free all per-trans metadata space.

Also, we don't want qgroup flushes to race with one another, so here we
introduce a per-root mutex: if there is a running qgroup flush, we wait
for it to end instead of starting a new one.

Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h   |   1 +
 fs/btrfs/disk-io.c |   1 +
 fs/btrfs/qgroup.c  | 101 +++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 95 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4dd478b4fe3a..891f47c7891f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1162,6 +1162,7 @@ struct btrfs_root {
 	spinlock_t qgroup_meta_rsv_lock;
 	u64 qgroup_meta_rsv_pertrans;
 	u64 qgroup_meta_rsv_prealloc;
+	struct mutex qgroup_flushing_mutex;
 
 	/* Number of active swapfiles */
 	atomic_t nr_swapfiles;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c27022f13150..0116e0b487c9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1116,6 +1116,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
 	mutex_init(&root->delalloc_mutex);
+	mutex_init(&root->qgroup_flushing_mutex);
 	init_waitqueue_head(&root->log_writer_wait);
 	init_waitqueue_head(&root->log_commit_wait[0]);
 	init_waitqueue_head(&root->log_commit_wait[1]);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 84a452dea3f9..207eb52f9d80 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3513,17 +3513,59 @@ static int qgroup_revert(struct btrfs_inode *inode,
 }
 
 /*
- * Reserve qgroup space for range [start, start + len).
+ * Try to free some space for qgroup.
  *
- * This function will either reserve space from related qgroups or do
- * nothing if the range is already reserved.
+ * For qgroup, there are only 3 ways to free qgroup space:
+ * - Flush nodatacow write
+ *   Any nodatacow write will free its reserved data space at
+ *   run_delalloc_range().
+ *   In theory, we should only flush nodatacow inodes, but it's
+ *   not yet possible, so we need to flush the whole root.
  *
- * Return 0 for successful reserve
- * Return <0 for error (including -EDQUOT)
+ * - Wait for ordered extents
+ *   When ordered extents are finished, their reserved metadata
+ *   is finally converted to per_trans status, which can be freed
+ *   by a later transaction commit.
  *
- * NOTE: this function may sleep for memory allocation.
+ * - Commit transaction
+ *   This would free the meta_per_trans space.
+ *   In theory this shouldn't provide much space, but every bit of extra
+ *   qgroup space helps.
  */
-int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
+static int try_flush_qgroup(struct btrfs_root *root)
+{
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	/*
+	 * We don't want to run flush again and again, so if there is a running
+	 * one, we won't try to start a new flush, but exit directly.
+	 */
+	ret = mutex_trylock(&root->qgroup_flushing_mutex);
+	if (!ret) {
+		mutex_lock(&root->qgroup_flushing_mutex);
+		mutex_unlock(&root->qgroup_flushing_mutex);
+		return 0;
+	}
+
+	ret = btrfs_start_delalloc_snapshot(root);
+	if (ret < 0)
+		goto unlock;
+	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto unlock;
+	}
+
+	ret = btrfs_commit_transaction(trans);
+unlock:
+	mutex_unlock(&root->qgroup_flushing_mutex);
+	return ret;
+}
+
+static int qgroup_reserve_data(struct btrfs_inode *inode,
 			struct extent_changeset **reserved_ret, u64 start,
 			u64 len)
 {
@@ -3576,6 +3618,34 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
 	return ret;
 }
 
+/*
+ * Reserve qgroup space for range [start, start + len).
+ *
+ * This function will either reserve space from related qgroups or do
+ * nothing if the range is already reserved.
+ *
+ * Return 0 for successful reserve
+ * Return <0 for error (including -EDQUOT)
+ *
+ * NOTE: This function may sleep for memory allocation, dirty page flushing
+ *	 and transaction commit, so the caller should not hold any dirty page
+ *	 locked.
+ */
+int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
+			struct extent_changeset **reserved_ret, u64 start,
+			u64 len)
+{
+	int ret;
+
+	ret = qgroup_reserve_data(inode, reserved_ret, start, len);
+	if (ret <= 0 && ret != -EDQUOT)
+		return ret;
+
+	ret = try_flush_qgroup(inode->root);
+	if (ret < 0)
+		return ret;
+	return qgroup_reserve_data(inode, reserved_ret, start, len);
+}
+
 /* Free ranges specified by @reserved, normally in error path */
 static int qgroup_free_reserved_data(struct btrfs_inode *inode,
 			struct extent_changeset *reserved, u64 start, u64 len)
@@ -3744,7 +3814,7 @@ static int sub_root_meta_rsv(struct btrfs_root *root, int num_bytes,
 	return num_bytes;
 }
 
-int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+static int qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 				enum btrfs_qgroup_rsv_type type, bool enforce)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -3771,6 +3841,21 @@ int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 	return ret;
 }
 
+int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+				enum btrfs_qgroup_rsv_type type, bool enforce)
+{
+	int ret;
+
+	ret = qgroup_reserve_meta(root, num_bytes, type, enforce);
+	if (ret <= 0 && ret != -EDQUOT)
+		return ret;
+
+	ret = try_flush_qgroup(root);
+	if (ret < 0)
+		return ret;
+	return qgroup_reserve_meta(root, num_bytes, type, enforce);
+}
+
 void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
-- 
2.27.0



* [PATCH v3 3/3] btrfs: qgroup: remove the ASYNC_COMMIT mechanism in favor of qgroup reserve retry-after-EDQUOT
  2020-07-08  6:24 [PATCH v3 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 Qu Wenruo
  2020-07-08  6:24 ` [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges Qu Wenruo
  2020-07-08  6:24 ` [PATCH v3 2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT Qu Wenruo
@ 2020-07-08  6:24 ` Qu Wenruo
  2020-07-09 16:38   ` David Sterba
  2 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-07-08  6:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

commit a514d63882c3 ("btrfs: qgroup: Commit transaction in advance to
reduce early EDQUOT") tries to reduce the early EDQUOT problems by
checking the free qgroup space against a threshold and trying to wake
up the commit kthread to free some space.

The problem with that mechanism is that it can only free qgroup
per-trans metadata space; it can't do anything for data or prealloc
qgroup space.

Now that we have the ability to flush qgroup space and the
retry-after-EDQUOT behavior, that mechanism is completely superseded.

So this patch removes that mechanism in favor of
retry-after-EDQUOT.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h       |  5 -----
 fs/btrfs/disk-io.c     |  1 -
 fs/btrfs/qgroup.c      | 43 ++----------------------------------------
 fs/btrfs/transaction.c |  1 -
 fs/btrfs/transaction.h | 14 --------------
 5 files changed, 2 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 891f47c7891f..373567c168ac 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -545,11 +545,6 @@ enum {
 	 * (device replace, resize, device add/delete, balance)
 	 */
 	BTRFS_FS_EXCL_OP,
-	/*
-	 * To info transaction_kthread we need an immediate commit so it
-	 * doesn't need to wait for commit_interval
-	 */
-	BTRFS_FS_NEED_ASYNC_COMMIT,
 	/*
 	 * Indicate that balance has been set up from the ioctl and is in the
 	 * main phase. The fs_info::balance_ctl is initialized.
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0116e0b487c9..3b80d88e6ab2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1746,7 +1746,6 @@ static int transaction_kthread(void *arg)
 
 		now = ktime_get_seconds();
 		if (cur->state < TRANS_STATE_COMMIT_START &&
-		    !test_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags) &&
 		    (now < cur->start_time ||
 		     now - cur->start_time < fs_info->commit_interval)) {
 			spin_unlock(&fs_info->trans_lock);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 207eb52f9d80..78d89dbdeba5 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -11,7 +11,6 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/btrfs.h>
-#include <linux/sizes.h>
 
 #include "ctree.h"
 #include "transaction.h"
@@ -2895,20 +2894,8 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 	return ret;
 }
 
-/*
- * Two limits to commit transaction in advance.
- *
- * For RATIO, it will be 1/RATIO of the remaining limit as threshold.
- * For SIZE, it will be in byte unit as threshold.
- */
-#define QGROUP_FREE_RATIO		32
-#define QGROUP_FREE_SIZE		SZ_32M
-static bool qgroup_check_limits(struct btrfs_fs_info *fs_info,
-				const struct btrfs_qgroup *qg, u64 num_bytes)
+static bool qgroup_check_limits(const struct btrfs_qgroup *qg, u64 num_bytes)
 {
-	u64 free;
-	u64 threshold;
-
 	if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
 	    qgroup_rsv_total(qg) + (s64)qg->rfer + num_bytes > qg->max_rfer)
 		return false;
@@ -2917,32 +2904,6 @@ static bool qgroup_check_limits(struct btrfs_fs_info *fs_info,
 	    qgroup_rsv_total(qg) + (s64)qg->excl + num_bytes > qg->max_excl)
 		return false;
 
-	/*
-	 * Even if we passed the check, it's better to check if reservation
-	 * for meta_pertrans is pushing us near limit.
-	 * If there is too much pertrans reservation or it's near the limit,
-	 * let's try commit transaction to free some, using transaction_kthread
-	 */
-	if ((qg->lim_flags & (BTRFS_QGROUP_LIMIT_MAX_RFER |
-			      BTRFS_QGROUP_LIMIT_MAX_EXCL))) {
-		if (qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) {
-			free = qg->max_excl - qgroup_rsv_total(qg) - qg->excl;
-			threshold = min_t(u64, qg->max_excl / QGROUP_FREE_RATIO,
-					  QGROUP_FREE_SIZE);
-		} else {
-			free = qg->max_rfer - qgroup_rsv_total(qg) - qg->rfer;
-			threshold = min_t(u64, qg->max_rfer / QGROUP_FREE_RATIO,
-					  QGROUP_FREE_SIZE);
-		}
-
-		/*
-		 * Use transaction_kthread to commit transaction, so we no
-		 * longer need to bother nested transaction nor lock context.
-		 */
-		if (free < threshold)
-			btrfs_commit_transaction_locksafe(fs_info);
-	}
-
 	return true;
 }
 
@@ -2990,7 +2951,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce,
 
 		qg = unode_aux_to_qgroup(unode);
 
-		if (enforce && !qgroup_check_limits(fs_info, qg, num_bytes)) {
+		if (enforce && !qgroup_check_limits(qg, num_bytes)) {
 			ret = -EDQUOT;
 			goto out;
 		}
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index b359d4b17658..d0a6150bf82d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2351,7 +2351,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	 */
 	cur_trans->state = TRANS_STATE_COMPLETED;
 	wake_up(&cur_trans->commit_wait);
-	clear_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags);
 
 	spin_lock(&fs_info->trans_lock);
 	list_del_init(&cur_trans->list);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 6f65fff6cf50..0b18d25aa9b6 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -208,20 +208,6 @@ int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root);
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans);
 int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
 				   int wait_for_unblock);
-
-/*
- * Try to commit transaction asynchronously, so this is safe to call
- * even holding a spinlock.
- *
- * It's done by informing transaction_kthread to commit transaction without
- * waiting for commit interval.
- */
-static inline void btrfs_commit_transaction_locksafe(
-		struct btrfs_fs_info *fs_info)
-{
-	set_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags);
-	wake_up_process(fs_info->transaction_kthread);
-}
 int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans);
 int btrfs_should_end_transaction(struct btrfs_trans_handle *trans);
 void btrfs_throttle(struct btrfs_fs_info *fs_info);
-- 
2.27.0



* Re: [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges
  2020-07-08  6:24 ` [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges Qu Wenruo
@ 2020-07-08 14:09   ` Josef Bacik
  2020-07-08 23:21     ` Qu Wenruo
  2020-07-09 16:02   ` David Sterba
  1 sibling, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2020-07-08 14:09 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 7/8/20 2:24 AM, Qu Wenruo wrote:
> [PROBLEM]
> Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
> reserved space of the changeset.
> 
> For example:
> 	ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
> 	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
> 	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
> 
> If the last btrfs_qgroup_reserve_data() failed, it will release all [0,
> 3M) range.
> 
> This behavior is kinda OK for now, as when we hit -EDQUOT, we normally
> go error handling and need to release all reserved ranges anyway.
> 
> But this also means the following call is not possible:
> 	ret = btrfs_qgroup_reserve_data();
> 	if (ret == -EDQUOT) {
> 		/* Do something to free some qgroup space */
> 		ret = btrfs_qgroup_reserve_data();
> 	}
> 
> As if the first btrfs_qgroup_reserve_data() fails, it will free all
> reserved qgroup space.
> 
> [CAUSE]
> This is because we release all reserved ranges when
> btrfs_qgroup_reserve_data() fails.
> 
> [FIX]
> This patch will implement a new function, qgroup_revert(), to iterate
> through the ulist nodes, to find any nodes in the failure range, and
> remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and decrease
> the extent_changeset::bytes_changed, so that we can revert to previous
> status.
> 
> This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
> happens.
> 
> Suggested-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/qgroup.c | 90 +++++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 75 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index ee0ad33b659c..84a452dea3f9 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -3447,6 +3447,71 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
>   	}
>   }
>   
> +static int qgroup_revert(struct btrfs_inode *inode,
> +			struct extent_changeset *reserved, u64 start,
> +			u64 len)
> +{
> +	struct rb_node *n = reserved->range_changed.root.rb_node;
> +	struct ulist_node *entry = NULL;
> +	int ret = 0;
> +
> +	while (n) {
> +		entry = rb_entry(n, struct ulist_node, rb_node);
> +		if (entry->val < start)
> +			n = n->rb_right;
> +		else if (entry->val > start)
> +			n = n->rb_left;
> +		else
> +			break;
> +	}
> +	/* Empty changeset */
> +	if (!entry)
> +		goto out;

Don't need the goto out here, just return ret;

> +
> +	if (entry->val > start && rb_prev(&entry->rb_node))
> +		entry = rb_entry(rb_prev(&entry->rb_node), struct ulist_node,
> +				 rb_node);
> +
> +	n = &entry->rb_node;
> +	while (n) {

for (n = &entry->rb_node; n; n = rb_next(n)) {

Thanks,

Josef


* Re: [PATCH v3 2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
  2020-07-08  6:24 ` [PATCH v3 2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT Qu Wenruo
@ 2020-07-08 14:10   ` Josef Bacik
  2020-07-09 16:32   ` David Sterba
  1 sibling, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2020-07-08 14:10 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 7/8/20 2:24 AM, Qu Wenruo wrote:
> [PROBLEM]
> There are known problem related to how btrfs handles qgroup reserved
> space.
> One of the most obvious case is the the test case btrfs/153, which do
> fallocate, then write into the preallocated range.
> 
>    btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
>        --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
>        +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
>        @@ -1,2 +1,5 @@
>         QA output created by 153
>        +pwrite: Disk quota exceeded
>        +/mnt/scratch/testfile2: Disk quota exceeded
>        +/mnt/scratch/testfile2: Disk quota exceeded
>         Silence is golden
>        ...
>        (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)
> 
> [CAUSE]
> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
> we always reserve space no matter if it's COW or not.
> 
> Such behavior change is mostly for performance, and reverting it is not
> a good idea anyway.
> 
> For a preallocated extent, we have already reserved qgroup data space
> at fallocate time, and since we also reserve qgroup data space at
> buffered write time, writing into the preallocated range needs twice
> the space.
> 
> This leads to the -EDQUOT in buffered write routine.
> 
> And we can't follow the same solution: unlike the data/metadata space
> check, qgroup reserved space is shared between data and metadata.
> The EDQUOT can happen at the metadata reservation, so doing the
> NODATACOW check after a qgroup reservation failure is not a solution.
> 
> [FIX]
> To solve the problem, we don't return -EDQUOT directly; instead, every
> time we get an -EDQUOT, we try to flush qgroup space by:
> - Flush all inodes of the root
>    NODATACOW writes will free their qgroup reservation at run_delalloc_range().
>    However, we don't have the infrastructure to flush only NODATACOW
>    inodes, so here we flush all inodes anyway.
> 
> - Wait ordered extents
>    This would convert the preallocated metadata space into per-trans
>    metadata, which can be freed in later transaction commit.
> 
> - Commit transaction
>    This will free all per-trans metadata space.
> 
> Also, we don't want qgroup flushes to race with one another, so here we
> introduce a per-root mutex: if there is a running qgroup flush, we wait
> for it to end instead of starting a new one.
> 
> Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges
  2020-07-08 14:09   ` Josef Bacik
@ 2020-07-08 23:21     ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-07-08 23:21 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2020/7/8 10:09 PM, Josef Bacik wrote:
> On 7/8/20 2:24 AM, Qu Wenruo wrote:
>> [PROBLEM]
>> Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
>> reserved space of the changeset.
>>
>> For example:
>>     ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
>>     ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
>>     ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
>>
>> If the last btrfs_qgroup_reserve_data() failed, it will release all [0,
>> 3M) range.
>>
>> This behavior is kinda OK for now, as when we hit -EDQUOT, we normally
>> go error handling and need to release all reserved ranges anyway.
>>
>> But this also means the following call is not possible:
>>     ret = btrfs_qgroup_reserve_data();
>>     if (ret == -EDQUOT) {
>>         /* Do something to free some qgroup space */
>>         ret = btrfs_qgroup_reserve_data();
>>     }
>>
>> As if the first btrfs_qgroup_reserve_data() fails, it will free all
>> reserved qgroup space.
>>
>> [CAUSE]
>> This is because we release all reserved ranges when
>> btrfs_qgroup_reserve_data() fails.
>>
>> [FIX]
>> This patch will implement a new function, qgroup_revert(), to iterate
>> through the ulist nodes, to find any nodes in the failure range, and
>> remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and decrease
>> the extent_changeset::bytes_changed, so that we can revert to previous
>> status.
>>
>> This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
>> happens.
>>
>> Suggested-by: Josef Bacik <josef@toxicpanda.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/qgroup.c | 90 +++++++++++++++++++++++++++++++++++++++--------
>>   1 file changed, 75 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index ee0ad33b659c..84a452dea3f9 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -3447,6 +3447,71 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info
>> *fs_info)
>>       }
>>   }
>>   +static int qgroup_revert(struct btrfs_inode *inode,
>> +            struct extent_changeset *reserved, u64 start,
>> +            u64 len)
>> +{
>> +    struct rb_node *n = reserved->range_changed.root.rb_node;
>> +    struct ulist_node *entry = NULL;
>> +    int ret = 0;
>> +
>> +    while (n) {
>> +        entry = rb_entry(n, struct ulist_node, rb_node);
>> +        if (entry->val < start)
>> +            n = n->rb_right;
>> +        else if (entry->val > start)
>> +            n = n->rb_left;
>> +        else
>> +            break;
>> +    }
>> +    /* Empty changeset */
>> +    if (!entry)
>> +        goto out;
>
> Don't need the goto out here, just return ret;
>
>> +
>> +    if (entry->val > start && rb_prev(&entry->rb_node))
>> +        entry = rb_entry(rb_prev(&entry->rb_node), struct ulist_node,
>> +                 rb_node);
>> +
>> +    n = &entry->rb_node;
>> +    while (n) {
>
> for (n = &entry->rb_node; n; n = rb_next(n)) {

You fell into the trap!

Since @n can be deleted from the tree, we need @tmp to record the
real next node.

But I can still use a for loop with @tmp involved, though.
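
For example, just a sketch of the idea (the range checks and
clear_extent_bits() call stay the same as in the patch):

	struct rb_node *tmp;

	for (n = &entry->rb_node; n; n = tmp) {
		/* Record the successor first, as @n may be removed below */
		tmp = rb_next(n);
		entry = rb_entry(n, struct ulist_node, rb_node);
		/* ... range checks and clear_extent_bits() as before ... */
		ulist_del(&reserved->range_changed, entry->val, entry->aux);
	}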

Thanks,
Qu
>
> Thanks,
>
> Josef


* Re: [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges
  2020-07-08  6:24 ` [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges Qu Wenruo
  2020-07-08 14:09   ` Josef Bacik
@ 2020-07-09 16:02   ` David Sterba
  2020-07-13  0:01     ` Qu Wenruo
  1 sibling, 1 reply; 13+ messages in thread
From: David Sterba @ 2020-07-09 16:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Josef Bacik

On Wed, Jul 08, 2020 at 02:24:45PM +0800, Qu Wenruo wrote:
> [PROBLEM]
> Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
> reserved space of the changeset.
> 
> For example:
> 	ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
> 	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
> 	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
> 
> If the last btrfs_qgroup_reserve_data() failed, it will release all [0,
> 3M) range.
> 
> This behavior is kinda OK for now, as when we hit -EDQUOT, we normally
> go error handling and need to release all reserved ranges anyway.
> 
> But this also means the following call is not possible:
> 	ret = btrfs_qgroup_reserve_data();
> 	if (ret == -EDQUOT) {
> 		/* Do something to free some qgroup space */
> 		ret = btrfs_qgroup_reserve_data();
> 	}
> 
> As if the first btrfs_qgroup_reserve_data() fails, it will free all
> reserved qgroup space.
> 
> [CAUSE]
> This is because we release all reserved ranges when
> btrfs_qgroup_reserve_data() fails.
> 
> [FIX]
> This patch will implement a new function, qgroup_revert(), to iterate
> through the ulist nodes, to find any nodes in the failure range, and
> remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and decrease
> the extent_changeset::bytes_changed, so that we can revert to previous
> status.
> 
> This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
> happens.
> 
> Suggested-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/qgroup.c | 90 +++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 75 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index ee0ad33b659c..84a452dea3f9 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -3447,6 +3447,71 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
>  	}
>  }
>  
> +static int qgroup_revert(struct btrfs_inode *inode,
> +			struct extent_changeset *reserved, u64 start,
> +			u64 len)

I've renamed it to qgroup_unreserve_range, as it's not clear what is
being reverted.

> +{
> +	struct rb_node *n = reserved->range_changed.root.rb_node;
> +	struct ulist_node *entry = NULL;
> +	int ret = 0;
> +
> +	while (n) {
> +		entry = rb_entry(n, struct ulist_node, rb_node);
> +		if (entry->val < start)
> +			n = n->rb_right;
> +		else if (entry->val > start)
> +			n = n->rb_left;

Please don't use single letter variables in new code unless it's 'i' for
integer indexing. 'node' is fine.

> +		else
> +			break;
> +	}
> +	/* Empty changeset */
> +	if (!entry)
> +		goto out;

Switched to return as suggested.

> +
> +	if (entry->val > start && rb_prev(&entry->rb_node))
> +		entry = rb_entry(rb_prev(&entry->rb_node), struct ulist_node,
> +				 rb_node);
> +
> +	n = &entry->rb_node;
> +	while (n) {
> +		struct rb_node *tmp = rb_next(n);

Renamed to 'next'

All non-functional changes, no need to resend.


* Re: [PATCH v3 2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
  2020-07-08  6:24 ` [PATCH v3 2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT Qu Wenruo
  2020-07-08 14:10   ` Josef Bacik
@ 2020-07-09 16:32   ` David Sterba
  2020-07-09 17:40     ` David Sterba
  2020-07-09 23:06     ` Qu Wenruo
  1 sibling, 2 replies; 13+ messages in thread
From: David Sterba @ 2020-07-09 16:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Jul 08, 2020 at 02:24:46PM +0800, Qu Wenruo wrote:
> -int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
> +static int try_flush_qgroup(struct btrfs_root *root)
> +{
> +	struct btrfs_trans_handle *trans;
> +	int ret;
> +
> +	/*
> +	 * We don't want to run flush again and again, so if there is a running
> +	 * one, we won't try to start a new flush, but exit directly.
> +	 */
> +	ret = mutex_trylock(&root->qgroup_flushing_mutex);
> +	if (!ret) {
> +		mutex_lock(&root->qgroup_flushing_mutex);
> +		mutex_unlock(&root->qgroup_flushing_mutex);

This is abuse of mutex, for status tracking "is somebody flushing" and
for waiting until it's over.

We have many root::status bits (the BTRFS_ROOT_* namespace), so
"qgroup flushing is running" should be another one. The bit is
atomically set when flushing starts and cleared when it ends.

All waiting tasks should queue in a normal wait_queue_head.

> +		return 0;
> +	}
> +
> +	ret = btrfs_start_delalloc_snapshot(root);
> +	if (ret < 0)
> +		goto unlock;
> +	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
> +
> +	trans = btrfs_join_transaction(root);
> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		goto unlock;
> +	}
> +
> +	ret = btrfs_commit_transaction(trans);
> +unlock:
> +	mutex_unlock(&root->qgroup_flushing_mutex);

And also the whole wait/join/commit combo is in one huge mutex, that's
really an anti-pattern.


* Re: [PATCH v3 3/3] btrfs: qgroup: remove the ASYNC_COMMIT mechanism in favor of qgroup reserve retry-after-EDQUOT
  2020-07-08  6:24 ` [PATCH v3 3/3] btrfs: qgroup: remove the ASYNC_COMMIT mechanism in favor of qgroup reserve retry-after-EDQUOT Qu Wenruo
@ 2020-07-09 16:38   ` David Sterba
  0 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2020-07-09 16:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Josef Bacik

On Wed, Jul 08, 2020 at 02:24:47PM +0800, Qu Wenruo wrote:
> commit a514d63882c3 ("btrfs: qgroup: Commit transaction in advance to
> reduce early EDQUOT") tries to reduce the early EDQUOT problems by
> checking the free qgroup space against a threshold and trying to wake
> up the commit kthread to free some space.
> 
> The problem with that mechanism is that it can only free qgroup
> per-trans metadata space; it can't do anything for data or prealloc
> qgroup space.
> 
> Now that we have the ability to flush qgroup space and the
> retry-after-EDQUOT behavior, that mechanism is completely superseded.
> 
> So this patch removes that mechanism in favor of
> retry-after-EDQUOT.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>

That one looks good to me.


* Re: [PATCH v3 2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
  2020-07-09 16:32   ` David Sterba
@ 2020-07-09 17:40     ` David Sterba
  2020-07-09 23:06     ` Qu Wenruo
  1 sibling, 0 replies; 13+ messages in thread
From: David Sterba @ 2020-07-09 17:40 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs

On Thu, Jul 09, 2020 at 06:32:46PM +0200, David Sterba wrote:
> On Wed, Jul 08, 2020 at 02:24:46PM +0800, Qu Wenruo wrote:
> > -int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
> > +static int try_flush_qgroup(struct btrfs_root *root)
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	int ret;
> > +
> > +	/*
> > +	 * We don't want to run flush again and again, so if there is a running
> > +	 * one, we won't try to start a new flush, but exit directly.
> > +	 */
> > +	ret = mutex_trylock(&root->qgroup_flushing_mutex);
> > +	if (!ret) {
> > +		mutex_lock(&root->qgroup_flushing_mutex);
> > +		mutex_unlock(&root->qgroup_flushing_mutex);
> 
> This is abuse of mutex, for status tracking "is somebody flushing" and
> for waiting until it's over.
> 
> We have many root::status bits (the BTRFS_ROOT_* namespace), so
> "qgroup flushing is running" should be another one. The bit is
> atomically set when flushing starts and cleared when it ends.
> 
> All waiting tasks should queue in a normal wait_queue_head.

Something like that (untested):

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cc1fcaa7cfa..5dbb6b7300b7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1007,6 +1007,12 @@ enum {
 	BTRFS_ROOT_DEAD_TREE,
 	/* The root has a log tree. Used only for subvolume roots. */
 	BTRFS_ROOT_HAS_LOG_TREE,
+
+	/*
+	 * Indicate that qgroup flushing is in progress to prevent multiple
+	 * processes attempting that
+	 */
+	BTRFS_ROOT_QGROUP_FLUSHING,
 };
 
 /*
@@ -1159,7 +1165,7 @@ struct btrfs_root {
 	spinlock_t qgroup_meta_rsv_lock;
 	u64 qgroup_meta_rsv_pertrans;
 	u64 qgroup_meta_rsv_prealloc;
-	struct mutex qgroup_flushing_mutex;
+	wait_queue_head_t qgroup_flush_wait;
 
 	/* Number of active swapfiles */
 	atomic_t nr_swapfiles;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8029127df537..e124e3376208 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1116,7 +1116,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
 	mutex_init(&root->delalloc_mutex);
-	mutex_init(&root->qgroup_flushing_mutex);
+	init_waitqueue_head(&root->qgroup_flush_wait);
 	init_waitqueue_head(&root->log_writer_wait);
 	init_waitqueue_head(&root->log_commit_wait[0]);
 	init_waitqueue_head(&root->log_commit_wait[1]);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 494ab2b1bbf2..95695aca7aa9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3503,10 +3503,9 @@ static int try_flush_qgroup(struct btrfs_root *root)
 	 * We don't want to run flush again and again, so if there is a running
 	 * one, we won't try to start a new flush, but exit directly.
 	 */
-	ret = mutex_trylock(&root->qgroup_flushing_mutex);
-	if (!ret) {
-		mutex_lock(&root->qgroup_flushing_mutex);
-		mutex_unlock(&root->qgroup_flushing_mutex);
+	if (test_and_set_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state)) {
+		wait_event(root->qgroup_flush_wait,
+			!test_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state));
 		return 0;
 	}
 
@@ -3523,7 +3522,7 @@ static int try_flush_qgroup(struct btrfs_root *root)
 
 	ret = btrfs_commit_transaction(trans);
 unlock:
-	mutex_unlock(&root->qgroup_flushing_mutex);
+	clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state);
 	return ret;
 }
 


* Re: [PATCH v3 2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
  2020-07-09 16:32   ` David Sterba
  2020-07-09 17:40     ` David Sterba
@ 2020-07-09 23:06     ` Qu Wenruo
  1 sibling, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-07-09 23:06 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs





On 2020/7/10 12:32 AM, David Sterba wrote:
> On Wed, Jul 08, 2020 at 02:24:46PM +0800, Qu Wenruo wrote:
>> -int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
>> +static int try_flush_qgroup(struct btrfs_root *root)
>> +{
>> +	struct btrfs_trans_handle *trans;
>> +	int ret;
>> +
>> +	/*
>> +	 * We don't want to run flush again and again, so if there is a running
>> +	 * one, we won't try to start a new flush, but exit directly.
>> +	 */
>> +	ret = mutex_trylock(&root->qgroup_flushing_mutex);
>> +	if (!ret) {
>> +		mutex_lock(&root->qgroup_flushing_mutex);
>> +		mutex_unlock(&root->qgroup_flushing_mutex);
> 
> This is abuse of mutex, for status tracking "is somebody flushing" and
> for waiting until it's over.
> 
> We have many root::status bits (the BTRFS_ROOT_* namespace), so
> "qgroup flushing is running" should be another one. The bit is
> atomically set when flushing starts and cleared when it ends.

In fact, I want to avoid plain wait_queue usage if possible.

Unlike a mutex, a wait_queue doesn't have a good debug mechanism like
lockdep.

I see no reason why re-implementing the existing mutex code ourselves
would bring any benefit here.

It may look like an abuse of mutex, but I could wrap it into something
like wait_or_lock_mutex(), which may slightly improve the readability.

Or am I missing anything?

> 
> All waiting tasks should queue in a normal wait_queue_head.
> 
>> +		return 0;
>> +	}
>> +
>> +	ret = btrfs_start_delalloc_snapshot(root);
>> +	if (ret < 0)
>> +		goto unlock;
>> +	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
>> +
>> +	trans = btrfs_join_transaction(root);
>> +	if (IS_ERR(trans)) {
>> +		ret = PTR_ERR(trans);
>> +		goto unlock;
>> +	}
>> +
>> +	ret = btrfs_commit_transaction(trans);
>> +unlock:
>> +	mutex_unlock(&root->qgroup_flushing_mutex);
> 
> And also the whole wait/join/commit combo is in one huge mutex, that's
> really an anti-pattern.
> 
But that mutex is per-root, and this is the slow path.

Converting to wait_queue won't change the pattern either.

Thanks,
Qu




* Re: [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges
  2020-07-09 16:02   ` David Sterba
@ 2020-07-13  0:01     ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-07-13  0:01 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Josef Bacik





On 2020/7/10 12:02 AM, David Sterba wrote:
> On Wed, Jul 08, 2020 at 02:24:45PM +0800, Qu Wenruo wrote:
>> [PROBLEM]
>> Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
>> reserved space of the changeset.
>>
>> For example:
>> 	ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
>> 	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
>> 	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
>>
>> If the last btrfs_qgroup_reserve_data() failed, it will release all [0,
>> 3M) range.
>>
>> This behavior is kinda OK for now, as when we hit -EDQUOT, we normally
>> go error handling and need to release all reserved ranges anyway.
>>
>> But this also means the following call is not possible:
>> 	ret = btrfs_qgroup_reserve_data();
>> 	if (ret == -EDQUOT) {
>> 		/* Do something to free some qgroup space */
>> 		ret = btrfs_qgroup_reserve_data();
>> 	}
>>
>> As if the first btrfs_qgroup_reserve_data() fails, it will free all
>> reserved qgroup space.
>>
>> [CAUSE]
>> This is because we release all reserved ranges when
>> btrfs_qgroup_reserve_data() fails.
>>
>> [FIX]
>> This patch will implement a new function, qgroup_revert(), to iterate
>> through the ulist nodes, to find any nodes in the failure range, and
>> remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and decrease
>> the extent_changeset::bytes_changed, so that we can revert to previous
>> status.
>>
>> This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
>> happens.
>>
>> Suggested-by: Josef Bacik <josef@toxicpanda.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/qgroup.c | 90 +++++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 75 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index ee0ad33b659c..84a452dea3f9 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -3447,6 +3447,71 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
>>  	}
>>  }
>>  
>> +static int qgroup_revert(struct btrfs_inode *inode,
>> +			struct extent_changeset *reserved, u64 start,
>> +			u64 len)
> 
> I've renamed it to qgroup_unreserve_range, as it's not clear what is
> being reverted.
> 
>> +{
>> +	struct rb_node *n = reserved->range_changed.root.rb_node;
>> +	struct ulist_node *entry = NULL;
>> +	int ret = 0;
>> +
>> +	while (n) {
>> +		entry = rb_entry(n, struct ulist_node, rb_node);
>> +		if (entry->val < start)
>> +			n = n->rb_right;
>> +		else if (entry->val > start)
>> +			n = n->rb_left;
> 
> Please don't use single letter variables in new code unless it's 'i' for
> integer indexing. 'node' is fine.
> 
>> +		else
>> +			break;
>> +	}
>> +	/* Empty changeset */
>> +	if (!entry)
>> +		goto out;
> 
> Switched to return as suggested.
> 
>> +
>> +	if (entry->val > start && rb_prev(&entry->rb_node))
>> +		entry = rb_entry(rb_prev(&entry->rb_node), struct ulist_node,
>> +				 rb_node);
>> +
>> +	n = &entry->rb_node;
>> +	while (n) {
>> +		struct rb_node *tmp = rb_next(n);
> 
> Renamed to 'next'
> 
> All non-functional changes, no need to resend.
> 
I haven't seen it in misc-next yet, and considering the remaining two
patches will need some modification anyway, would you mind updating the
patchset?

Thanks,
Qu




end of thread

Thread overview: 13+ messages
2020-07-08  6:24 [PATCH v3 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 Qu Wenruo
2020-07-08  6:24 ` [PATCH v3 1/3] btrfs: qgroup: allow btrfs_qgroup_reserve_data() to revert the range it just set without releasing other existing ranges Qu Wenruo
2020-07-08 14:09   ` Josef Bacik
2020-07-08 23:21     ` Qu Wenruo
2020-07-09 16:02   ` David Sterba
2020-07-13  0:01     ` Qu Wenruo
2020-07-08  6:24 ` [PATCH v3 2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT Qu Wenruo
2020-07-08 14:10   ` Josef Bacik
2020-07-09 16:32   ` David Sterba
2020-07-09 17:40     ` David Sterba
2020-07-09 23:06     ` Qu Wenruo
2020-07-08  6:24 ` [PATCH v3 3/3] btrfs: qgroup: remove the ASYNC_COMMIT mechanism in favor of qgroup reserve retry-after-EDQUOT Qu Wenruo
2020-07-09 16:38   ` David Sterba
