* [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11
@ 2017-02-27  7:10 Qu Wenruo
  2017-02-27  7:10 ` [PATCH 1/9] btrfs: qgroup: Add trace point for qgroup reserved space Qu Wenruo
                   ` (10 more replies)
  0 siblings, 11 replies; 25+ messages in thread
From: Qu Wenruo @ 2017-02-27  7:10 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

Pull request can be fetched from my github:
https://github.com/adam900710/linux.git qgroup_fixes_for_4.11 

The base is 6288d6eabc7505f42dda34a2c2962f91914be3a4.
Author: Liu Bo <bo.li.liu@oracle.com>
Date:   Tue Feb 21 12:12:58 2017 -0800

    Btrfs: use the correct type when creating cow dio extent

These patches have been on the mailing list for almost 3 months;
the tracepoint patch, for example, was last updated on 2016/12/09.

With this patchset, btrfs now passes all qgroup tests in xfstests.

This pull request should be good for 4.11: the patches are all bug
fixes, and they have already been delayed several times.

I don't know whether this patchset will be delayed yet again because
someone wants to clean up something else first, creating rebase
conflicts that push the fixes back even further.
But I must say it is very frustrating to see bug fixes dropped again
and again just because of new cleanups and a lack of reviewers.

Despite all that, this pull request includes:
1) Fix for the inode_cache mount option
   Although hardly anyone uses the inode_cache mount option, it has
   been corrupting qgroup numbers for a long time.
   It breaks not only the qgroup tests that use fixed golden output,
   but also the new test cases that use btrfsck to verify qgroups.

2) Fix for the btrfs/104 kernel warning
   This is caused by enabling quota while dirty buffers have not yet
   been written to disk.

   Fixed by checking the EXTENT_QGROUP_RESERVED flag rather than just
   decreasing qgroup->reserved.

3) Fix qgroup underflow caused by freeing ranges not reserved by the
   caller
   Mainly exposed by Chandan on PPC64.

   A buffered write can be blocked while reserving metadata, and in
   the error path we release the reserved space for both data and
   metadata.

   However, the data range may already have been reserved by another
   process writing the same page.

   In that case, the data reserved space can be freed by two
   processes, one in the error path and one in the normal write
   routine, causing underflow.

   Fixed by checking whether the data range was reserved by ourselves,
   and only freeing it if it was.
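
   The ownership check can be sketched in a tiny userspace C model
   (illustrative only, not the kernel code; names and sizes are made
   up). A per-byte map records what is currently reserved, so a second
   free of the same range releases nothing and the counter cannot
   underflow:

```c
#include <assert.h>
#include <string.h>

/* Toy model of per-range data reservation (hypothetical, not the
 * kernel implementation): each byte of a small "file" tracks whether
 * it is currently reserved. */
#define FILE_SIZE 64

struct qg_model {
	unsigned char reserved_map[FILE_SIZE]; /* 1 = byte is reserved */
	long reserved_bytes;                   /* qgroup->reserved analogue */
};

static long model_reserve(struct qg_model *m, int start, int len)
{
	long newly = 0;

	for (int i = start; i < start + len; i++) {
		if (!m->reserved_map[i]) {
			m->reserved_map[i] = 1;
			newly++;
		}
	}
	m->reserved_bytes += newly;
	return newly;
}

/* Only drops bytes that are actually marked reserved, so a racing
 * second free (e.g. an error path vs. normal writeback) is a no-op. */
static long model_free(struct qg_model *m, int start, int len)
{
	long freed = 0;

	for (int i = start; i < start + len; i++) {
		if (m->reserved_map[i]) {
			m->reserved_map[i] = 0;
			freed++;
		}
	}
	m->reserved_bytes -= freed;
	return freed;
}
```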

Updates since 2016/12/09:
  Rebased onto the latest for-linux-4.11.

  Added missing Reviewed-by and Tested-by tags.

  Added more comments to btrfs_qgroup_reserve_data() about its error
  handling.

  Added more comments to qgroup_free_reserved_data() in preparation
  for a later enhancement (comment changes only, no functional
  change).

Qu Wenruo (9):
  btrfs: qgroup: Add trace point for qgroup reserved space
  btrfs: qgroup: Re-arrange tracepoint timing to co-operate with
    reserved space tracepoint
  btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount
    option
  btrfs: qgroup: Add quick exit for non-fs extents
  btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
  btrfs: qgroup: Return actually freed bytes for qgroup release or free
    data
  btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered
    write and quota enable
  btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
  btrfs: qgroup: Fix qgroup reserved space underflow by only freeing
    reserved ranges

 fs/btrfs/ctree.h             |  12 ++-
 fs/btrfs/extent-tree.c       |  31 +++---
 fs/btrfs/extent_io.h         |  14 ++-
 fs/btrfs/file.c              |  46 +++++----
 fs/btrfs/inode-map.c         |   6 +-
 fs/btrfs/inode.c             |  64 +++++++++----
 fs/btrfs/ioctl.c             |  11 ++-
 fs/btrfs/qgroup.c            | 221 ++++++++++++++++++++++++++++++++-----------
 fs/btrfs/qgroup.h            |  14 +--
 fs/btrfs/relocation.c        |  13 ++-
 fs/btrfs/transaction.c       |  21 ++--
 include/trace/events/btrfs.h |  43 +++++++++
 12 files changed, 358 insertions(+), 138 deletions(-)

-- 
2.11.1




^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/9] btrfs: qgroup: Add trace point for qgroup reserved space
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
@ 2017-02-27  7:10 ` Qu Wenruo
  2017-03-07 15:19   ` David Sterba
  2017-03-07 16:30   ` Jeff Mahoney
  2017-02-27  7:10 ` [PATCH 2/9] btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint Qu Wenruo
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 25+ messages in thread
From: Qu Wenruo @ 2017-02-27  7:10 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

Introduce the following trace points:
qgroup_update_reserve
qgroup_meta_reserve

These trace points are handy for tracing problems related to qgroup
reserved space.
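
As an illustration, a consumer of these events could detect underflow
from the logged fields alone. The struct below mirrors the printed
fields of qgroup_update_reserve as an assumption, not the actual
ring-buffer record format:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed event layout, mirroring the TP_printk fields:
 * qgroup_update_reserve logs the counter *before* the change plus a
 * signed diff, so an underflow is visible as cur_reserved + diff < 0. */
struct qgroup_reserve_event {
	uint64_t qgid;
	uint64_t cur_reserved; /* value before applying diff */
	int64_t diff;          /* positive = reserve, negative = free */
};

static int event_underflows(const struct qgroup_reserve_event *ev)
{
	/* freeing more than is currently reserved underflows */
	return ev->diff < 0 && ev->cur_reserved < (uint64_t)(-ev->diff);
}
```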

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/qgroup.c            | 15 +++++++++++++++
 include/trace/events/btrfs.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a5da750c1087..b303d4794026 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1075,6 +1075,8 @@ static int __qgroup_excl_accounting(struct btrfs_fs_info *fs_info,
 	qgroup->excl += sign * num_bytes;
 	qgroup->excl_cmpr += sign * num_bytes;
 	if (sign > 0) {
+		trace_qgroup_update_reserve(fs_info, qgroup->qgroupid,
+					    qgroup->reserved, -(s64)num_bytes);
 		if (WARN_ON(qgroup->reserved < num_bytes))
 			report_reserved_underflow(fs_info, qgroup, num_bytes);
 		else
@@ -1100,6 +1102,9 @@ static int __qgroup_excl_accounting(struct btrfs_fs_info *fs_info,
 		WARN_ON(sign < 0 && qgroup->excl < num_bytes);
 		qgroup->excl += sign * num_bytes;
 		if (sign > 0) {
+			trace_qgroup_update_reserve(fs_info, qgroup->qgroupid,
+						    qgroup->reserved,
+						    -(s64)num_bytes);
 			if (WARN_ON(qgroup->reserved < num_bytes))
 				report_reserved_underflow(fs_info, qgroup,
 							  num_bytes);
@@ -2424,6 +2429,8 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce)
 
 		qg = unode_aux_to_qgroup(unode);
 
+		trace_qgroup_update_reserve(fs_info, qg->qgroupid,
+					    qg->reserved, num_bytes);
 		qg->reserved += num_bytes;
 	}
 
@@ -2469,6 +2476,8 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
 
 		qg = unode_aux_to_qgroup(unode);
 
+		trace_qgroup_update_reserve(fs_info, qg->qgroupid,
+					    qg->reserved, -(s64)num_bytes);
 		if (WARN_ON(qg->reserved < num_bytes))
 			report_reserved_underflow(fs_info, qg, num_bytes);
 		else
@@ -2945,6 +2954,8 @@ int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 		return 0;
 
 	BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
+	trace_qgroup_meta_reserve(fs_info, root->objectid,
+				  (s64)num_bytes);
 	ret = qgroup_reserve(root, num_bytes, enforce);
 	if (ret < 0)
 		return ret;
@@ -2964,6 +2975,8 @@ void btrfs_qgroup_free_meta_all(struct btrfs_root *root)
 	reserved = atomic_xchg(&root->qgroup_meta_rsv, 0);
 	if (reserved == 0)
 		return;
+	trace_qgroup_meta_reserve(root->fs_info, root->objectid,
+				  -(s64)reserved);
 	btrfs_qgroup_free_refroot(fs_info, root->objectid, reserved);
 }
 
@@ -2978,6 +2991,8 @@ void btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes)
 	BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
 	WARN_ON(atomic_read(&root->qgroup_meta_rsv) < num_bytes);
 	atomic_sub(num_bytes, &root->qgroup_meta_rsv);
+	trace_qgroup_meta_reserve(root->fs_info, root->objectid,
+				  -(s64)num_bytes);
 	btrfs_qgroup_free_refroot(fs_info, root->objectid, num_bytes);
 }
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index a3c3cab643a9..2d5799d8a4a2 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1475,6 +1475,49 @@ TRACE_EVENT(qgroup_update_counters,
 		  __entry->cur_new_count)
 );
 
+TRACE_EVENT(qgroup_update_reserve,
+
+	TP_PROTO(struct btrfs_fs_info *fs_info, u64 qgid, u64 cur_reserved,
+		 s64 diff),
+
+	TP_ARGS(fs_info, qgid, cur_reserved, diff),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	qgid			)
+		__field(	u64,	cur_reserved		)
+		__field(	s64,	diff			)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->qgid		= qgid;
+		__entry->cur_reserved	= cur_reserved;
+		__entry->diff		= diff;
+	),
+
+	TP_printk_btrfs("qgid = %llu, cur_reserved = %llu, diff = %lld",
+		__entry->qgid, __entry->cur_reserved, __entry->diff)
+);
+
+TRACE_EVENT(qgroup_meta_reserve,
+
+	TP_PROTO(struct btrfs_fs_info *fs_info, u64 refroot, s64 diff),
+
+	TP_ARGS(fs_info, refroot, diff),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	refroot			)
+		__field(	s64,	diff			)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->refroot	= refroot;
+		__entry->diff		= diff;
+	),
+
+	TP_printk_btrfs("refroot = %llu, diff = %lld",
+		__entry->refroot, __entry->diff)
+);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.11.1





* [PATCH 2/9] btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
  2017-02-27  7:10 ` [PATCH 1/9] btrfs: qgroup: Add trace point for qgroup reserved space Qu Wenruo
@ 2017-02-27  7:10 ` Qu Wenruo
  2017-03-07 19:11   ` Jeff Mahoney
  2017-02-27  7:10 ` [PATCH 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option Qu Wenruo
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Qu Wenruo @ 2017-02-27  7:10 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

The newly introduced qgroup reserved space trace points are normally
nested inside several common qgroup operations.

However, some other trace points are not placed to cooperate well with
them, which produces confusing output.

This patch rearranges the trace_btrfs_qgroup_release_data() and
trace_btrfs_qgroup_free_delayed_ref() trace points so that they are
triggered before the reserved space ones.
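
The ordering requirement can be sketched with a toy event log (purely
illustrative; the strings stand in for the real trace points, and the
helper names are made up). The high-level event must be emitted before
the helper that emits the nested reserve-update events, otherwise the
trace reads effect-before-cause:

```c
#include <assert.h>
#include <string.h>

#define LOG_MAX 8
static const char *log_buf[LOG_MAX];
static int log_len;

static void trace(const char *ev) { log_buf[log_len++] = ev; }

static void free_refroot(void)
{
	trace("qgroup_update_reserve"); /* nested low-level event */
}

static void release_data_fixed(void)
{
	log_len = 0;
	trace("qgroup_release_data"); /* high-level event first ... */
	free_refroot();               /* ... then the nested ones */
}
```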

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/qgroup.c | 10 +++++-----
 fs/btrfs/qgroup.h |  6 +-----
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index b303d4794026..e46499cafdbf 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2895,14 +2895,14 @@ static int __btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len,
 	if (ret < 0)
 		goto out;
 
-	if (free) {
-		btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info,
-				BTRFS_I(inode)->root->objectid,
-				changeset.bytes_changed);
+	if (free)
 		trace_op = QGROUP_FREE;
-	}
 	trace_btrfs_qgroup_release_data(inode, start, len,
 					changeset.bytes_changed, trace_op);
+	if (free)
+		btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info,
+				BTRFS_I(inode)->root->objectid,
+				changeset.bytes_changed);
 out:
 	ulist_release(&changeset.range_changed);
 	return ret;
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 26932a8a1993..92608edf0e89 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -186,15 +186,11 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 			 struct btrfs_qgroup_inherit *inherit);
 void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
 			       u64 ref_root, u64 num_bytes);
-/*
- * TODO: Add proper trace point for it, as btrfs_qgroup_free() is
- * called by everywhere, can't provide good trace for delayed ref case.
- */
 static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info *fs_info,
 						 u64 ref_root, u64 num_bytes)
 {
-	btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
 	trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
+	btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
 }
 void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);
 
-- 
2.11.1





* [PATCH 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
  2017-02-27  7:10 ` [PATCH 1/9] btrfs: qgroup: Add trace point for qgroup reserved space Qu Wenruo
  2017-02-27  7:10 ` [PATCH 2/9] btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint Qu Wenruo
@ 2017-02-27  7:10 ` Qu Wenruo
  2017-03-07 19:21   ` Jeff Mahoney
  2017-02-27  7:10 ` [PATCH 4/9] btrfs: qgroup: Add quick exit for non-fs extents Qu Wenruo
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Qu Wenruo @ 2017-02-27  7:10 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

[BUG]
The easiest way to reproduce the bug is:
------
 # mkfs.btrfs -f $dev -n 16K
 # mount $dev $mnt -o inode_cache
 # btrfs quota enable $mnt
 # btrfs quota rescan -w $mnt
 # btrfs qgroup show $mnt
qgroupid         rfer         excl
--------         ----         ----
0/5          32.00KiB     32.00KiB
             ^^ Twice the correct value
------

The fstests btrfs qgroup test group can easily detect the problem with
the inode_cache mount option.
Some of the failures are false alerts, since the old test cases use
fixed golden output, while the new test cases use "btrfs check" to
detect qgroup mismatches.

[CAUSE]
The inode_cache mount option makes commit_fs_roots() call
btrfs_save_ino_cache() to update the fs/subvol trees, which generates
new delayed refs.

However, we call btrfs_qgroup_prepare_account_extents() too early,
before commit_fs_roots().
This makes the "old_roots" of the newly generated extents always NULL.
For the extent-freeing case, both new_roots and old_roots end up
empty, while the correct old_roots should not be empty.
This causes qgroup numbers not to be decreased correctly.

[FIX]
Modify the timing of calling btrfs_qgroup_prepare_account_extents() to
just before btrfs_qgroup_account_extents(), and add needed delayed_refs
handler.
So qgroup can handle inode_map mount options correctly.
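
To sketch why the early snapshot loses the decrement, here is a
minimal userspace model of the accounting rule (not kernel code; the
logic is simplified to the two extreme cases). A real free should be
seen as old_roots non-empty / new_roots empty; if old_roots was
recorded before commit_fs_roots() generated the extent, both sets look
empty and the decrement is skipped:

```c
#include <assert.h>

static long rfer; /* qgroup referenced bytes (model) */

static void account_extent(int nr_old_roots, int nr_new_roots, long bytes)
{
	if (nr_old_roots == 0 && nr_new_roots > 0)
		rfer += bytes; /* newly referenced */
	else if (nr_old_roots > 0 && nr_new_roots == 0)
		rfer -= bytes; /* fully freed */
	/* both zero: nothing to do -- this is the buggy case when
	 * old_roots was snapshotted too early for a real free */
}
```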

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/transaction.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 6b3e0fc2fe7a..1ff3ec797356 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2130,13 +2130,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		goto scrub_continue;
 	}
 
-	/* Reocrd old roots for later qgroup accounting */
-	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-	if (ret) {
-		mutex_unlock(&fs_info->reloc_mutex);
-		goto scrub_continue;
-	}
-
 	/*
 	 * make sure none of the code above managed to slip in a
 	 * delayed item
@@ -2179,6 +2172,24 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	btrfs_free_log_root_tree(trans, fs_info);
 
 	/*
+	 * commit_fs_roots() can call btrfs_save_ino_cache(), which generates
+	 * new delayed refs. Must handle them or qgroup can be wrong.
+	 */
+	ret = btrfs_run_delayed_refs(trans, fs_info, (unsigned long)-1);
+	if (ret) {
+		mutex_unlock(&fs_info->tree_log_mutex);
+		mutex_unlock(&fs_info->reloc_mutex);
+		goto scrub_continue;
+	}
+
+	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
+	if (ret) {
+		mutex_unlock(&fs_info->tree_log_mutex);
+		mutex_unlock(&fs_info->reloc_mutex);
+		goto scrub_continue;
+	}
+
+	/*
 	 * Since fs roots are all committed, we can get a quite accurate
 	 * new_roots. So let's do quota accounting.
 	 */
-- 
2.11.1





* [PATCH 4/9] btrfs: qgroup: Add quick exit for non-fs extents
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
                   ` (2 preceding siblings ...)
  2017-02-27  7:10 ` [PATCH 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option Qu Wenruo
@ 2017-02-27  7:10 ` Qu Wenruo
  2017-02-27  7:10 ` [PATCH 5/9] btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function Qu Wenruo
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2017-02-27  7:10 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

Modify btrfs_qgroup_account_extent() to exit quickly for non-fs
extents.

This also reduces the noise in the trace_btrfs_qgroup_account_extent
event.
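
A simplified userspace sketch of the quick-exit check (the objectid
constants follow btrfs on-disk conventions, FS_TREE = 5 and first
subvolume id = 256, but the real is_fstree() has extra special cases
this model omits). Roots in one ulist are either all fs/subvol trees
or all non-fs trees, so checking the first entry is enough:

```c
#include <assert.h>
#include <stddef.h>

#define FS_TREE_OBJECTID    5ULL
#define FIRST_FREE_OBJECTID 256ULL

static int is_fstree_simplified(unsigned long long rootid)
{
	return rootid == FS_TREE_OBJECTID || rootid >= FIRST_FREE_OBJECTID;
}

/* Mirrors maybe_fs_roots(): an empty list may still belong to fs
 * trees, otherwise the first entry decides. */
static int maybe_fs_roots_simplified(const unsigned long long *roots,
				     size_t n)
{
	if (!roots || n == 0)
		return 1;
	return is_fstree_simplified(roots[0]);
}
```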

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/qgroup.c | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index e46499cafdbf..bb25fb82b5c2 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1961,6 +1961,33 @@ static int qgroup_update_counters(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+/*
+ * Helper to check if the @roots is a list of fs tree roots
+ * Return 0 for definitely not a fs/subvol tree roots ulist
+ * Return 1 for possible fs/subvol tree roots ulist(including empty)
+ */
+static int maybe_fs_roots(struct ulist *roots)
+{
+	struct ulist_node *unode;
+	struct ulist_iterator uiter;
+
+	/* Empty one, still possible for fs roots */
+	if (!roots || roots->nnodes == 0)
+		return 1;
+
+	ULIST_ITER_INIT(&uiter);
+	unode = ulist_next(roots, &uiter);
+	if (!unode)
+		return 1;
+
+	/*
+	 * If it contains fs tree roots, then it must belongs to fs/subvol
+	 * trees.
+	 * If it contains non-fs tree, it won't be shared to fs/subvol trees.
+	 */
+	return is_fstree(unode->val);
+}
+
 int
 btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
 			    struct btrfs_fs_info *fs_info,
@@ -1977,10 +2004,20 @@ btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
 	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
 		return 0;
 
-	if (new_roots)
+	if (new_roots) {
+		if (!maybe_fs_roots(new_roots))
+			goto out_free;
 		nr_new_roots = new_roots->nnodes;
-	if (old_roots)
+	}
+	if (old_roots) {
+		if (!maybe_fs_roots(old_roots))
+			goto out_free;
 		nr_old_roots = old_roots->nnodes;
+	}
+
+	/* Quick exit, either not fs tree roots, or won't affect any qgroup */
+	if (nr_old_roots == 0 && nr_new_roots == 0)
+		goto out_free;
 
 	BUG_ON(!fs_info->quota_root);
 
-- 
2.11.1





* [PATCH 5/9] btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
                   ` (3 preceding siblings ...)
  2017-02-27  7:10 ` [PATCH 4/9] btrfs: qgroup: Add quick exit for non-fs extents Qu Wenruo
@ 2017-02-27  7:10 ` Qu Wenruo
  2017-02-27  7:10 ` [PATCH 6/9] btrfs: qgroup: Return actually freed bytes for qgroup release or free data Qu Wenruo
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2017-02-27  7:10 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

Quite a lot of qgroup corruption happens because
btrfs_qgroup_prepare_account_extents() is called at the wrong time.

Since the safest time to call it is just before
btrfs_qgroup_account_extents(), there is no need to keep these two
functions separate.

Merging them makes the code cleaner and less bug prone.
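
The merged flow can be sketched as a lazy lookup inside the accounting
loop (hypothetical model; lookup_old_roots() stands in for
btrfs_find_all_roots() on the commit root). With no separate prepare
pass, each record's old_roots is filled in right before it is
accounted, i.e. always at the safe time:

```c
#include <assert.h>
#include <stddef.h>

struct record {
	int have_old_roots; /* 0 = needs a commit-root lookup */
	int accounted;
};

static int lookup_old_roots(struct record *r)
{
	r->have_old_roots = 1; /* stands in for btrfs_find_all_roots() */
	return 0;
}

static int account_all(struct record *recs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* lazy prepare, merged into the accounting loop */
		if (!recs[i].have_old_roots && lookup_old_roots(&recs[i]))
			return -1;
		recs[i].accounted = 1;
	}
	return 0;
}
```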

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/qgroup.c      | 49 ++++++++++++++++---------------------------------
 fs/btrfs/qgroup.h      |  2 --
 fs/btrfs/transaction.c | 10 ----------
 3 files changed, 16 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index bb25fb82b5c2..4aabeb77d83c 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1449,38 +1449,6 @@ int btrfs_limit_qgroup(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
-					 struct btrfs_fs_info *fs_info)
-{
-	struct btrfs_qgroup_extent_record *record;
-	struct btrfs_delayed_ref_root *delayed_refs;
-	struct rb_node *node;
-	u64 qgroup_to_skip;
-	int ret = 0;
-
-	delayed_refs = &trans->transaction->delayed_refs;
-	qgroup_to_skip = delayed_refs->qgroup_to_skip;
-
-	/*
-	 * No need to do lock, since this function will only be called in
-	 * btrfs_commit_transaction().
-	 */
-	node = rb_first(&delayed_refs->dirty_extent_root);
-	while (node) {
-		record = rb_entry(node, struct btrfs_qgroup_extent_record,
-				  node);
-		if (WARN_ON(!record->old_roots))
-			ret = btrfs_find_all_roots(NULL, fs_info,
-					record->bytenr, 0, &record->old_roots);
-		if (ret < 0)
-			break;
-		if (qgroup_to_skip)
-			ulist_del(record->old_roots, qgroup_to_skip, 0);
-		node = rb_next(node);
-	}
-	return ret;
-}
-
 int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
 				struct btrfs_delayed_ref_root *delayed_refs,
 				struct btrfs_qgroup_extent_record *record)
@@ -2097,6 +2065,18 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans,
 
 		if (!ret) {
 			/*
+			 * old roots should be searched when inserting qgroup
+			 * extent record
+			 */
+			if (WARN_ON(!record->old_roots)) {
+				/* Search commit root to find old_roots */
+				ret = btrfs_find_all_roots(NULL, fs_info,
+						record->bytenr, 0,
+						&record->old_roots);
+				if (ret < 0)
+					goto cleanup;
+			}
+			/*
 			 * Use (u64)-1 as time_seq to do special search, which
 			 * doesn't lock tree or delayed_refs and search current
 			 * root. It's safe inside commit_transaction().
@@ -2105,8 +2085,11 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans,
 					record->bytenr, (u64)-1, &new_roots);
 			if (ret < 0)
 				goto cleanup;
-			if (qgroup_to_skip)
+			if (qgroup_to_skip) {
 				ulist_del(new_roots, qgroup_to_skip, 0);
+				ulist_del(record->old_roots, qgroup_to_skip,
+					  0);
+			}
 			ret = btrfs_qgroup_account_extent(trans, fs_info,
 					record->bytenr, record->num_bytes,
 					record->old_roots, new_roots);
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 92608edf0e89..d8debeee45a3 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -90,8 +90,6 @@ int btrfs_limit_qgroup(struct btrfs_trans_handle *trans,
 int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
 void btrfs_free_qgroup_config(struct btrfs_fs_info *fs_info);
 struct btrfs_delayed_extent_op;
-int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
-					 struct btrfs_fs_info *fs_info);
 /*
  * Inform qgroup to trace one dirty extent, its info is recorded in @record.
  * So qgroup can account it at transaction committing time.
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1ff3ec797356..0f590d5e74b5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1376,9 +1376,6 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
 	ret = commit_fs_roots(trans, fs_info);
 	if (ret)
 		goto out;
-	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-	if (ret < 0)
-		goto out;
 	ret = btrfs_qgroup_account_extents(trans, fs_info);
 	if (ret < 0)
 		goto out;
@@ -2182,13 +2179,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		goto scrub_continue;
 	}
 
-	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-	if (ret) {
-		mutex_unlock(&fs_info->tree_log_mutex);
-		mutex_unlock(&fs_info->reloc_mutex);
-		goto scrub_continue;
-	}
-
 	/*
 	 * Since fs roots are all committed, we can get a quite accurate
 	 * new_roots. So let's do quota accounting.
-- 
2.11.1





* [PATCH 6/9] btrfs: qgroup: Return actually freed bytes for qgroup release or free data
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
                   ` (4 preceding siblings ...)
  2017-02-27  7:10 ` [PATCH 5/9] btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function Qu Wenruo
@ 2017-02-27  7:10 ` Qu Wenruo
  2017-02-27  7:10 ` [PATCH 7/9] btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quota enable Qu Wenruo
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2017-02-27  7:10 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

btrfs_qgroup_release/free_data() only returns 0 or a negative error
number (ENOMEM is the only possible error).

This is normally good enough, but sometimes we need the exact number
of bytes that were freed/released.

Change it to return the number of bytes actually released/freed
instead of 0 on success.
Also slightly modify the related extent_changeset structure: since one
non-hole data extent in btrfs won't be larger than 128M, "unsigned
int" is large enough for this use case.
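
The changed return convention can be modeled like this (assumed
behavior with made-up helper names): on success the function returns
the bytes actually released, which fits in an int because a single
non-hole data extent is at most 128 MiB, and a negative errno on
failure:

```c
#include <assert.h>
#include <errno.h>

#define MAX_EXTENT_SIZE (128 * 1024 * 1024)

static int release_data_model(long long requested, long long reserved,
			      int fail_with_enomem)
{
	long long freed;

	if (fail_with_enomem)
		return -ENOMEM; /* the only possible error */
	/* only what was really reserved in this range is released */
	freed = requested < reserved ? requested : reserved;
	return (int)freed; /* <= 128 MiB, so an int is wide enough */
}
```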

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 fs/btrfs/extent_io.h   | 2 +-
 fs/btrfs/qgroup.c      | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c35b96633554..16b0d82dcd03 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4287,7 +4287,7 @@ int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
 
 	/* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
 	ret = btrfs_qgroup_reserve_data(inode, start, len);
-	if (ret)
+	if (ret < 0)
 		btrfs_free_reserved_data_space_noquota(inode, start, len);
 	return ret;
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 270d03be290e..b6c3c2a588d4 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -190,7 +190,7 @@ struct extent_buffer {
  */
 struct extent_changeset {
 	/* How many bytes are set/cleared in this operation */
-	u64 bytes_changed;
+	unsigned int bytes_changed;
 
 	/* Changed ranges */
 	struct ulist range_changed;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4aabeb77d83c..4b0f4d2bd63c 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2923,6 +2923,7 @@ static int __btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len,
 		btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info,
 				BTRFS_I(inode)->root->objectid,
 				changeset.bytes_changed);
+	ret = changeset.bytes_changed;
 out:
 	ulist_release(&changeset.range_changed);
 	return ret;
-- 
2.11.1





* [PATCH 7/9] btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quota enable
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
                   ` (5 preceding siblings ...)
  2017-02-27  7:10 ` [PATCH 6/9] btrfs: qgroup: Return actually freed bytes for qgroup release or free data Qu Wenruo
@ 2017-02-27  7:10 ` Qu Wenruo
  2017-02-27  7:10 ` [PATCH 8/9] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions Qu Wenruo
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2017-02-27  7:10 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

[BUG]
Under the following case, we can underflow qgroup reserved space.

            Task A                |            Task B
---------------------------------------------------------------
 Quota disabled                   |
 Buffered write                   |
 |- btrfs_check_data_free_space() |
 |  *NO* qgroup space is reserved |
 |  since quota is *DISABLED*     |
 |- All pages are copied to page  |
    cache                         |
                                  | Enable quota
                                  | Quota scan finished
                                  |
                                  | Sync_fs
                                  | |- run_delalloc_range
                                  | |- Write pages
                                  | |- btrfs_finish_ordered_io
                                  |    |- insert_reserved_file_extent
                                  |       |- btrfs_qgroup_release_data()
                                  |          Since no qgroup space is
                                             reserved in Task A, we
                                             underflow qgroup reserved
                                             space
This can be detected by fstests btrfs/104.

[CAUSE]
In insert_reserved_file_extent() we inform qgroup to release
@ram_bytes of qgroup reserved space in all cases.
And btrfs_qgroup_release_data() only checks whether qgroup is enabled.

However, in the above case the buffered write happens before quota is
enabled, so we don't have reserved space for that range.

[FIX]
In insert_reserved_file_extent(), inform qgroup to release the actual
number of bytes it released.
In the above case, since we have no reserved space, we tell qgroup to
release 0 bytes, so the problem is fixed.

Thanks to the @reserved parameter introduced by the qgroup rework, and
to the previous patch returning the released byte count, the fix is
fewer than 10 lines.
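
A caller-side sketch of the fix (illustrative numbers and made-up
names only): the byte count handed on for later qgroup freeing must be
what the release call actually found reserved, not the raw ram_bytes:

```c
#include <assert.h>

static long reserved_counter; /* model of qgroup reserved space */

/* returns how much was actually reserved for this range
 * (0 if the write happened before quota was enabled) */
static long release_data(long reserved_in_range)
{
	reserved_counter -= reserved_in_range; /* range-accurate */
	return reserved_in_range;
}

/* fixed flow: release first, then pass the *reported* amount on to
 * the delayed-ref path instead of ram_bytes */
static long insert_extent_fixed(long ram_bytes, long reserved_in_range)
{
	long qg_released = release_data(reserved_in_range);

	(void)ram_bytes; /* the buggy code used this unconditionally */
	return qg_released;
}
```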

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/inode.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c38391e948d9..aabab731d38e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2045,6 +2045,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct extent_buffer *leaf;
 	struct btrfs_key ins;
+	u64 qg_released;
 	int extent_inserted = 0;
 	int ret;
 
@@ -2100,13 +2101,17 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	ins.objectid = disk_bytenr;
 	ins.offset = disk_num_bytes;
 	ins.type = BTRFS_EXTENT_ITEM_KEY;
-	ret = btrfs_alloc_reserved_file_extent(trans, root->root_key.objectid,
-			btrfs_ino(BTRFS_I(inode)), file_pos, ram_bytes, &ins);
+
 	/*
 	 * Release the reserved range from inode dirty range map, as it is
 	 * already moved into delayed_ref_head
 	 */
-	btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+	ret = btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+	if (ret < 0)
+		goto out;
+	qg_released = ret;
+	ret = btrfs_alloc_reserved_file_extent(trans, root->root_key.objectid,
+			btrfs_ino(BTRFS_I(inode)), file_pos, qg_released, &ins);
 out:
 	btrfs_free_path(path);
 
-- 
2.11.1





* [PATCH 8/9] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
                   ` (6 preceding siblings ...)
  2017-02-27  7:10 ` [PATCH 7/9] btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quota enable Qu Wenruo
@ 2017-02-27  7:10 ` Qu Wenruo
  2017-02-27  7:10 ` [PATCH 9/9] btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges Qu Wenruo
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2017-02-27  7:10 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

Introduce a new parameter, struct extent_changeset, for
btrfs_qgroup_reserve_data() and its callers.

The extent_changeset is used in btrfs_qgroup_reserve_data() to record
which ranges the current call reserved, so the error path can free
exactly those ranges.

The reason we need to expose it to callers is that in the buffered
write error path, without knowing exactly which ranges we reserved in
the current allocation, we could free space that was not reserved by
us.

This would lead to qgroup reserved space underflow.
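
The changeset idea can be sketched in userspace C (illustrative only;
a byte map stands in for the extent io tree). reserve() records
exactly which bytes *this call* newly reserved, so the caller's error
path can undo only those and never touches ranges reserved by another
writer of the same page:

```c
#include <assert.h>
#include <string.h>

#define RANGE 32

struct changeset {
	unsigned char mine[RANGE]; /* bytes this call reserved */
	int bytes_changed;
};

static unsigned char reserved[RANGE]; /* global reservation state */

static void reserve_data(struct changeset *cs, int start, int len)
{
	memset(cs, 0, sizeof(*cs));
	for (int i = start; i < start + len; i++) {
		if (!reserved[i]) {
			reserved[i] = 1;
			cs->mine[i] = 1;
			cs->bytes_changed++;
		}
	}
}

/* error path: free only what the changeset says we reserved */
static int free_from_changeset(const struct changeset *cs)
{
	int freed = 0;

	for (int i = 0; i < RANGE; i++) {
		if (cs->mine[i]) {
			reserved[i] = 0;
			freed++;
		}
	}
	return freed;
}
```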

Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       |  6 ++++--
 fs/btrfs/extent-tree.c | 17 ++++++++++++-----
 fs/btrfs/extent_io.h   | 12 ++++++++++++
 fs/btrfs/file.c        | 17 ++++++++++++++---
 fs/btrfs/inode-map.c   |  6 +++++-
 fs/btrfs/inode.c       | 24 ++++++++++++++++++++----
 fs/btrfs/ioctl.c       |  7 ++++++-
 fs/btrfs/qgroup.c      | 33 +++++++++++++++++++++------------
 fs/btrfs/qgroup.h      |  3 ++-
 fs/btrfs/relocation.c  |  5 ++++-
 10 files changed, 100 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ad23a73ac7e7..e8cf7f2a01d6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2686,7 +2686,8 @@ enum btrfs_flush_state {
 	COMMIT_TRANS		=	6,
 };
 
-int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
+int btrfs_check_data_free_space(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
 void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
@@ -2705,7 +2706,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
 				      struct btrfs_block_rsv *rsv);
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
+int btrfs_delalloc_reserve_space(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len);
 void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 16b0d82dcd03..787f9155b754 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3354,12 +3354,14 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
 	struct btrfs_root *root = fs_info->tree_root;
 	struct inode *inode = NULL;
+	struct extent_changeset data_reserved;
 	u64 alloc_hint = 0;
 	int dcs = BTRFS_DC_ERROR;
 	u64 num_pages = 0;
 	int retries = 0;
 	int ret = 0;
 
+	extent_changeset_init(&data_reserved);
 	/*
 	 * If this block group is smaller than 100 megs don't bother caching the
 	 * block group.
@@ -3472,7 +3474,7 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
 	num_pages *= 16;
 	num_pages *= PAGE_SIZE;
 
-	ret = btrfs_check_data_free_space(inode, 0, num_pages);
+	ret = btrfs_check_data_free_space(inode, &data_reserved, 0, num_pages);
 	if (ret)
 		goto out_put;
 
@@ -3503,6 +3505,7 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
 	block_group->disk_cache_state = dcs;
 	spin_unlock(&block_group->lock);
 
+	extent_changeset_release(&data_reserved);
 	return ret;
 }
 
@@ -4271,7 +4274,8 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
  * Will replace old btrfs_check_data_free_space(), but for patch split,
  * add a new function first and then replace it.
  */
-int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
+int btrfs_check_data_free_space(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	int ret;
@@ -4286,9 +4290,11 @@ int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
 		return ret;
 
 	/* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
-	ret = btrfs_qgroup_reserve_data(inode, start, len);
+	ret = btrfs_qgroup_reserve_data(inode, reserved, start, len);
 	if (ret < 0)
 		btrfs_free_reserved_data_space_noquota(inode, start, len);
+	else
+		ret = 0;
 	return ret;
 }
 
@@ -6134,11 +6140,12 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
  * Return 0 for success
  * Return <0 for error(-ENOSPC or -EQUOT)
  */
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len)
+int btrfs_delalloc_reserve_space(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len)
 {
 	int ret;
 
-	ret = btrfs_check_data_free_space(inode, start, len);
+	ret = btrfs_check_data_free_space(inode, reserved, start, len);
 	if (ret < 0)
 		return ret;
 	ret = btrfs_delalloc_reserve_metadata(inode, len);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b6c3c2a588d4..2e6226eda4cc 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -196,6 +196,18 @@ struct extent_changeset {
 	struct ulist range_changed;
 };
 
+static inline void extent_changeset_init(struct extent_changeset *changeset)
+{
+	changeset->bytes_changed = 0;
+	ulist_init(&changeset->range_changed);
+}
+
+static inline void extent_changeset_release(struct extent_changeset *changeset)
+{
+	changeset->bytes_changed = 0;
+	ulist_release(&changeset->range_changed);
+}
+
 static inline void extent_set_compress_type(unsigned long *bio_flags,
 					    int compress_type)
 {
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 18e5146df864..aa2aaeadcb8c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1528,6 +1528,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct page **pages = NULL;
 	struct extent_state *cached_state = NULL;
+	struct extent_changeset data_reserved;
 	u64 release_bytes = 0;
 	u64 lockstart;
 	u64 lockend;
@@ -1538,6 +1539,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 	bool force_page_uptodate = false;
 	bool need_unlock;
 
+	extent_changeset_init(&data_reserved);
+
 	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
 			PAGE_SIZE / (sizeof(struct page *)));
 	nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
@@ -1575,7 +1578,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 		reserve_bytes = round_up(write_bytes + sector_offset,
 				fs_info->sectorsize);
 
-		ret = btrfs_check_data_free_space(inode, pos, write_bytes);
+		extent_changeset_release(&data_reserved);
+		ret = btrfs_check_data_free_space(inode, &data_reserved, pos,
+						  write_bytes);
 		if (ret < 0) {
 			if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
 						      BTRFS_INODE_PREALLOC)) &&
@@ -1746,6 +1751,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 		}
 	}
 
+	extent_changeset_release(&data_reserved);
 	return num_written ? num_written : ret;
 }
 
@@ -2717,6 +2723,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 {
 	struct inode *inode = file_inode(file);
 	struct extent_state *cached_state = NULL;
+	struct extent_changeset data_reserved;
 	struct falloc_range *range;
 	struct falloc_range *tmp;
 	struct list_head reserve_list;
@@ -2731,6 +2738,8 @@ static long btrfs_fallocate(struct file *file, int mode,
 	int blocksize = btrfs_inode_sectorsize(inode);
 	int ret;
 
+	extent_changeset_init(&data_reserved);
+
 	alloc_start = round_down(offset, blocksize);
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;
@@ -2848,10 +2857,11 @@ static long btrfs_fallocate(struct file *file, int mode,
 				free_extent_map(em);
 				break;
 			}
-			ret = btrfs_qgroup_reserve_data(inode, cur_offset,
-					last_byte - cur_offset);
+			ret = btrfs_qgroup_reserve_data(inode, &data_reserved,
+					cur_offset, last_byte - cur_offset);
 			if (ret < 0)
 				break;
+			ret = 0;
 		} else {
 			/*
 			 * Do not need to reserve unwritten extent for this
@@ -2919,6 +2929,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 	if (ret != 0)
 		btrfs_free_reserved_data_space(inode, alloc_start,
 				       alloc_end - cur_offset);
+	extent_changeset_release(&data_reserved);
 	return ret;
 }
 
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 3bbb8f095953..1e0b27e198a0 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -400,6 +400,7 @@ int btrfs_save_ino_cache(struct btrfs_root *root,
 	struct btrfs_path *path;
 	struct inode *inode;
 	struct btrfs_block_rsv *rsv;
+	struct extent_changeset data_reserved;
 	u64 num_bytes;
 	u64 alloc_hint = 0;
 	int ret;
@@ -419,6 +420,8 @@ int btrfs_save_ino_cache(struct btrfs_root *root,
 	if (!btrfs_test_opt(fs_info, INODE_MAP_CACHE))
 		return 0;
 
+	extent_changeset_init(&data_reserved);
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -492,7 +495,7 @@ int btrfs_save_ino_cache(struct btrfs_root *root,
 	/* Just to make sure we have enough space */
 	prealloc += 8 * PAGE_SIZE;
 
-	ret = btrfs_delalloc_reserve_space(inode, 0, prealloc);
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, 0, prealloc);
 	if (ret)
 		goto out_put;
 
@@ -516,6 +519,7 @@ int btrfs_save_ino_cache(struct btrfs_root *root,
 	trans->bytes_reserved = num_bytes;
 
 	btrfs_free_path(path);
+	extent_changeset_release(&data_reserved);
 	return ret;
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index aabab731d38e..76d8d0ad733f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1937,12 +1937,15 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	struct btrfs_writepage_fixup *fixup;
 	struct btrfs_ordered_extent *ordered;
 	struct extent_state *cached_state = NULL;
+	struct extent_changeset data_reserved;
 	struct page *page;
 	struct inode *inode;
 	u64 page_start;
 	u64 page_end;
 	int ret;
 
+	extent_changeset_init(&data_reserved);
+
 	fixup = container_of(work, struct btrfs_writepage_fixup, work);
 	page = fixup->page;
 again:
@@ -1974,7 +1977,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 		goto again;
 	}
 
-	ret = btrfs_delalloc_reserve_space(inode, page_start,
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
 					   PAGE_SIZE);
 	if (ret) {
 		mapping_set_error(page->mapping, ret);
@@ -1994,6 +1997,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	unlock_page(page);
 	put_page(page);
 	kfree(fixup);
+	extent_changeset_release(&data_reserved);
 }
 
 /*
@@ -4656,6 +4660,7 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_ordered_extent *ordered;
 	struct extent_state *cached_state = NULL;
+	struct extent_changeset data_reserved;
 	char *kaddr;
 	u32 blocksize = fs_info->sectorsize;
 	pgoff_t index = from >> PAGE_SHIFT;
@@ -4666,11 +4671,13 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 	u64 block_start;
 	u64 block_end;
 
+	extent_changeset_init(&data_reserved);
+
 	if ((offset & (blocksize - 1)) == 0 &&
 	    (!len || ((len & (blocksize - 1)) == 0)))
 		goto out;
 
-	ret = btrfs_delalloc_reserve_space(inode,
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
 			round_down(from, blocksize), blocksize);
 	if (ret)
 		goto out;
@@ -4755,6 +4762,7 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 	unlock_page(page);
 	put_page(page);
 out:
+	extent_changeset_release(&data_reserved);
 	return ret;
 }
 
@@ -8583,6 +8591,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	struct inode *inode = file->f_mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_dio_data dio_data = { 0 };
+	struct extent_changeset data_reserved;
 	loff_t offset = iocb->ki_pos;
 	size_t count = 0;
 	int flags = 0;
@@ -8593,6 +8602,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	if (check_direct_IO(fs_info, iocb, iter, offset))
 		return 0;
 
+	extent_changeset_init(&data_reserved);
 	inode_dio_begin(inode);
 	smp_mb__after_atomic();
 
@@ -8619,7 +8629,8 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			inode_unlock(inode);
 			relock = true;
 		}
-		ret = btrfs_delalloc_reserve_space(inode, offset, count);
+		ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
+						   offset, count);
 		if (ret)
 			goto out;
 		dio_data.outstanding_extents = count_max_extents(count);
@@ -8676,6 +8687,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	if (relock)
 		inode_lock(inode);
 
+	extent_changeset_release(&data_reserved);
 	return ret;
 }
 
@@ -8907,6 +8919,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_ordered_extent *ordered;
 	struct extent_state *cached_state = NULL;
+	struct extent_changeset data_reserved;
 	char *kaddr;
 	unsigned long zero_start;
 	loff_t size;
@@ -8917,6 +8930,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	u64 page_end;
 	u64 end;
 
+	extent_changeset_init(&data_reserved);
 	reserved_space = PAGE_SIZE;
 
 	sb_start_pagefault(inode->i_sb);
@@ -8932,7 +8946,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * end up waiting indefinitely to get a lock on the page currently
 	 * being processed by btrfs_page_mkwrite() function.
 	 */
-	ret = btrfs_delalloc_reserve_space(inode, page_start,
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
 					   reserved_space);
 	if (!ret) {
 		ret = file_update_time(vma->vm_file);
@@ -9037,6 +9051,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 out_unlock:
 	if (!ret) {
 		sb_end_pagefault(inode->i_sb);
+		extent_changeset_release(&data_reserved);
 		return VM_FAULT_LOCKED;
 	}
 	unlock_page(page);
@@ -9044,6 +9059,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	btrfs_delalloc_release_space(inode, page_start, reserved_space);
 out_noreserve:
 	sb_end_pagefault(inode->i_sb);
+	extent_changeset_release(&data_reserved);
 	return ret;
 }
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d8539979b44f..8bcda584d6cb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1127,15 +1127,18 @@ static int cluster_pages_for_defrag(struct inode *inode,
 	struct btrfs_ordered_extent *ordered;
 	struct extent_state *cached_state = NULL;
 	struct extent_io_tree *tree;
+	struct extent_changeset data_reserved;
 	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
 
+	extent_changeset_init(&data_reserved);
+
 	file_end = (isize - 1) >> PAGE_SHIFT;
 	if (!isize || start_index > file_end)
 		return 0;
 
 	page_cnt = min_t(u64, (u64)num_pages, (u64)file_end - start_index + 1);
 
-	ret = btrfs_delalloc_reserve_space(inode,
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
 			start_index << PAGE_SHIFT,
 			page_cnt << PAGE_SHIFT);
 	if (ret)
@@ -1247,6 +1250,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 		unlock_page(pages[i]);
 		put_page(pages[i]);
 	}
+	extent_changeset_release(&data_reserved);
 	return i_done;
 out:
 	for (i = 0; i < i_done; i++) {
@@ -1256,6 +1260,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 	btrfs_delalloc_release_space(inode,
 			start_index << PAGE_SHIFT,
 			page_cnt << PAGE_SHIFT);
+	extent_changeset_release(&data_reserved);
 	return ret;
 
 }
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4b0f4d2bd63c..5efbffef3dc3 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2861,43 +2861,52 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
  * Return <0 for error (including -EQUOT)
  *
  * NOTE: this function may sleep for memory allocation.
+ *       if btrfs_qgroup_reserve_data() is called multiple times with
+ *       same @reserved, caller must ensure when error happens it's OK
+ *       to free *ALL* reserved space.
  */
-int btrfs_qgroup_reserve_data(struct inode *inode, u64 start, u64 len)
+int btrfs_qgroup_reserve_data(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct extent_changeset changeset;
 	struct ulist_node *unode;
 	struct ulist_iterator uiter;
+	u64 orig_reserved;
+	u64 to_reserve;
 	int ret;
 
 	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &root->fs_info->flags) ||
 	    !is_fstree(root->objectid) || len == 0)
 		return 0;
 
-	changeset.bytes_changed = 0;
-	ulist_init(&changeset.range_changed);
+	/* @reserved parameter is mandatory for qgroup */
+	if (WARN_ON(!reserved))
+		return -EINVAL;
+	/* Record already reserved space */
+	orig_reserved = reserved->bytes_changed;
 	ret = set_record_extent_bits(&BTRFS_I(inode)->io_tree, start,
-			start + len -1, EXTENT_QGROUP_RESERVED, &changeset);
+			start + len -1, EXTENT_QGROUP_RESERVED, reserved);
+
+	/* Newly reserved space */
+	to_reserve = reserved->bytes_changed - orig_reserved;
 	trace_btrfs_qgroup_reserve_data(inode, start, len,
-					changeset.bytes_changed,
-					QGROUP_RESERVE);
+					to_reserve, QGROUP_RESERVE);
 	if (ret < 0)
 		goto cleanup;
-	ret = qgroup_reserve(root, changeset.bytes_changed, true);
+	ret = qgroup_reserve(root, to_reserve, true);
 	if (ret < 0)
 		goto cleanup;
 
-	ulist_release(&changeset.range_changed);
 	return ret;
 
 cleanup:
-	/* cleanup already reserved ranges */
+	/* cleanup *ALL* already reserved ranges */
 	ULIST_ITER_INIT(&uiter);
-	while ((unode = ulist_next(&changeset.range_changed, &uiter)))
+	while ((unode = ulist_next(&reserved->range_changed, &uiter)))
 		clear_extent_bit(&BTRFS_I(inode)->io_tree, unode->val,
 				 unode->aux, EXTENT_QGROUP_RESERVED, 0, 0, NULL,
 				 GFP_NOFS);
-	ulist_release(&changeset.range_changed);
+	extent_changeset_release(reserved);
 	return ret;
 }
 
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index d8debeee45a3..cc7a27a8af27 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -198,7 +198,8 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
 #endif
 
 /* New io_tree based accurate qgroup reserve API */
-int btrfs_qgroup_reserve_data(struct inode *inode, u64 start, u64 len);
+int btrfs_qgroup_reserve_data(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len);
 int btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len);
 int btrfs_qgroup_free_data(struct inode *inode, u64 start, u64 len);
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index ddbde0f08365..3439376c2a78 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3093,11 +3093,13 @@ int prealloc_file_extent_cluster(struct inode *inode,
 	u64 prealloc_start = cluster->start - offset;
 	u64 prealloc_end = cluster->end - offset;
 	u64 cur_offset;
+	struct extent_changeset data_reserved;
 
 	BUG_ON(cluster->start != cluster->boundary[0]);
+	extent_changeset_init(&data_reserved);
 	inode_lock(inode);
 
-	ret = btrfs_check_data_free_space(inode, prealloc_start,
+	ret = btrfs_check_data_free_space(inode, &data_reserved, prealloc_start,
 					  prealloc_end + 1 - prealloc_start);
 	if (ret)
 		goto out;
@@ -3129,6 +3131,7 @@ int prealloc_file_extent_cluster(struct inode *inode,
 				       prealloc_end + 1 - cur_offset);
 out:
 	inode_unlock(inode);
+	extent_changeset_release(&data_reserved);
 	return ret;
 }
 
-- 
2.11.1

* [PATCH 9/9] btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
                   ` (7 preceding siblings ...)
  2017-02-27  7:10 ` [PATCH 8/9] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions Qu Wenruo
@ 2017-02-27  7:10 ` Qu Wenruo
  2017-03-06  8:08 ` [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
  2017-03-07 15:13 ` David Sterba
  10 siblings, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2017-02-27  7:10 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

[BUG]
For the following case, btrfs can underflow qgroup reserved space
in the error path:
(Page size 4K, function names without the "btrfs_" prefix)

         Task A                  |             Task B
----------------------------------------------------------------------
Buffered_write [0, 2K)           |
|- check_data_free_space()       |
|  |- qgroup_reserve_data()      |
|     Range aligned to page      |
|     range [0, 4K)          <<< |
|     4K bytes reserved      <<< |
|- copy pages to page cache      |
                                 | Buffered_write [2K, 4K)
                                 | |- check_data_free_space()
                                 | |  |- qgroup_reserved_data()
                                 | |     Range aligned to page
                                 | |     range [0, 4K)
                                 | |     Already reserved by A <<<
                                 | |     0 bytes reserved      <<<
                                 | |- delalloc_reserve_metadata()
                                 | |  And it *FAILED* (Maybe EQUOTA)
                                 | |- free_reserved_data_space()
                                      |- qgroup_free_data()
                                         Range aligned to page range
                                         [0, 4K)
                                         Freeing 4K
(Special thanks to Chandan for the detailed report and analysis)

[CAUSE]
In the trace above, Task B frees the reserved data range [0, 4K), which
was actually reserved by Task A.

Then at writeback time, the page dirtied by Task A goes through the
writeback routine, which frees the 4K of reserved data space again at
file extent insert time, causing the qgroup underflow.

[FIX]
Add a @reserved parameter to btrfs_qgroup_free_data() so it only frees
data ranges recorded as reserved by a previous
btrfs_qgroup_reserve_data() call.
In the above case, Task B will then free 0 bytes, so no underflow.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/ctree.h       |  6 +++--
 fs/btrfs/extent-tree.c | 12 +++++----
 fs/btrfs/file.c        | 29 +++++++++++---------
 fs/btrfs/inode.c       | 29 ++++++++++----------
 fs/btrfs/ioctl.c       |  4 +--
 fs/btrfs/qgroup.c      | 72 ++++++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/qgroup.h      |  3 ++-
 fs/btrfs/relocation.c  |  8 +++---
 8 files changed, 117 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e8cf7f2a01d6..b2ae816ea39c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2689,7 +2689,10 @@ enum btrfs_flush_state {
 int btrfs_check_data_free_space(struct inode *inode,
 			struct extent_changeset *reserved, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
-void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
+void btrfs_free_reserved_data_space(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len);
+void btrfs_delalloc_release_space(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len);
 void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
 					    u64 len);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
@@ -2708,7 +2711,6 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
 int btrfs_delalloc_reserve_space(struct inode *inode,
 			struct extent_changeset *reserved, u64 start, u64 len);
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
 					      unsigned short type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 787f9155b754..b90c32b75b16 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4335,7 +4335,8 @@ void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
  * This one will handle the per-inode data rsv map for accurate reserved
  * space framework.
  */
-void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
+void btrfs_free_reserved_data_space(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 
@@ -4345,7 +4346,7 @@ void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
 	start = round_down(start, root->fs_info->sectorsize);
 
 	btrfs_free_reserved_data_space_noquota(inode, start, len);
-	btrfs_qgroup_free_data(inode, start, len);
+	btrfs_qgroup_free_data(inode, reserved, start, len);
 }
 
 static void force_metadata_allocation(struct btrfs_fs_info *info)
@@ -6150,7 +6151,7 @@ int btrfs_delalloc_reserve_space(struct inode *inode,
 		return ret;
 	ret = btrfs_delalloc_reserve_metadata(inode, len);
 	if (ret < 0)
-		btrfs_free_reserved_data_space(inode, start, len);
+		btrfs_free_reserved_data_space(inode, reserved, start, len);
 	return ret;
 }
 
@@ -6169,10 +6170,11 @@ int btrfs_delalloc_reserve_space(struct inode *inode,
  * list if there are no delalloc bytes left.
  * Also it will handle the qgroup reserved space.
  */
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len)
+void btrfs_delalloc_release_space(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len)
 {
 	btrfs_delalloc_release_metadata(inode, len);
-	btrfs_free_reserved_data_space(inode, start, len);
+	btrfs_free_reserved_data_space(inode, reserved, start, len);
 }
 
 static int update_block_group(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index aa2aaeadcb8c..53240eff0727 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1607,8 +1607,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 		ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
 		if (ret) {
 			if (!only_release_metadata)
-				btrfs_free_reserved_data_space(inode, pos,
-							       write_bytes);
+				btrfs_free_reserved_data_space(inode,
+						&data_reserved, pos,
+						write_bytes);
 			else
 				btrfs_end_write_no_snapshoting(root);
 			break;
@@ -1690,8 +1691,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 				__pos = round_down(pos,
 						   fs_info->sectorsize) +
 					(dirty_pages << PAGE_SHIFT);
-				btrfs_delalloc_release_space(inode, __pos,
-							     release_bytes);
+				btrfs_delalloc_release_space(inode,
+						&data_reserved, __pos,
+						release_bytes);
 			}
 		}
 
@@ -1745,9 +1747,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 			btrfs_end_write_no_snapshoting(root);
 			btrfs_delalloc_release_metadata(inode, release_bytes);
 		} else {
-			btrfs_delalloc_release_space(inode,
-						round_down(pos, fs_info->sectorsize),
-						release_bytes);
+			btrfs_delalloc_release_space(inode, &data_reserved,
+					round_down(pos, fs_info->sectorsize),
+					release_bytes);
 		}
 	}
 
@@ -2868,8 +2870,8 @@ static long btrfs_fallocate(struct file *file, int mode,
 			 * range, free reserved data space first, otherwise
 			 * it'll result in false ENOSPC error.
 			 */
-			btrfs_free_reserved_data_space(inode, cur_offset,
-				last_byte - cur_offset);
+			btrfs_free_reserved_data_space(inode, &data_reserved,
+					cur_offset, last_byte - cur_offset);
 		}
 		free_extent_map(em);
 		cur_offset = last_byte;
@@ -2888,8 +2890,9 @@ static long btrfs_fallocate(struct file *file, int mode,
 					range->len, 1 << inode->i_blkbits,
 					offset + len, &alloc_hint);
 		else
-			btrfs_free_reserved_data_space(inode, range->start,
-						       range->len);
+			btrfs_free_reserved_data_space(inode,
+					&data_reserved, range->start,
+					range->len);
 		list_del(&range->list);
 		kfree(range);
 	}
@@ -2927,8 +2930,8 @@ static long btrfs_fallocate(struct file *file, int mode,
 	inode_unlock(inode);
 	/* Let go of our reservation. */
 	if (ret != 0)
-		btrfs_free_reserved_data_space(inode, alloc_start,
-				       alloc_end - cur_offset);
+		btrfs_free_reserved_data_space(inode, &data_reserved,
+				alloc_start, alloc_end - cur_offset);
 	extent_changeset_release(&data_reserved);
 	return ret;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 76d8d0ad733f..3c1ae9b001a8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -325,7 +325,7 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
 	 * And at reserve time, it's always aligned to page size, so
 	 * just free one page here.
 	 */
-	btrfs_qgroup_free_data(inode, 0, PAGE_SIZE);
+	btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE);
 	btrfs_free_path(path);
 	btrfs_end_transaction(trans);
 	return ret;
@@ -2829,7 +2829,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		 * space for NOCOW range.
 		 * As NOCOW won't cause a new delayed ref, just free the space
 		 */
-		btrfs_qgroup_free_data(inode, ordered_extent->file_offset,
+		btrfs_qgroup_free_data(inode, NULL, ordered_extent->file_offset,
 				       ordered_extent->len);
 		btrfs_ordered_update_i_size(inode, 0, ordered_extent);
 		if (nolock)
@@ -4685,7 +4685,7 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 again:
 	page = find_or_create_page(mapping, index, mask);
 	if (!page) {
-		btrfs_delalloc_release_space(inode,
+		btrfs_delalloc_release_space(inode, &data_reserved,
 				round_down(from, blocksize),
 				blocksize);
 		ret = -ENOMEM;
@@ -4757,7 +4757,7 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 
 out_unlock:
 	if (ret)
-		btrfs_delalloc_release_space(inode, block_start,
+		btrfs_delalloc_release_space(inode, &data_reserved, block_start,
 					     blocksize);
 	unlock_page(page);
 	put_page(page);
@@ -5156,7 +5156,7 @@ static void evict_inode_truncate_pages(struct inode *inode)
 		 * Note, end is the bytenr of last byte, so we need + 1 here.
 		 */
 		if (state->state & EXTENT_DELALLOC)
-			btrfs_qgroup_free_data(inode, start, end - start + 1);
+			btrfs_qgroup_free_data(inode, NULL, start, end - start + 1);
 
 		clear_extent_bit(io_tree, start, end,
 				 EXTENT_LOCKED | EXTENT_DIRTY |
@@ -8662,8 +8662,8 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		current->journal_info = NULL;
 		if (ret < 0 && ret != -EIOCBQUEUED) {
 			if (dio_data.reserve)
-				btrfs_delalloc_release_space(inode, offset,
-							     dio_data.reserve);
+				btrfs_delalloc_release_space(inode, &data_reserved,
+					offset, dio_data.reserve);
 			/*
 			 * On error we might have left some ordered extents
 			 * without submitting corresponding bios for them, so
@@ -8678,8 +8678,8 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 					dio_data.unsubmitted_oe_range_start,
 					0);
 		} else if (ret >= 0 && (size_t)ret < count)
-			btrfs_delalloc_release_space(inode, offset,
-						     count - (size_t)ret);
+			btrfs_delalloc_release_space(inode, &data_reserved,
+					offset, count - (size_t)ret);
 	}
 out:
 	if (wakeup)
@@ -8877,7 +8877,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 *    free the entire extent.
 	 */
 	if (PageDirty(page))
-		btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
+		btrfs_qgroup_free_data(inode, NULL, page_start, PAGE_SIZE);
 	if (!inode_evicting) {
 		clear_extent_bit(tree, page_start, page_end,
 				 EXTENT_LOCKED | EXTENT_DIRTY |
@@ -8999,8 +8999,8 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 			spin_lock(&BTRFS_I(inode)->lock);
 			BTRFS_I(inode)->outstanding_extents++;
 			spin_unlock(&BTRFS_I(inode)->lock);
-			btrfs_delalloc_release_space(inode, page_start,
-						PAGE_SIZE - reserved_space);
+			btrfs_delalloc_release_space(inode, &data_reserved,
+					page_start, PAGE_SIZE - reserved_space);
 		}
 	}
 
@@ -9056,7 +9056,8 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	unlock_page(page);
 out:
-	btrfs_delalloc_release_space(inode, page_start, reserved_space);
+	btrfs_delalloc_release_space(inode, &data_reserved, page_start,
+				     reserved_space);
 out_noreserve:
 	sb_end_pagefault(inode->i_sb);
 	extent_changeset_release(&data_reserved);
@@ -10411,7 +10412,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 			btrfs_end_transaction(trans);
 	}
 	if (cur_offset < end)
-		btrfs_free_reserved_data_space(inode, cur_offset,
+		btrfs_free_reserved_data_space(inode, NULL, cur_offset,
 			end - cur_offset + 1);
 	return ret;
 }
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 8bcda584d6cb..94d0c84ea7e5 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1229,7 +1229,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 		spin_lock(&BTRFS_I(inode)->lock);
 		BTRFS_I(inode)->outstanding_extents++;
 		spin_unlock(&BTRFS_I(inode)->lock);
-		btrfs_delalloc_release_space(inode,
+		btrfs_delalloc_release_space(inode, &data_reserved,
 				start_index << PAGE_SHIFT,
 				(page_cnt - i_done) << PAGE_SHIFT);
 	}
@@ -1257,7 +1257,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 		unlock_page(pages[i]);
 		put_page(pages[i]);
 	}
-	btrfs_delalloc_release_space(inode,
+	btrfs_delalloc_release_space(inode, &data_reserved,
 			start_index << PAGE_SHIFT,
 			page_cnt << PAGE_SHIFT);
 	extent_changeset_release(&data_reserved);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 5efbffef3dc3..bb7e42f11566 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2910,13 +2910,72 @@ int btrfs_qgroup_reserve_data(struct inode *inode,
 	return ret;
 }
 
-static int __btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len,
-				       int free)
+/* Free ranges specified by @reserved, normally in error path */
+static int qgroup_free_reserved_data(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct ulist_node *unode;
+	struct ulist_iterator uiter;
+	struct extent_changeset changeset;
+	int freed = 0;
+	int ret;
+
+	extent_changeset_init(&changeset);
+	len = round_up(start + len, root->fs_info->sectorsize);
+	start = round_down(start, root->fs_info->sectorsize);
+
+	ULIST_ITER_INIT(&uiter);
+	while ((unode = ulist_next(&reserved->range_changed, &uiter))) {
+		u64 range_start = unode->val;
+		/* unode->aux is the inclusive end */
+		u64 range_len = unode->aux - range_start + 1;
+		u64 free_start;
+		u64 free_len;
+
+		extent_changeset_release(&changeset);
+
+		/* Only free range in range [start, start + len) */
+		if (range_start >= start + len ||
+		    range_start + range_len <= start)
+			continue;
+		free_start = max(range_start, start);
+		free_len = min(start + len, range_start + range_len) -
+			   free_start;
+		/*
+		 * TODO: Also modify reserved->ranges_reserved to reflect
+		 * the modification.
+		 *
+		 * However, as long as we free qgroup reserved space according
+		 * to EXTENT_QGROUP_RESERVED, we won't double free, so there
+		 * is no need to rush.
+		 */
+		ret = clear_record_extent_bits(&BTRFS_I(inode)->io_tree,
+				free_start, free_start + free_len - 1,
+				EXTENT_QGROUP_RESERVED, &changeset);
+		if (ret < 0)
+			goto out;
+		freed += changeset.bytes_changed;
+	}
+	btrfs_qgroup_free_refroot(root->fs_info, root->objectid, freed);
+	ret = freed;
+out:
+	extent_changeset_release(&changeset);
+	return ret;
+}
+
+static int __btrfs_qgroup_release_data(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len,
+			int free)
 {
 	struct extent_changeset changeset;
 	int trace_op = QGROUP_RELEASE;
 	int ret;
 
+	/* In the release case, we shouldn't have been given @reserved */
+	WARN_ON(!free && reserved);
+	if (free && reserved)
+		return qgroup_free_reserved_data(inode, reserved, start, len);
 	changeset.bytes_changed = 0;
 	ulist_init(&changeset.range_changed);
 	ret = clear_record_extent_bits(&BTRFS_I(inode)->io_tree, start, 
@@ -2943,14 +3002,17 @@ static int __btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len,
  *
  * Should be called when a range of pages get invalidated before reaching disk.
  * Or for error cleanup case.
+ * If @reserved is given, only the reserved ranges in [@start, @start + @len)
+ * will be freed.
  *
  * For data written to disk, use btrfs_qgroup_release_data().
  *
  * NOTE: This function may sleep for memory allocation.
  */
-int btrfs_qgroup_free_data(struct inode *inode, u64 start, u64 len)
+int btrfs_qgroup_free_data(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len)
 {
-	return __btrfs_qgroup_release_data(inode, start, len, 1);
+	return __btrfs_qgroup_release_data(inode, reserved, start, len, 1);
 }
 
 /*
@@ -2970,7 +3032,7 @@ int btrfs_qgroup_free_data(struct inode *inode, u64 start, u64 len)
  */
 int btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len)
 {
-	return __btrfs_qgroup_release_data(inode, start, len, 0);
+	return __btrfs_qgroup_release_data(inode, NULL, start, len, 0);
 }
 
 int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index cc7a27a8af27..b59464e7f988 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -201,7 +201,8 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
 int btrfs_qgroup_reserve_data(struct inode *inode,
 			struct extent_changeset *reserved, u64 start, u64 len);
 int btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len);
-int btrfs_qgroup_free_data(struct inode *inode, u64 start, u64 len);
+int btrfs_qgroup_free_data(struct inode *inode,
+			struct extent_changeset *reserved, u64 start, u64 len);
 
 int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 			      bool enforce);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3439376c2a78..7de3186dafcf 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3115,8 +3115,8 @@ int prealloc_file_extent_cluster(struct inode *inode,
 		lock_extent(&BTRFS_I(inode)->io_tree, start, end);
 		num_bytes = end + 1 - start;
 		if (cur_offset < start)
-			btrfs_free_reserved_data_space(inode, cur_offset,
-					start - cur_offset);
+			btrfs_free_reserved_data_space(inode, &data_reserved,
+					cur_offset, start - cur_offset);
 		ret = btrfs_prealloc_file_range(inode, 0, start,
 						num_bytes, num_bytes,
 						end + 1, &alloc_hint);
@@ -3127,8 +3127,8 @@ int prealloc_file_extent_cluster(struct inode *inode,
 		nr++;
 	}
 	if (cur_offset < prealloc_end)
-		btrfs_free_reserved_data_space(inode, cur_offset,
-				       prealloc_end + 1 - cur_offset);
+		btrfs_free_reserved_data_space(inode, &data_reserved,
+				cur_offset, prealloc_end + 1 - cur_offset);
 out:
 	inode_unlock(inode);
 	extent_changeset_release(&data_reserved);
-- 
2.11.1




^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
                   ` (8 preceding siblings ...)
  2017-02-27  7:10 ` [PATCH 9/9] btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges Qu Wenruo
@ 2017-03-06  8:08 ` Qu Wenruo
  2017-03-06 16:44   ` David Sterba
  2017-03-08 14:24   ` Jeff Mahoney
  2017-03-07 15:13 ` David Sterba
  10 siblings, 2 replies; 25+ messages in thread
From: Qu Wenruo @ 2017-03-06  8:08 UTC (permalink / raw)
  To: linux-btrfs, dsterba, clm

Any response?

These patches have been around for at least two kernel releases.
They are all bug fixes, and they fix bugs that have already been
reported.

I don't see any reason why they should be delayed for so long.

Thanks,
Qu

At 02/27/2017 03:10 PM, Qu Wenruo wrote:
> Pull request can be fetched from my github:
> https://github.com/adam900710/linux.git qgroup_fixes_for_4.11
>
> The base is 6288d6eabc7505f42dda34a2c2962f91914be3a4.
> Author: Liu Bo <bo.li.liu@oracle.com>
> Date:   Tue Feb 21 12:12:58 2017 -0800
>
>     Btrfs: use the correct type when creating cow dio extent
>
> These patches have been on the mailing list for almost 3 months.
> For example, the tracepoint patch was last updated on 2016/12/09.
>
> With this patchset btrfs can pass all xfstests qgroup tests.
>
> This pull request should be good for 4.11, as these are all bug fixes
> and they have been delayed several times.
>
> I don't know if this patchset will be delayed again because someone
> wants to clean up something else, causing rebase conflicts that delay
> the fixes.
> But I must say, it's very frustrating to see bug fixes get dropped
> again and again just due to new cleanups and a lack of reviewers.
>
> Despite all that, this pull request includes:
> 1) Fix for the inode_cache mount option
>    Although no one really cares about the inode_cache mount option, it
>    has been screwing up qgroup for a long time.
>    It breaks not only qgroup tests that use golden output, but also
>    new test cases that use btrfsck to verify qgroups.
>
> 2) Fix for the btrfs/104 kernel warning
>    This is caused by enabling quota while dirty buffers have not yet
>    been written to disk.
>
>    Fixed by checking the EXTENT_QGROUP_RESERVED flag instead of just
>    decreasing qgroup->reserved.
>
> 3) Fix qgroup underflow caused by freeing ranges not reserved by the
>    caller
>    Mainly exposed by Chandan on PPC64.
>
>    It's possible for a buffered write to be blocked while reserving
>    metadata, in which case the error path releases reserved space for
>    both data and metadata.
>
>    However, the data range can already have been reserved by another
>    process writing the same page.
>
>    In that case, the reserved data space can be freed by two processes,
>    one in the error path and one in the normal write routine, causing
>    underflow.
>
>    Fixed by checking whether the data range was reserved by ourselves,
>    and only freeing it if it was.
>
> Update since 2016/12/09:
>   Rebased to latest for-linux-4.11.
>
>   Add missing reviewed-by and tested-by tags.
>
>   Add more comment for btrfs_qgroup_reserve_data() for error handling.
>
>   Add more comment for qgroup_free_reserved_data() for later
>   enhancement (not function enhancement).
>
> Qu Wenruo (9):
>   btrfs: qgroup: Add trace point for qgroup reserved space
>   btrfs: qgroup: Re-arrange tracepoint timing to co-operate with
>     reserved space tracepoint
>   btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount
>     option
>   btrfs: qgroup: Add quick exit for non-fs extents
>   btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
>   btrfs: qgroup: Return actually freed bytes for qgroup release or free
>     data
>   btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered
>     write and quota enable
>   btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
>   btrfs: qgroup: Fix qgroup reserved space underflow by only freeing
>     reserved ranges
>
>  fs/btrfs/ctree.h             |  12 ++-
>  fs/btrfs/extent-tree.c       |  31 +++---
>  fs/btrfs/extent_io.h         |  14 ++-
>  fs/btrfs/file.c              |  46 +++++----
>  fs/btrfs/inode-map.c         |   6 +-
>  fs/btrfs/inode.c             |  64 +++++++++----
>  fs/btrfs/ioctl.c             |  11 ++-
>  fs/btrfs/qgroup.c            | 221 ++++++++++++++++++++++++++++++++-----------
>  fs/btrfs/qgroup.h            |  14 +--
>  fs/btrfs/relocation.c        |  13 ++-
>  fs/btrfs/transaction.c       |  21 ++--
>  include/trace/events/btrfs.h |  43 +++++++++
>  12 files changed, 358 insertions(+), 138 deletions(-)
>




* Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11
  2017-03-06  8:08 ` [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
@ 2017-03-06 16:44   ` David Sterba
  2017-03-06 17:56     ` Chris Mason
  2017-03-08 14:24   ` Jeff Mahoney
  1 sibling, 1 reply; 25+ messages in thread
From: David Sterba @ 2017-03-06 16:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: clm, linux-btrfs

On Mon, Mar 06, 2017 at 04:08:41PM +0800, Qu Wenruo wrote:
> Any response?
> 
> These patches have been around for at least two kernel releases.
> They are all bug fixes, and they fix bugs that have already been
> reported.
>
> I don't see any reason why they should be delayed for so long.

The reason is not enough review for the whole series. The delay is
unfortunate, as is the near-zero number of people capable of and willing
to review it, despite your efforts to keep the series up to date with
the current code.


* Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11
  2017-03-06 16:44   ` David Sterba
@ 2017-03-06 17:56     ` Chris Mason
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Mason @ 2017-03-06 17:56 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs

On 03/06/2017 11:44 AM, David Sterba wrote:
> On Mon, Mar 06, 2017 at 04:08:41PM +0800, Qu Wenruo wrote:
>> Any response?
>>
>> These patches have been around for at least two kernel releases.
>> They are all bug fixes, and they fix bugs that have already been
>> reported.
>>
>> I don't see any reason why they should be delayed for so long.
>
> The reason is not enough review for the whole series. The delay is
> unfortunate, as is the near-zero number of people capable of and
> willing to review it, despite your efforts to keep the series up to
> date with the current code.
>

I've hesitated to take more series of qgroup fixes because past sets
have ended up causing problems down the line.  I'm queuing it for v4.12
while I collect tests and reviews, though; it's definitely not fair for
you to have to keep rebasing it without suggestions on what to fix.

Thanks!

-chris


* Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11
  2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
                   ` (9 preceding siblings ...)
  2017-03-06  8:08 ` [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
@ 2017-03-07 15:13 ` David Sterba
  2017-03-08  0:28   ` Qu Wenruo
  10 siblings, 1 reply; 25+ messages in thread
From: David Sterba @ 2017-03-07 15:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba, clm

On Mon, Feb 27, 2017 at 03:10:30PM +0800, Qu Wenruo wrote:
> Pull request can be fetched from my github:
> https://github.com/adam900710/linux.git qgroup_fixes_for_4.11 
> 
> The base is 6288d6eabc7505f42dda34a2c2962f91914be3a4.
> Author: Liu Bo <bo.li.liu@oracle.com>
> Date:   Tue Feb 21 12:12:58 2017 -0800
> 
>     Btrfs: use the correct type when creating cow dio extent
> 
> These patches have been on the mailing list for almost 3 months.
> For example, the tracepoint patch was last updated on 2016/12/09.
>
> With this patchset btrfs can pass all xfstests qgroup tests.
>
> This pull request should be good for 4.11, as these are all bug fixes
> and they have been delayed several times.
>
> I don't know if this patchset will be delayed again because someone
> wants to clean up something else, causing rebase conflicts that delay
> the fixes.

I have rebased the branch on top of the cleanups and added it as a
separate branch to for-next. For the next dev cycle I'll delay merging
the intrusive cleanups until the end.

> But I must say, it's very frustrating to see bug fixes get dropped
> again and again just due to new cleanups and a lack of reviewers.

That's the life of a kernel developer.


* Re: [PATCH 1/9] btrfs: qgroup: Add trace point for qgroup reserved space
  2017-02-27  7:10 ` [PATCH 1/9] btrfs: qgroup: Add trace point for qgroup reserved space Qu Wenruo
@ 2017-03-07 15:19   ` David Sterba
  2017-03-07 16:30   ` Jeff Mahoney
  1 sibling, 0 replies; 25+ messages in thread
From: David Sterba @ 2017-03-07 15:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba, clm

On Mon, Feb 27, 2017 at 03:10:31PM +0800, Qu Wenruo wrote:
> +TRACE_EVENT(qgroup_update_reserve,
...
> +	TP_printk_btrfs("qgid = %llu, cur_reserved = %llu, diff = %lld",
> +		__entry->qgid, __entry->cur_reserved, __entry->diff)

> +);
> +
> +TRACE_EVENT(qgroup_meta_reserve,
...
> +	TP_printk_btrfs("refroot = %llu, diff = %lld",
> +		__entry->refroot, __entry->diff)
> +);

Please update the format strings,
https://btrfs.wiki.kernel.org/index.php/Development_notes#Tracepoints


* Re: [PATCH 1/9] btrfs: qgroup: Add trace point for qgroup reserved space
  2017-02-27  7:10 ` [PATCH 1/9] btrfs: qgroup: Add trace point for qgroup reserved space Qu Wenruo
  2017-03-07 15:19   ` David Sterba
@ 2017-03-07 16:30   ` Jeff Mahoney
  2017-03-09  6:02     ` Qu Wenruo
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff Mahoney @ 2017-03-07 16:30 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, dsterba, clm



On 2/27/17 2:10 AM, Qu Wenruo wrote:
> Introduce the following trace points:
> qgroup_update_reserve
> qgroup_meta_reserve
>
> These trace points are handy for tracing problems related to qgroup
> reserved space.
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/qgroup.c            | 15 +++++++++++++++
>  include/trace/events/btrfs.h | 43 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 58 insertions(+)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index a5da750c1087..b303d4794026 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1075,6 +1075,8 @@ static int __qgroup_excl_accounting(struct btrfs_fs_info *fs_info,
>  	qgroup->excl += sign * num_bytes;
>  	qgroup->excl_cmpr += sign * num_bytes;
>  	if (sign > 0) {
> +		trace_qgroup_update_reserve(fs_info, qgroup->qgroupid,
> +					    qgroup->reserved, -(s64)num_bytes);
>  		if (WARN_ON(qgroup->reserved < num_bytes))
>  			report_reserved_underflow(fs_info, qgroup, num_bytes);
>  		else
> @@ -1100,6 +1102,9 @@ static int __qgroup_excl_accounting(struct btrfs_fs_info *fs_info,
>  		WARN_ON(sign < 0 && qgroup->excl < num_bytes);
>  		qgroup->excl += sign * num_bytes;
>  		if (sign > 0) {
> +			trace_qgroup_update_reserve(fs_info, qgroup->qgroupid,
> +						    qgroup->reserved,
> +						    -(s64)num_bytes);
>  			if (WARN_ON(qgroup->reserved < num_bytes))
>  				report_reserved_underflow(fs_info, qgroup,
>  							  num_bytes);
> @@ -2424,6 +2429,8 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce)
>  
>  		qg = unode_aux_to_qgroup(unode);
>  
> +		trace_qgroup_update_reserve(fs_info, qg->qgroupid,
> +					    qg->reserved, num_bytes);
>  		qg->reserved += num_bytes;
>  	}
>  
> @@ -2469,6 +2476,8 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
>  
>  		qg = unode_aux_to_qgroup(unode);
>  
> +		trace_qgroup_update_reserve(fs_info, qg->qgroupid,
> +					    qg->reserved, -(s64)num_bytes);
>  		if (WARN_ON(qg->reserved < num_bytes))
>  			report_reserved_underflow(fs_info, qg, num_bytes);
>  		else
> @@ -2945,6 +2954,8 @@ int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
>  		return 0;
>  
>  	BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
> +	trace_qgroup_meta_reserve(fs_info, root->objectid,
> +				  (s64)num_bytes);
>  	ret = qgroup_reserve(root, num_bytes, enforce);
>  	if (ret < 0)
>  		return ret;
> @@ -2964,6 +2975,8 @@ void btrfs_qgroup_free_meta_all(struct btrfs_root *root)
>  	reserved = atomic_xchg(&root->qgroup_meta_rsv, 0);
>  	if (reserved == 0)
>  		return;
> +	trace_qgroup_meta_reserve(root->fs_info, root->objectid,
> +				  -(s64)reserved);
>  	btrfs_qgroup_free_refroot(fs_info, root->objectid, reserved);
>  }
>  
> @@ -2978,6 +2991,8 @@ void btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes)
>  	BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
>  	WARN_ON(atomic_read(&root->qgroup_meta_rsv) < num_bytes);
>  	atomic_sub(num_bytes, &root->qgroup_meta_rsv);
> +	trace_qgroup_meta_reserve(root->fs_info, root->objectid,
> +				  -(s64)num_bytes);
>  	btrfs_qgroup_free_refroot(fs_info, root->objectid, num_bytes);
>  }
>  
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index a3c3cab643a9..2d5799d8a4a2 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -1475,6 +1475,49 @@ TRACE_EVENT(qgroup_update_counters,
>  		  __entry->cur_new_count)
>  );
>  
> +TRACE_EVENT(qgroup_update_reserve,
> +
> +	TP_PROTO(struct btrfs_fs_info *fs_info, u64 qgid, u64 cur_reserved,
> +		 s64 diff),
> +
> +	TP_ARGS(fs_info, qgid, cur_reserved, diff),
> +
> +	TP_STRUCT__entry_btrfs(
> +		__field(	u64,	qgid			)
> +		__field(	u64,	cur_reserved		)
> +		__field(	s64,	diff			)
> +	),
> +
> +	TP_fast_assign_btrfs(fs_info,
> +		__entry->qgid		= qgid;
> +		__entry->cur_reserved	= cur_reserved;
> +		__entry->diff		= diff;
> +	),
> +
> +	TP_printk_btrfs("qgid = %llu, cur_reserved = %llu, diff = %lld",
> +		__entry->qgid, __entry->cur_reserved, __entry->diff)
> +);
> +

This is always called using a qgroup pointer.  Why not just pass it and
let the assignment clause grab the values?

> +TRACE_EVENT(qgroup_meta_reserve,
> +
> +	TP_PROTO(struct btrfs_fs_info *fs_info, u64 refroot, s64 diff),
> +
> +	TP_ARGS(fs_info, refroot, diff),
> +
> +	TP_STRUCT__entry_btrfs(
> +		__field(	u64,	refroot			)
> +		__field(	s64,	diff			)
> +	),
> +
> +	TP_fast_assign_btrfs(fs_info,
> +		__entry->refroot	= refroot;
> +		__entry->diff		= diff;
> +	),
> +
> +	TP_printk_btrfs("refroot = %llu, diff = %lld",
> +		__entry->refroot, __entry->diff)
> +);

Same here, but with btrfs_root * instead.

-Jeff

> +
>  #endif /* _TRACE_BTRFS_H */
>  
>  /* This part must be outside protection */
> 


-- 
Jeff Mahoney
SUSE Labs




* Re: [PATCH 2/9] btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint
  2017-02-27  7:10 ` [PATCH 2/9] btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint Qu Wenruo
@ 2017-03-07 19:11   ` Jeff Mahoney
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Mahoney @ 2017-03-07 19:11 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, dsterba, clm



On 2/27/17 2:10 AM, Qu Wenruo wrote:
> Newly introduced qgroup reserved space trace points are normally nested
> into several common qgroup operations.
> 
> However, some other trace points are not well placed to cooperate with
> them, causing confusing output.
>
> This patch re-arranges the trace_btrfs_qgroup_release_data() and
> trace_btrfs_qgroup_free_delayed_ref() trace points so that they are
> triggered before the reserved space ones.
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Reviewed-by: Jeff Mahoney <jeffm@suse.com>

> ---
>  fs/btrfs/qgroup.c | 10 +++++-----
>  fs/btrfs/qgroup.h |  6 +-----
>  2 files changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index b303d4794026..e46499cafdbf 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2895,14 +2895,14 @@ static int __btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len,
>  	if (ret < 0)
>  		goto out;
>  
> -	if (free) {
> -		btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info,
> -				BTRFS_I(inode)->root->objectid,
> -				changeset.bytes_changed);
> +	if (free)
>  		trace_op = QGROUP_FREE;
> -	}
>  	trace_btrfs_qgroup_release_data(inode, start, len,
>  					changeset.bytes_changed, trace_op);
> +	if (free)
> +		btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info,
> +				BTRFS_I(inode)->root->objectid,
> +				changeset.bytes_changed);
>  out:
>  	ulist_release(&changeset.range_changed);
>  	return ret;
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 26932a8a1993..92608edf0e89 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -186,15 +186,11 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
>  			 struct btrfs_qgroup_inherit *inherit);
>  void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
>  			       u64 ref_root, u64 num_bytes);
> -/*
> - * TODO: Add proper trace point for it, as btrfs_qgroup_free() is
> - * called by everywhere, can't provide good trace for delayed ref case.
> - */
>  static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info *fs_info,
>  						 u64 ref_root, u64 num_bytes)
>  {
> -	btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
>  	trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
> +	btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
>  }
>  void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);
>  
> 


-- 
Jeff Mahoney
SUSE Labs




* Re: [PATCH 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
  2017-02-27  7:10 ` [PATCH 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option Qu Wenruo
@ 2017-03-07 19:21   ` Jeff Mahoney
  2017-03-08  0:36     ` Qu Wenruo
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Mahoney @ 2017-03-07 19:21 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, dsterba, clm



On 2/27/17 2:10 AM, Qu Wenruo wrote:
> [BUG]
> The easiest way to reproduce the bug is:
> ------
>  # mkfs.btrfs -f $dev -n 16K
>  # mount $dev $mnt -o inode_cache
>  # btrfs quota enable $mnt
>  # btrfs quota rescan -w $mnt
>  # btrfs qgroup show $mnt
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          32.00KiB     32.00KiB
>              ^^ Twice the correct value
> ------
> 
> And the fstests btrfs qgroup test group can easily detect this with
> the inode_cache mount option.
> Although some of the failures are false alerts, since old test cases
> use fixed golden output, while new test cases use "btrfs check" to
> verify qgroups.
>
> [CAUSE]
> The inode_cache mount option makes commit_fs_roots() call
> btrfs_save_ino_cache() to update fs/subvol trees, generating new
> delayed refs.
>
> However, we call btrfs_qgroup_prepare_account_extents() too early,
> before commit_fs_roots().
> This makes the "old_roots" for newly generated extents always NULL.
> For the extent-freeing case, both new_roots and old_roots end up
> empty, while the correct old_roots should not be empty.
> This causes qgroup numbers to not be decreased correctly.
>
> [FIX]
> Move the call to btrfs_qgroup_prepare_account_extents() to just before
> btrfs_qgroup_account_extents(), and add the needed delayed refs
> handling, so qgroup can handle the inode_cache mount option correctly.
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Could we also fix this by excepting the free space inode from qgroups?
It seems like this is something we'd want to do anyway unless we want to
handle EDQUOT there too.

-Jeff

> ---
>  fs/btrfs/transaction.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 6b3e0fc2fe7a..1ff3ec797356 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -2130,13 +2130,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  		goto scrub_continue;
>  	}
>  
> -	/* Reocrd old roots for later qgroup accounting */
> -	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
> -	if (ret) {
> -		mutex_unlock(&fs_info->reloc_mutex);
> -		goto scrub_continue;
> -	}
> -
>  	/*
>  	 * make sure none of the code above managed to slip in a
>  	 * delayed item
> @@ -2179,6 +2172,24 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  	btrfs_free_log_root_tree(trans, fs_info);
>  
>  	/*
> +	 * commit_fs_roots() can call btrfs_save_ino_cache(), which generates
> +	 * new delayed refs. Must handle them or qgroup can be wrong.
> +	 */
> +	ret = btrfs_run_delayed_refs(trans, fs_info, (unsigned long)-1);
> +	if (ret) {
> +		mutex_unlock(&fs_info->tree_log_mutex);
> +		mutex_unlock(&fs_info->reloc_mutex);
> +		goto scrub_continue;
> +	}
> +
> +	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
> +	if (ret) {
> +		mutex_unlock(&fs_info->tree_log_mutex);
> +		mutex_unlock(&fs_info->reloc_mutex);
> +		goto scrub_continue;
> +	}
> +
> +	/*
>  	 * Since fs roots are all committed, we can get a quite accurate
>  	 * new_roots. So let's do quota accounting.
>  	 */
> 


-- 
Jeff Mahoney
SUSE Labs




* Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11
  2017-03-07 15:13 ` David Sterba
@ 2017-03-08  0:28   ` Qu Wenruo
  2017-03-14 13:32     ` David Sterba
  0 siblings, 1 reply; 25+ messages in thread
From: Qu Wenruo @ 2017-03-08  0:28 UTC (permalink / raw)
  To: dsterba, linux-btrfs, dsterba, clm



At 03/07/2017 11:13 PM, David Sterba wrote:
> On Mon, Feb 27, 2017 at 03:10:30PM +0800, Qu Wenruo wrote:
>> Pull request can be fetched from my github:
>> https://github.com/adam900710/linux.git qgroup_fixes_for_4.11
>>
>> The base is 6288d6eabc7505f42dda34a2c2962f91914be3a4.
>> Author: Liu Bo <bo.li.liu@oracle.com>
>> Date:   Tue Feb 21 12:12:58 2017 -0800
>>
>>     Btrfs: use the correct type when creating cow dio extent
>>
>> These patches have been on the mailing list for almost 3 months.
>> For example, the tracepoint patch was last updated on 2016/12/09.
>>
>> With this patchset btrfs can pass all xfstests qgroup tests.
>>
>> This pull request should be good for 4.11, as these are all bug fixes
>> and they have been delayed several times.
>>
>> I don't know if this patchset will be delayed again because someone
>> wants to clean up something else, causing rebase conflicts that delay
>> the fixes.
>
> I have rebased the branch on top of the cleanups and added it as a
> separate branch to for-next. For the next dev cycle I'll delay merging
> the intrusive cleanups until the end.
>
>> But I must say, it's very frustrating to see bug fixes get dropped
>> again and again just due to new cleanups and a lack of reviewers.
>
> That's the life of a kernel developer.

Right, that's life.

That's also why I like btrfs-progs a little more.

Thanks,
Qu




* Re: [PATCH 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
  2017-03-07 19:21   ` Jeff Mahoney
@ 2017-03-08  0:36     ` Qu Wenruo
  2017-03-08 14:23       ` Jeff Mahoney
  0 siblings, 1 reply; 25+ messages in thread
From: Qu Wenruo @ 2017-03-08  0:36 UTC (permalink / raw)
  To: Jeff Mahoney, linux-btrfs, dsterba, clm



At 03/08/2017 03:21 AM, Jeff Mahoney wrote:
> On 2/27/17 2:10 AM, Qu Wenruo wrote:
>> [BUG]
>> The easiest way to reproduce the bug is:
>> ------
>>  # mkfs.btrfs -f $dev -n 16K
>>  # mount $dev $mnt -o inode_cache
>>  # btrfs quota enable $mnt
>>  # btrfs quota rescan -w $mnt
>>  # btrfs qgroup show $mnt
>> qgroupid         rfer         excl
>> --------         ----         ----
>> 0/5          32.00KiB     32.00KiB
>>              ^^ Twice the correct value
>> ------
>>
>> And the fstests btrfs qgroup test group can easily detect this with
>> the inode_cache mount option.
>> Although some of the failures are false alerts, since old test cases
>> use fixed golden output, while new test cases use "btrfs check" to
>> verify qgroups.
>>
>> [CAUSE]
>> The inode_cache mount option makes commit_fs_roots() call
>> btrfs_save_ino_cache() to update fs/subvol trees, generating new
>> delayed refs.
>>
>> However, we call btrfs_qgroup_prepare_account_extents() too early,
>> before commit_fs_roots().
>> This makes the "old_roots" for newly generated extents always NULL.
>> For the extent-freeing case, both new_roots and old_roots end up
>> empty, while the correct old_roots should not be empty.
>> This causes qgroup numbers to not be decreased correctly.
>>
>> [FIX]
>> Move the call to btrfs_qgroup_prepare_account_extents() to just before
>> btrfs_qgroup_account_extents(), and add the needed delayed refs
>> handling, so qgroup can handle the inode_cache mount option correctly.
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>
> Could we also fix this by excepting the free space inode from qgroups?
> It seems like this is something we'd want to do anyway unless we want to
> handle EDQUOT there too.

I'm afraid that's not possible here.

Qgroup accounts any tree and data blocks that belong to a given tree 
(fs or subvolume), regardless of which inode they belong to.

And the free inode cache is by design stored in the fs/subvolume tree, 
where it is no different from a normal inode, except that it has no 
INODE_REF and its ino is FREE_INO.
(Much the same goes for the space cache inode.)

The correct solution for caching free space and inodes would be the new 
space cache tree, which puts all this info into its own tree and never 
affects the existing trees.

The only good news is that inode_cache is not commonly used and IIRC 
has bugs. Maybe it would be a good idea to deprecate the option?

Thanks,
Qu


>
> -Jeff
>
>> ---
>>  fs/btrfs/transaction.c | 25 ++++++++++++++++++-------
>>  1 file changed, 18 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index 6b3e0fc2fe7a..1ff3ec797356 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -2130,13 +2130,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>  		goto scrub_continue;
>>  	}
>>
>> -	/* Reocrd old roots for later qgroup accounting */
>> -	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>> -	if (ret) {
>> -		mutex_unlock(&fs_info->reloc_mutex);
>> -		goto scrub_continue;
>> -	}
>> -
>>  	/*
>>  	 * make sure none of the code above managed to slip in a
>>  	 * delayed item
>> @@ -2179,6 +2172,24 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>  	btrfs_free_log_root_tree(trans, fs_info);
>>
>>  	/*
>> +	 * commit_fs_roots() can call btrfs_save_ino_cache(), which generates
>> +	 * new delayed refs. Must handle them or qgroup can be wrong.
>> +	 */
>> +	ret = btrfs_run_delayed_refs(trans, fs_info, (unsigned long)-1);
>> +	if (ret) {
>> +		mutex_unlock(&fs_info->tree_log_mutex);
>> +		mutex_unlock(&fs_info->reloc_mutex);
>> +		goto scrub_continue;
>> +	}
>> +
>> +	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>> +	if (ret) {
>> +		mutex_unlock(&fs_info->tree_log_mutex);
>> +		mutex_unlock(&fs_info->reloc_mutex);
>> +		goto scrub_continue;
>> +	}
>> +
>> +	/*
>>  	 * Since fs roots are all committed, we can get a quite accurate
>>  	 * new_roots. So let's do quota accounting.
>>  	 */
>>
>
>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
  2017-03-08  0:36     ` Qu Wenruo
@ 2017-03-08 14:23       ` Jeff Mahoney
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Mahoney @ 2017-03-08 14:23 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, dsterba, clm


[-- Attachment #1.1: Type: text/plain, Size: 4836 bytes --]

On 3/7/17 7:36 PM, Qu Wenruo wrote:
> 
> 
> At 03/08/2017 03:21 AM, Jeff Mahoney wrote:
>> On 2/27/17 2:10 AM, Qu Wenruo wrote:
>>> [BUG]
>>> The easiest way to reproduce the bug is:
>>> ------
>>>  # mkfs.btrfs -f $dev -n 16K
>>>  # mount $dev $mnt -o inode_cache
>>>  # btrfs quota enable $mnt
>>>  # btrfs quota rescan -w $mnt
>>>  # btrfs qgroup show $mnt
>>> qgroupid         rfer         excl
>>> --------         ----         ----
>>> 0/5          32.00KiB     32.00KiB
>>>              ^^ Twice the correct value
>>> ------
>>>
>>> The fstests btrfs qgroup test group can easily detect this with the
>>> inode_cache mount option, although some failures are false alerts,
>>> since old test cases rely on fixed golden output; new test cases use
>>> "btrfs check" to detect qgroup mismatches.
>>>
>>> [CAUSE]
>>> The inode_cache mount option makes commit_fs_roots() call
>>> btrfs_save_ino_cache() to update the fs/subvol trees, generating new
>>> delayed refs.
>>>
>>> However, we call btrfs_qgroup_prepare_account_extents() too early,
>>> before commit_fs_roots().
>>> This makes the "old_roots" for the newly generated extents always NULL.
>>> When such an extent is freed, both new_roots and old_roots are empty,
>>> while the correct old_roots should not be empty.
>>> As a result, the qgroup numbers are not decreased correctly.
>>>
>>> [FIX]
>>> Move the call to btrfs_qgroup_prepare_account_extents() to just before
>>> btrfs_qgroup_account_extents(), and add the needed delayed_refs
>>> handling, so qgroup can handle the inode_cache mount option correctly.
>>>
>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>
>> Could we also fix this by excepting the free space inode from qgroups?
>> It seems like this is something we'd want to do anyway unless we want to
>> handle EDQUOT there too.
> 
> I'm afraid that's not possible here.

Not within the context of this patch set, but I was thinking we could
just never do the tracing up front, so there'd be nothing to check later.
I still think this is probably the right approach for non-fs roots, but
it does seem like there's no way to do it easily on a per-inode basis.

> Qgroup accounts any tree and data blocks that belong to a given tree
> (fs or subvolume), regardless of which inode they belong to.
> 
> And the free inode cache is by design stored in the fs/subvolume tree,
> where it is no different from a normal inode, except that it has no
> INODE_REF and its ino is FREE_INO.
> (Much the same goes for the space cache inode.)
> 
> The correct solution for caching free space and inodes would be the new
> space cache tree, which puts all this info into its own tree and never
> affects the existing trees.

Agreed.

> The only good news is that inode_cache is not commonly used and IIRC
> has bugs. Maybe it would be a good idea to deprecate the option?

I wouldn't object to that.

-Jeff

>>> ---
>>>  fs/btrfs/transaction.c | 25 ++++++++++++++++++-------
>>>  1 file changed, 18 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index 6b3e0fc2fe7a..1ff3ec797356 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -2130,13 +2130,6 @@ int btrfs_commit_transaction(struct
>>> btrfs_trans_handle *trans)
>>>          goto scrub_continue;
>>>      }
>>>
>>> -    /* Reocrd old roots for later qgroup accounting */
>>> -    ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>>> -    if (ret) {
>>> -        mutex_unlock(&fs_info->reloc_mutex);
>>> -        goto scrub_continue;
>>> -    }
>>> -
>>>      /*
>>>       * make sure none of the code above managed to slip in a
>>>       * delayed item
>>> @@ -2179,6 +2172,24 @@ int btrfs_commit_transaction(struct
>>> btrfs_trans_handle *trans)
>>>      btrfs_free_log_root_tree(trans, fs_info);
>>>
>>>      /*
>>> +     * commit_fs_roots() can call btrfs_save_ino_cache(), which
>>> generates
>>> +     * new delayed refs. Must handle them or qgroup can be wrong.
>>> +     */
>>> +    ret = btrfs_run_delayed_refs(trans, fs_info, (unsigned long)-1);
>>> +    if (ret) {
>>> +        mutex_unlock(&fs_info->tree_log_mutex);
>>> +        mutex_unlock(&fs_info->reloc_mutex);
>>> +        goto scrub_continue;
>>> +    }
>>> +
>>> +    ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>>> +    if (ret) {
>>> +        mutex_unlock(&fs_info->tree_log_mutex);
>>> +        mutex_unlock(&fs_info->reloc_mutex);
>>> +        goto scrub_continue;
>>> +    }
>>> +
>>> +    /*
>>>       * Since fs roots are all committed, we can get a quite accurate
>>>       * new_roots. So let's do quota accounting.
>>>       */
>>>
>>
>>
> 
> 
> 


-- 
Jeff Mahoney
SUSE Labs



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11
  2017-03-06  8:08 ` [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
  2017-03-06 16:44   ` David Sterba
@ 2017-03-08 14:24   ` Jeff Mahoney
  2017-03-14 13:17     ` David Sterba
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff Mahoney @ 2017-03-08 14:24 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, dsterba, clm


[-- Attachment #1.1: Type: text/plain, Size: 4725 bytes --]

On 3/6/17 3:08 AM, Qu Wenruo wrote:
> Any response?
> 
> These patches have been here for at least two kernel releases.
> They are all bug fixes, and they fix bugs that were already reported.
> 
> I don't see any reason why they should be delayed for so long.

I'll work my way through as I can.  I posted some review earlier today.

Man, I wish we could just get rid of the hierarchical part of qgroups. :)

-Jeff

> Thanks,
> Qu
> 
> At 02/27/2017 03:10 PM, Qu Wenruo wrote:
>> Pull request can be fetched from my github:
>> https://github.com/adam900710/linux.git qgroup_fixes_for_4.11
>>
>> The base is 6288d6eabc7505f42dda34a2c2962f91914be3a4.
>> Author: Liu Bo <bo.li.liu@oracle.com>
>> Date:   Tue Feb 21 12:12:58 2017 -0800
>>
>>     Btrfs: use the correct type when creating cow dio extent
>>
>> These patches have been on the mailing list for almost 3 months;
>> the tracepoint patch, for example, was last updated on 2016/12/09.
>>
>> With this patchset, btrfs now passes all xfstests qgroup tests.
>>
>> This pull request should be good for 4.11, as these are all bug fixes
>> that have already been delayed several times.
>>
>> I don't know whether this patchset will be delayed yet again because
>> someone wants to clean up something else and the resulting rebase
>> conflicts push these fixes back.
>> But I must say it's very frustrating to see bug fixes get dropped again
>> and again just because of new cleanups and a lack of reviewers.
>>
>> Those complaints aside, this pull request includes:
>> 1) Fix for the inode_cache mount option
>>    Although no one really cares about the inode_cache mount option, it
>>    has been screwing up qgroup numbers for a long time.
>>    It breaks not only qgroup tests that use fixed golden output, but
>>    also new test cases that use btrfsck to verify qgroups.
>>
>> 2) Fix for btrfs/104 kernel warning
>>    This is caused by enabling quota while dirty buffers have not yet
>>    been written to disk.
>>
>>    Fixed by checking the EXTENT_QGROUP_RESERVED flag rather than just
>>    decreasing qgroup->reserved.
>>
>> 3) Fix qgroup underflow caused by freeing ranges not reserved by caller
>>    Mainly exposed by Chandan on PPC64.
>>
>>    It's possible for a buffered write to be blocked while reserving
>>    metadata; in the error path we then release the reserved space for
>>    both data and metadata.
>>
>>    However, the data range may already have been reserved by another
>>    process writing the same page.
>>
>>    In that case, the reserved data space can be freed by two processes,
>>    one in the error path and one in the normal write routine, causing
>>    underflow.
>>
>>    Fixed by checking whether the data range was reserved by ourselves,
>>    and only freeing it in that case.
>>
>> Update since 2016/12/09:
>>   Rebased to latest for-linux-4.11.
>>
>>   Add missing reviewed-by and tested-by tags.
>>
>>   Add more comments to btrfs_qgroup_reserve_data() about error handling.
>>
>>   Add more comments to qgroup_free_reserved_data() for a later
>>   enhancement (comments only, no functional change).
>>
>> Qu Wenruo (9):
>>   btrfs: qgroup: Add trace point for qgroup reserved space
>>   btrfs: qgroup: Re-arrange tracepoint timing to co-operate with
>>     reserved space tracepoint
>>   btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount
>>     option
>>   btrfs: qgroup: Add quick exit for non-fs extents
>>   btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
>>   btrfs: qgroup: Return actually freed bytes for qgroup release or free
>>     data
>>   btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered
>>     write and quota enable
>>   btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
>>   btrfs: qgroup: Fix qgroup reserved space underflow by only freeing
>>     reserved ranges
>>
>>  fs/btrfs/ctree.h             |  12 ++-
>>  fs/btrfs/extent-tree.c       |  31 +++---
>>  fs/btrfs/extent_io.h         |  14 ++-
>>  fs/btrfs/file.c              |  46 +++++----
>>  fs/btrfs/inode-map.c         |   6 +-
>>  fs/btrfs/inode.c             |  64 +++++++++----
>>  fs/btrfs/ioctl.c             |  11 ++-
>>  fs/btrfs/qgroup.c            | 221
>> ++++++++++++++++++++++++++++++++-----------
>>  fs/btrfs/qgroup.h            |  14 +--
>>  fs/btrfs/relocation.c        |  13 ++-
>>  fs/btrfs/transaction.c       |  21 ++--
>>  include/trace/events/btrfs.h |  43 +++++++++
>>  12 files changed, 358 insertions(+), 138 deletions(-)
>>
> 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Jeff Mahoney
SUSE Labs


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/9] btrfs: qgroup: Add trace point for qgroup reserved space
  2017-03-07 16:30   ` Jeff Mahoney
@ 2017-03-09  6:02     ` Qu Wenruo
  0 siblings, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2017-03-09  6:02 UTC (permalink / raw)
  To: Jeff Mahoney, linux-btrfs, dsterba, clm



At 03/08/2017 12:30 AM, Jeff Mahoney wrote:
> On 2/27/17 2:10 AM, Qu Wenruo wrote:
>> Introduce the following trace points:
>> qgroup_update_reserve
>> qgroup_meta_reserve
>>
>> These trace points are handy to trace qgroup reserve space related
>> problems.
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>  fs/btrfs/qgroup.c            | 15 +++++++++++++++
>>  include/trace/events/btrfs.h | 43 +++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 58 insertions(+)
>>
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index a5da750c1087..b303d4794026 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -1075,6 +1075,8 @@ static int __qgroup_excl_accounting(struct btrfs_fs_info *fs_info,
>>  	qgroup->excl += sign * num_bytes;
>>  	qgroup->excl_cmpr += sign * num_bytes;
>>  	if (sign > 0) {
>> +		trace_qgroup_update_reserve(fs_info, qgroup->qgroupid,
>> +					    qgroup->reserved, -(s64)num_bytes);
>>  		if (WARN_ON(qgroup->reserved < num_bytes))
>>  			report_reserved_underflow(fs_info, qgroup, num_bytes);
>>  		else
>> @@ -1100,6 +1102,9 @@ static int __qgroup_excl_accounting(struct btrfs_fs_info *fs_info,
>>  		WARN_ON(sign < 0 && qgroup->excl < num_bytes);
>>  		qgroup->excl += sign * num_bytes;
>>  		if (sign > 0) {
>> +			trace_qgroup_update_reserve(fs_info, qgroup->qgroupid,
>> +						    qgroup->reserved,
>> +						    -(s64)num_bytes);
>>  			if (WARN_ON(qgroup->reserved < num_bytes))
>>  				report_reserved_underflow(fs_info, qgroup,
>>  							  num_bytes);
>> @@ -2424,6 +2429,8 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce)
>>
>>  		qg = unode_aux_to_qgroup(unode);
>>
>> +		trace_qgroup_update_reserve(fs_info, qg->qgroupid,
>> +					    qg->reserved, num_bytes);
>>  		qg->reserved += num_bytes;
>>  	}
>>
>> @@ -2469,6 +2476,8 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
>>
>>  		qg = unode_aux_to_qgroup(unode);
>>
>> +		trace_qgroup_update_reserve(fs_info, qg->qgroupid,
>> +					    qg->reserved, -(s64)num_bytes);
>>  		if (WARN_ON(qg->reserved < num_bytes))
>>  			report_reserved_underflow(fs_info, qg, num_bytes);
>>  		else
>> @@ -2945,6 +2954,8 @@ int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
>>  		return 0;
>>
>>  	BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
>> +	trace_qgroup_meta_reserve(fs_info, root->objectid,
>> +				  (s64)num_bytes);
>>  	ret = qgroup_reserve(root, num_bytes, enforce);
>>  	if (ret < 0)
>>  		return ret;
>> @@ -2964,6 +2975,8 @@ void btrfs_qgroup_free_meta_all(struct btrfs_root *root)
>>  	reserved = atomic_xchg(&root->qgroup_meta_rsv, 0);
>>  	if (reserved == 0)
>>  		return;
>> +	trace_qgroup_meta_reserve(root->fs_info, root->objectid,
>> +				  -(s64)reserved);
>>  	btrfs_qgroup_free_refroot(fs_info, root->objectid, reserved);
>>  }
>>
>> @@ -2978,6 +2991,8 @@ void btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes)
>>  	BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
>>  	WARN_ON(atomic_read(&root->qgroup_meta_rsv) < num_bytes);
>>  	atomic_sub(num_bytes, &root->qgroup_meta_rsv);
>> +	trace_qgroup_meta_reserve(root->fs_info, root->objectid,
>> +				  -(s64)num_bytes);
>>  	btrfs_qgroup_free_refroot(fs_info, root->objectid, num_bytes);
>>  }
>>
>> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
>> index a3c3cab643a9..2d5799d8a4a2 100644
>> --- a/include/trace/events/btrfs.h
>> +++ b/include/trace/events/btrfs.h
>> @@ -1475,6 +1475,49 @@ TRACE_EVENT(qgroup_update_counters,
>>  		  __entry->cur_new_count)
>>  );
>>
>> +TRACE_EVENT(qgroup_update_reserve,
>> +
>> +	TP_PROTO(struct btrfs_fs_info *fs_info, u64 qgid, u64 cur_reserved,
>> +		 s64 diff),
>> +
>> +	TP_ARGS(fs_info, qgid, cur_reserved, diff),
>> +
>> +	TP_STRUCT__entry_btrfs(
>> +		__field(	u64,	qgid			)
>> +		__field(	u64,	cur_reserved		)
>> +		__field(	s64,	diff			)
>> +	),
>> +
>> +	TP_fast_assign_btrfs(fs_info,
>> +		__entry->qgid		= qgid;
>> +		__entry->cur_reserved	= cur_reserved;
>> +		__entry->diff		= diff;
>> +	),
>> +
>> +	TP_printk_btrfs("qgid = %llu, cur_reserved = %llu, diff = %lld",
>> +		__entry->qgid, __entry->cur_reserved, __entry->diff)
>> +);
>> +
>
> This is always called using a qgroup pointer.  Why not just pass it and
> let the the assignment clause grab the values?

Nice idea, I'll update the parameters along with the output format to 
match the latest standard.

Thanks for the review,
Qu
>
>> +TRACE_EVENT(qgroup_meta_reserve,
>> +
>> +	TP_PROTO(struct btrfs_fs_info *fs_info, u64 refroot, s64 diff),
>> +
>> +	TP_ARGS(fs_info, refroot, diff),
>> +
>> +	TP_STRUCT__entry_btrfs(
>> +		__field(	u64,	refroot			)
>> +		__field(	s64,	diff			)
>> +	),
>> +
>> +	TP_fast_assign_btrfs(fs_info,
>> +		__entry->refroot	= refroot;
>> +		__entry->diff		= diff;
>> +	),
>> +
>> +	TP_printk_btrfs("refroot = %llu, diff = %lld",
>> +		__entry->refroot, __entry->diff)
>> +);
>
> Same here, but with btrfs_root * instead.
>
> -Jeff
>
>> +
>>  #endif /* _TRACE_BTRFS_H */
>>
>>  /* This part must be outside protection */
>>
>
>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11
  2017-03-08 14:24   ` Jeff Mahoney
@ 2017-03-14 13:17     ` David Sterba
  0 siblings, 0 replies; 25+ messages in thread
From: David Sterba @ 2017-03-14 13:17 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: Qu Wenruo, linux-btrfs, dsterba, clm

On Wed, Mar 08, 2017 at 09:24:05AM -0500, Jeff Mahoney wrote:
> On 3/6/17 3:08 AM, Qu Wenruo wrote:
> > Any response?
> > 
> > These patches have been here for at least two kernel releases.
> > They are all bug fixes, and they fix bugs that were already reported.
> > 
> > I don't see any reason why they should be delayed for so long.
> 
> I'll work my way through as I can.  I posted some review earlier today.
> 
> Man, I wish we could just get rid of the hierarchical part of qgroups. :)

It does complicate things, no doubt about that, but it gives us a nice
use case: we can account the space of a subvolume and all of its
snapshots. The subvolumes can be freely grouped and accounted further;
a particular example is user home directories with snapshots. There is
also a use case for containers, where each root could replicate the
same subvolume structure and the accounting can be done per container
and/or per group of containers.

IMHO there's not much difference code-wise whether we allow more than
one level of the hierarchy, so it's basically flat vs. hierarchical.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11
  2017-03-08  0:28   ` Qu Wenruo
@ 2017-03-14 13:32     ` David Sterba
  0 siblings, 0 replies; 25+ messages in thread
From: David Sterba @ 2017-03-14 13:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs, dsterba, clm

On Wed, Mar 08, 2017 at 08:28:21AM +0800, Qu Wenruo wrote:
> >> But I must say, that's very frustrating to see bug fixes just get dropped
> >> again and again just due to new cleanups and lack of reviewers.
> >
> > That's life of a kernel developer.
> 
> Right, that's life.
> 
> That's also why I like btrfs-progs a little more.

But you have to deal with the same person there (me), no? :) But yeah,
userspace programming is considered easier in many respects.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-03-14 13:33 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-27  7:10 [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
2017-02-27  7:10 ` [PATCH 1/9] btrfs: qgroup: Add trace point for qgroup reserved space Qu Wenruo
2017-03-07 15:19   ` David Sterba
2017-03-07 16:30   ` Jeff Mahoney
2017-03-09  6:02     ` Qu Wenruo
2017-02-27  7:10 ` [PATCH 2/9] btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint Qu Wenruo
2017-03-07 19:11   ` Jeff Mahoney
2017-02-27  7:10 ` [PATCH 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option Qu Wenruo
2017-03-07 19:21   ` Jeff Mahoney
2017-03-08  0:36     ` Qu Wenruo
2017-03-08 14:23       ` Jeff Mahoney
2017-02-27  7:10 ` [PATCH 4/9] btrfs: qgroup: Add quick exit for non-fs extents Qu Wenruo
2017-02-27  7:10 ` [PATCH 5/9] btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function Qu Wenruo
2017-02-27  7:10 ` [PATCH 6/9] btrfs: qgroup: Return actually freed bytes for qgroup release or free data Qu Wenruo
2017-02-27  7:10 ` [PATCH 7/9] btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quota enable Qu Wenruo
2017-02-27  7:10 ` [PATCH 8/9] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions Qu Wenruo
2017-02-27  7:10 ` [PATCH 9/9] btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges Qu Wenruo
2017-03-06  8:08 ` [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11 Qu Wenruo
2017-03-06 16:44   ` David Sterba
2017-03-06 17:56     ` Chris Mason
2017-03-08 14:24   ` Jeff Mahoney
2017-03-14 13:17     ` David Sterba
2017-03-07 15:13 ` David Sterba
2017-03-08  0:28   ` Qu Wenruo
2017-03-14 13:32     ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.