* [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead
@ 2019-01-15  8:15 Qu Wenruo
  2019-01-15  8:15 ` [PATCH v4 1/7] btrfs: qgroup: Move reserved data account from btrfs_delayed_ref_head to btrfs_qgroup_extent_record Qu Wenruo
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: Qu Wenruo @ 2019-01-15  8:15 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from github:
https://github.com/adam900710/linux/tree/qgroup_delayed_subtree

Which is based on v5.0-rc1.

This patchset addresses the heavy subtree scan load by delaying the scan
until we're going to modify the swapped tree block.

The overall workflow is:

1) Record the subtree root block that gets swapped.

   During subtree swap:
   O = Old tree blocks
   N = New tree blocks
         reloc tree                         file tree X
            Root                               Root
           /    \                             /    \
         NA     OB                          OA      OB
       /  |     |  \                      /  |      |  \
     NC  ND     OE  OF                   OC  OD     OE  OF

  In this case, NA and OA are going to be swapped; record (NA, OA) into
  file tree X.

2) After subtree swap.
         reloc tree                         file tree X
            Root                               Root
           /    \                             /    \
         OA     OB                          NA      OB
       /  |     |  \                      /  |      |  \
     OC  OD     OE  OF                   NC  ND     OE  OF

3a) CoW happens for OB
    If we are going to CoW tree block OB, we check OB's bytenr against
    tree X's swapped_blocks structure.
    It doesn't match any record, so nothing will happen.

3b) CoW happens for NA
    Check NA's bytenr against tree X's swapped_blocks, and get a hit.
    Then we do a subtree scan on both subtrees OA and NA.
    This results in 6 tree blocks being scanned (OA, OC, OD, NA, NC, ND).
    (A small sketch of this check follows the workflow.)

    Then, no matter what we do to file tree X, the qgroup numbers will
    still be correct.
    NA's record then gets removed from X's swapped_blocks.

4)  Transaction commit
    Any remaining record in X's swapped_blocks gets removed; since there was
    no modification to the swapped subtrees, there is no need to trigger a
    heavy qgroup subtree rescan for them.
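
To make step 3 concrete, below is a minimal, self-contained C sketch of the
bytenr lookup done at CoW time. It models the per-level swapped-block records
with plain arrays instead of the rb-trees of btrfs_qgroup_swapped_block
records that patch 5 adds to btrfs_root, so the struct/field names and the
bytenr values here are made up for illustration only, not the kernel code:

  /* Simplified userspace model of the swapped-blocks check at CoW time. */
  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  #define MAX_LEVEL 8     /* stand-in for BTRFS_MAX_LEVEL */

  struct swapped_block {
          uint64_t subv_bytenr;   /* root of the swapped-in subtree (NA) */
          uint64_t reloc_bytenr;  /* root of the swapped-out subtree (OA) */
  };

  struct swapped_blocks {
          const struct swapped_block *blocks[MAX_LEVEL];
          size_t count[MAX_LEVEL];
  };

  /* Step 3: is @bytenr a recorded swapped subtree root at @level? */
  static const struct swapped_block *
  find_swapped(const struct swapped_blocks *sb, int level, uint64_t bytenr)
  {
          for (size_t i = 0; i < sb->count[level]; i++)
                  if (sb->blocks[level][i].subv_bytenr == bytenr)
                          return &sb->blocks[level][i];
          return NULL;    /* case 3a: not a swapped root, nothing to do */
  }

  int main(void)
  {
          /* One record, (NA, OA) at level 1, matching the diagram above. */
          const struct swapped_block level1[] = { { 40960, 20480 } };
          struct swapped_blocks sb = { .blocks = { [1] = level1 },
                                       .count  = { [1] = 1 } };

          /* Case 3b: CoW of NA hits the record, so both the OA and NA
           * subtrees get traced, then the record is dropped. */
          printf("CoW 40960: %s\n",
                 find_swapped(&sb, 1, 40960) ? "trace OA+NA" : "no-op");
          /* Case 3a: CoW of OB never matches anything. */
          printf("CoW 12288: %s\n",
                 find_swapped(&sb, 1, 12288) ? "trace" : "no-op");
          return 0;
  }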

[[Benchmark]] (*)
Hardware:
	VM 4G vRAM, 8 vCPUs,
	disk is using 'unsafe' cache mode,
	backing device is SAMSUNG 850 evo SSD.
	Host has 16G ram.

Mkfs parameter:
	--nodesize 4K (To bump up tree size)

Initial subvolume contents:
	4G data copied from /usr and /lib.
	(With enough regular small files)

Snapshots:
	16 snapshots of the original subvolume.
	each snapshot has 3 random files modified.

balance parameter:
	-m

So the content should be pretty similar to a real world root fs layout.

And after file system population, there is no other activity, so it
should be the best case scenario.

                     | v4.20-rc1            | w/ patchset    | diff
-----------------------------------------------------------------------
relocated extents    | 22615                | 22457          | -0.7%
qgroup dirty extents | 163457               | 121606         | -25.6%
time (sys)           | 22.884s              | 18.842s        | -17.6%
time (real)          | 27.724s              | 22.884s        | -17.5%

*: Due to a bug in v5.0-rc1, balancing metadata with snapshots is
unacceptably slow even with quota disabled. So the result is from
v4.20-rc1.

changelog:
v2:
- Rebase to v4.20-rc1.

- Instead of committing a transaction after each reloc tree merge, delay it
  until merge_reloc_roots() finishes.
  This provides a more natural behavior and reduces unnecessary transaction
  commits.

v3:
- Fix the backref walk deadlock by not triggering it at all.
  This also removes the need for the @exec_post refactor, and replaces that
  patch with one allowing @old_root to be unpopulated.

- Include the patch that fixes the unexpected data rsv free.

v3.1:
- Rebased to v4.20-rc1.
  Minor conflicts with some cleanup code.

v4:
- Rename members from "file_*" to "subv_*".
  Members like "file_bytenr" are pretty confusing; renaming them to
  "subv_bytenr" avoids the confusion.

- Use btrfs_root::reloc_dirty_list to replace dynamic memory allocation.
  One less point of failure, and no need to worry about GFP_KERNEL/NOFS.
  Furthermore, it's easier to manipulate a list than an rb tree.


Qu Wenruo (7):
  btrfs: qgroup: Move reserved data account from btrfs_delayed_ref_head
    to btrfs_qgroup_extent_record
  btrfs: qgroup: Don't trigger backref walk at delayed ref insert time
  btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots()
  btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swap()
  btrfs: qgroup: Introduce per-root swapped blocks infrastructure
  btrfs: qgroup: Use delayed subtree rescan for balance
  btrfs: qgroup: Cleanup old subtree swap code

 fs/btrfs/ctree.c             |   8 +
 fs/btrfs/ctree.h             |  29 +++
 fs/btrfs/delayed-ref.c       |  39 +---
 fs/btrfs/delayed-ref.h       |  11 --
 fs/btrfs/disk-io.c           |   2 +
 fs/btrfs/extent-tree.c       |   3 -
 fs/btrfs/qgroup.c            | 356 ++++++++++++++++++++++++-----------
 fs/btrfs/qgroup.h            | 157 ++++++++++-----
 fs/btrfs/relocation.c        | 100 +++++++---
 fs/btrfs/transaction.c       |   1 +
 include/trace/events/btrfs.h |  29 ---
 11 files changed, 486 insertions(+), 249 deletions(-)

-- 
2.20.1



* [PATCH v4 1/7] btrfs: qgroup: Move reserved data account from btrfs_delayed_ref_head to btrfs_qgroup_extent_record
  2019-01-15  8:15 [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Qu Wenruo
@ 2019-01-15  8:15 ` Qu Wenruo
  2019-01-15  8:15 ` [PATCH v4 2/7] btrfs: qgroup: Don't trigger backref walk at delayed ref insert time Qu Wenruo
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2019-01-15  8:15 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
Test case btrfs/139 will fail with a pretty high probability if the testing
machine (VM) only has 2G of RAM.

The final write succeeds while it should fail due to EDQUOT, and the
resulting fs has its quota exceeding the limit by 16K.

The simplified reproducer is as follows (needs a VM with 2G of RAM):

  mkfs.btrfs -f $dev
  mount $dev $mnt

  btrfs subv create $mnt/subv
  btrfs quota enable $mnt
  btrfs quota rescan -w $mnt
  btrfs qgroup limit -e 1G $mnt/subv

  for i in $(seq -w  1 8); do
  	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
  	echo "file $i written" > /dev/kmsg
  done
  sync
  btrfs qgroup show -pcre --raw $mnt

The last pwrite will not trigger EDQUOT and final qgroup show will show
something like:

  qgroupid         rfer         excl     max_rfer     max_excl parent  child
  --------         ----         ----     --------     -------- ------  -----
  0/5             16384        16384         none         none ---     ---
  0/256      1073758208   1073758208         none   1073741824 ---     ---

And 1073758208 is larger than the limit of 1073741824.

[CAUSE]
It's a bug in btrfs qgroup data reserved space management.

For quota limit, we must ensure that:
  reserved (data + metadata) + rfer/excl <= limit

Since rfer/excl is only updated at transaction commit time, reserved
space needs to be handled with special care.

One important part of reserved space is data, and for a new data extent
written to disk, we still need to keep the reserved space until the
rfer/excl numbers get updated.

Originally, when an ordered extent finishes, we migrate the reserved
qgroup data space from the extent_io tree to the delayed ref head of the
data extent, expecting the delayed ref to only be cleaned up at
transaction commit time.

However, on a machine with little RAM, dirty pages can be flushed back to
disk due to memory pressure without committing a transaction.

The related events will look something like:

  BTRFS info (device dm-3): has skinny extents
  file 1 written
  btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
  btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
  btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
  btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
  btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
  cleanup_ref_head: num_bytes=54947840
  cleanup_ref_head: num_bytes=5636096
  cleanup_ref_head: num_bytes=569344
  cleanup_ref_head: num_bytes=57344
  cleanup_ref_head: num_bytes=8192
  ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
  file 2 written
  ...
  file 8 written
  cleanup_ref_head: num_bytes=8192
  ...
  btrfs_commit_transaction  <<< the only transaction committed during
				the test

When file 2 is written, we have already freed the 128M of reserved qgroup
data space for ino 258. Thus later writes won't trigger EDQUOT.

This allows us to write more data beyond the qgroup limit.

In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.
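
To put rough numbers on the above (taken from the reproducer; the metadata
part of the reservation is ignored for simplicity):

  expected: each 128M pwrite keeps its data reservation until the commit,
            so the reservations accumulate (8 * 128M = 1024M) and the 8th
            write should fail the "reserved + rfer/excl <= limit" check
            with EDQUOT
  actual:   each file's 128M reservation is freed by cleanup_ref_head()
            long before any commit, and rfer is only updated at the single
            final commit, so nothing ever sees the limit being crossed;
            all 8 writes succeed and rfer ends up at 1073758208, 16K over
            the 1G limit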

[FIX]
By moving the reserved qgroup data space from btrfs_delayed_ref_head to
btrfs_qgroup_extent_record, we can ensure that the reserved qgroup data
space won't be freed halfway through before the transaction commits, thus
fixing the problem.

Fixes: f64d5ca86821 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/delayed-ref.c       | 15 ++++-----------
 fs/btrfs/delayed-ref.h       | 11 -----------
 fs/btrfs/extent-tree.c       |  3 ---
 fs/btrfs/qgroup.c            | 19 +++++++++++++++----
 fs/btrfs/qgroup.h            | 20 +++++++++++---------
 include/trace/events/btrfs.h | 29 -----------------------------
 6 files changed, 30 insertions(+), 67 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index cad36c99a483..7d2a413df90d 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -602,17 +602,14 @@ static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
 	RB_CLEAR_NODE(&head_ref->href_node);
 	head_ref->processing = 0;
 	head_ref->total_ref_mod = count_mod;
-	head_ref->qgroup_reserved = 0;
-	head_ref->qgroup_ref_root = 0;
 	spin_lock_init(&head_ref->lock);
 	mutex_init(&head_ref->mutex);
 
 	if (qrecord) {
 		if (ref_root && reserved) {
-			head_ref->qgroup_ref_root = ref_root;
-			head_ref->qgroup_reserved = reserved;
+			qrecord->data_rsv = reserved;
+			qrecord->data_rsv_refroot = ref_root;
 		}
-
 		qrecord->bytenr = bytenr;
 		qrecord->num_bytes = num_bytes;
 		qrecord->old_roots = NULL;
@@ -651,10 +648,6 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 	existing = htree_insert(&delayed_refs->href_root,
 				&head_ref->href_node);
 	if (existing) {
-		WARN_ON(qrecord && head_ref->qgroup_ref_root
-			&& head_ref->qgroup_reserved
-			&& existing->qgroup_ref_root
-			&& existing->qgroup_reserved);
 		update_existing_head_ref(trans, existing, head_ref,
 					 old_ref_mod);
 		/*
@@ -770,7 +763,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 
 	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
 	    is_fstree(ref_root)) {
-		record = kmalloc(sizeof(*record), GFP_NOFS);
+		record = kzalloc(sizeof(*record), GFP_NOFS);
 		if (!record) {
 			kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
 			kmem_cache_free(btrfs_delayed_ref_head_cachep, head_ref);
@@ -867,7 +860,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 
 	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
 	    is_fstree(ref_root)) {
-		record = kmalloc(sizeof(*record), GFP_NOFS);
+		record = kzalloc(sizeof(*record), GFP_NOFS);
 		if (!record) {
 			kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
 			kmem_cache_free(btrfs_delayed_ref_head_cachep,
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index d2af974f68a1..70606da440aa 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -102,17 +102,6 @@ struct btrfs_delayed_ref_head {
 	 */
 	int ref_mod;
 
-	/*
-	 * For qgroup reserved space freeing.
-	 *
-	 * ref_root and reserved will be recorded after
-	 * BTRFS_ADD_DELAYED_EXTENT is called.
-	 * And will be used to free reserved qgroup space at
-	 * run_delayed_refs() time.
-	 */
-	u64 qgroup_ref_root;
-	u64 qgroup_reserved;
-
 	/*
 	 * when a new extent is allocated, it is just reserved in memory
 	 * The actual extent isn't inserted into the extent allocation tree
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b15afeae16df..208335a2aa29 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2494,9 +2494,6 @@ static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	/* Also free its reserved qgroup space */
-	btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
-				      head->qgroup_reserved);
 	btrfs_delayed_refs_rsv_release(fs_info, nr_items);
 }
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4e473a998219..f214b490d80c 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1546,12 +1546,18 @@ int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
 		parent_node = *p;
 		entry = rb_entry(parent_node, struct btrfs_qgroup_extent_record,
 				 node);
-		if (bytenr < entry->bytenr)
+		if (bytenr < entry->bytenr) {
 			p = &(*p)->rb_left;
-		else if (bytenr > entry->bytenr)
+		} else if (bytenr > entry->bytenr) {
 			p = &(*p)->rb_right;
-		else
+		} else {
+			if (record->data_rsv && !entry->data_rsv) {
+				entry->data_rsv = record->data_rsv;
+				entry->data_rsv_refroot =
+					record->data_rsv_refroot;
+			}
 			return 1;
+		}
 	}
 
 	rb_link_node(&record->node, parent_node, p);
@@ -1597,7 +1603,7 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)
 	    || bytenr == 0 || num_bytes == 0)
 		return 0;
-	record = kmalloc(sizeof(*record), gfp_flag);
+	record = kzalloc(sizeof(*record), gfp_flag);
 	if (!record)
 		return -ENOMEM;
 
@@ -2576,6 +2582,11 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 					goto cleanup;
 			}
 
+			/* Free the reserved data space */
+			btrfs_qgroup_free_refroot(fs_info,
+					record->data_rsv_refroot,
+					record->data_rsv,
+					BTRFS_QGROUP_RSV_DATA);
 			/*
 			 * Use SEQ_LAST as time_seq to do special search, which
 			 * doesn't lock tree or delayed_refs and search current
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 20c6bd5fa701..d4fae53969d4 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -45,6 +45,17 @@ struct btrfs_qgroup_extent_record {
 	struct rb_node node;
 	u64 bytenr;
 	u64 num_bytes;
+
+	/*
+	 * For qgroup reserved data space freeing.
+	 *
+	 * @data_rsv_refroot and @data_rsv will be recorded after
+	 * BTRFS_ADD_DELAYED_EXTENT is called.
+	 * And will be used to free reserved qgroup space at
+	 * transaction commit time.
+	 */
+	u32 data_rsv;		/* reserved data space needs to be freed */
+	u64 data_rsv_refroot;	/* which root the reserved data belongs to */
 	struct ulist *old_roots;
 };
 
@@ -252,15 +263,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
 			       u64 ref_root, u64 num_bytes,
 			       enum btrfs_qgroup_rsv_type type);
-static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info *fs_info,
-						 u64 ref_root, u64 num_bytes)
-{
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
-		return;
-	trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
-	btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes,
-				  BTRFS_QGROUP_RSV_DATA);
-}
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 2887503e4d12..43d3ee1a6544 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1512,35 +1512,6 @@ DEFINE_EVENT(btrfs__qgroup_rsv_data, btrfs_qgroup_release_data,
 	TP_ARGS(inode, start, len, reserved, op)
 );
 
-DECLARE_EVENT_CLASS(btrfs__qgroup_delayed_ref,
-
-	TP_PROTO(const struct btrfs_fs_info *fs_info,
-		 u64 ref_root, u64 reserved),
-
-	TP_ARGS(fs_info, ref_root, reserved),
-
-	TP_STRUCT__entry_btrfs(
-		__field(	u64,		ref_root	)
-		__field(	u64,		reserved	)
-	),
-
-	TP_fast_assign_btrfs(fs_info,
-		__entry->ref_root	= ref_root;
-		__entry->reserved	= reserved;
-	),
-
-	TP_printk_btrfs("root=%llu reserved=%llu op=free",
-		  __entry->ref_root, __entry->reserved)
-);
-
-DEFINE_EVENT(btrfs__qgroup_delayed_ref, btrfs_qgroup_free_delayed_ref,
-
-	TP_PROTO(const struct btrfs_fs_info *fs_info,
-		 u64 ref_root, u64 reserved),
-
-	TP_ARGS(fs_info, ref_root, reserved)
-);
-
 DECLARE_EVENT_CLASS(btrfs_qgroup_extent,
 	TP_PROTO(const struct btrfs_fs_info *fs_info,
 		 const struct btrfs_qgroup_extent_record *rec),
-- 
2.20.1



* [PATCH v4 2/7] btrfs: qgroup: Don't trigger backref walk at delayed ref insert time
  2019-01-15  8:15 [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Qu Wenruo
  2019-01-15  8:15 ` [PATCH v4 1/7] btrfs: qgroup: Move reserved data account from btrfs_delayed_ref_head to btrfs_qgroup_extent_record Qu Wenruo
@ 2019-01-15  8:15 ` Qu Wenruo
  2019-01-15  8:16 ` [PATCH v4 3/7] btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots() Qu Wenruo
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2019-01-15  8:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Filipe Manana

[BUG]
Since fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting
time out of commit trans"), the kernel may lock up with quota enabled.

The lockup is triggered by a backref walk during snapshot dropping,
combined with write operations in the source subvolume.
It can be reproduced reliably.

  btrfs-cleaner   D    0  4062      2 0x80000000
  Call Trace:
   schedule+0x32/0x90
   btrfs_tree_read_lock+0x93/0x130 [btrfs]
   find_parent_nodes+0x29b/0x1170 [btrfs]
   btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
   btrfs_find_all_roots+0x57/0x70 [btrfs]
   btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
   btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
   btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
   do_walk_down+0x541/0x5e3 [btrfs]
   walk_down_tree+0xab/0xe7 [btrfs]
   btrfs_drop_snapshot+0x356/0x71a [btrfs]
   btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
   cleaner_kthread+0x12b/0x160 [btrfs]
   kthread+0x112/0x130
   ret_from_fork+0x27/0x50

[CAUSE]
When dropping snapshots with qgroup enabled, we trigger a backref walk.

However such a backref walk at that timing is pretty dangerous: if one of
the parent nodes gets write-locked by another thread, we can run into a
deadlock.

For example:

           FS 260     FS 261 (Dropped)
            node A        node B
           /      \      /      \
       node C      node D      node E
      /   \         /  \        /     \
  leaf F|leaf G|leaf H|leaf I|leaf J|leaf K

The lock sequence would be:

      Thread A (cleaner)             |       Thread B (other writer)
-----------------------------------------------------------------------
write_lock(B)                        |
write_lock(D)                        |
^^^ called by walk_down_tree()       |
                                     |       write_lock(A)
                                     |       write_lock(D) << Stall
read_lock(H) << for backref walk     |
read_lock(D) << lock owner is        |
                the same thread A    |
                so read lock is OK   |
read_lock(A) << Stall                |

So thread A holds write lock D and is waiting for read lock A, while
thread B holds write lock A and is waiting for write lock D.

This is a deadlock.

This is not limited to the snapshot dropping case.
The backref walk, even though it only happens on commit trees, breaks the
normal top-down locking order, which makes it deadlock prone.

[FIX]
The solution is to never trigger the backref walk while holding any write
lock. This means we can't do the backref walk at delayed ref insert time.

So this patch will mostly revert commit fb235dc06fac ("btrfs: qgroup:
Move half of the qgroup accounting time out of commit trans").

Reported-by: David Sterba <dsterba@suse.cz>
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/delayed-ref.c | 24 ++++--------------------
 fs/btrfs/qgroup.c      | 39 +++------------------------------------
 fs/btrfs/qgroup.h      | 30 ++----------------------------
 3 files changed, 9 insertions(+), 84 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 7d2a413df90d..ed42a8305f50 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -625,12 +625,10 @@ static noinline struct btrfs_delayed_ref_head *
 add_delayed_ref_head(struct btrfs_trans_handle *trans,
 		     struct btrfs_delayed_ref_head *head_ref,
 		     struct btrfs_qgroup_extent_record *qrecord,
-		     int action, int *qrecord_inserted_ret,
-		     int *old_ref_mod, int *new_ref_mod)
+		     int action, int *old_ref_mod, int *new_ref_mod)
 {
 	struct btrfs_delayed_ref_head *existing;
 	struct btrfs_delayed_ref_root *delayed_refs;
-	int qrecord_inserted = 0;
 
 	delayed_refs = &trans->transaction->delayed_refs;
 
@@ -639,8 +637,6 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 		if (btrfs_qgroup_trace_extent_nolock(trans->fs_info,
 					delayed_refs, qrecord))
 			kfree(qrecord);
-		else
-			qrecord_inserted = 1;
 	}
 
 	trace_add_delayed_ref_head(trans->fs_info, head_ref, action);
@@ -670,8 +666,6 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 		atomic_inc(&delayed_refs->num_entries);
 		trans->delayed_ref_updates++;
 	}
-	if (qrecord_inserted_ret)
-		*qrecord_inserted_ret = qrecord_inserted;
 	if (new_ref_mod)
 		*new_ref_mod = head_ref->total_ref_mod;
 
@@ -745,7 +739,6 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 	struct btrfs_delayed_ref_head *head_ref;
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct btrfs_qgroup_extent_record *record = NULL;
-	int qrecord_inserted;
 	bool is_system = (ref_root == BTRFS_CHUNK_TREE_OBJECTID);
 	int ret;
 	u8 ref_type;
@@ -794,8 +787,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 	 * the spin lock
 	 */
 	head_ref = add_delayed_ref_head(trans, head_ref, record,
-					action, &qrecord_inserted,
-					old_ref_mod, new_ref_mod);
+					action, old_ref_mod, new_ref_mod);
 
 	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
 	spin_unlock(&delayed_refs->lock);
@@ -812,9 +804,6 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 	if (ret > 0)
 		kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
 
-	if (qrecord_inserted)
-		btrfs_qgroup_trace_extent_post(fs_info, record);
-
 	return 0;
 }
 
@@ -832,7 +821,6 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 	struct btrfs_delayed_ref_head *head_ref;
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct btrfs_qgroup_extent_record *record = NULL;
-	int qrecord_inserted;
 	int ret;
 	u8 ref_type;
 
@@ -881,8 +869,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 	 * the spin lock
 	 */
 	head_ref = add_delayed_ref_head(trans, head_ref, record,
-					action, &qrecord_inserted,
-					old_ref_mod, new_ref_mod);
+					action, old_ref_mod, new_ref_mod);
 
 	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
 	spin_unlock(&delayed_refs->lock);
@@ -899,9 +886,6 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 	if (ret > 0)
 		kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
 
-
-	if (qrecord_inserted)
-		return btrfs_qgroup_trace_extent_post(fs_info, record);
 	return 0;
 }
 
@@ -926,7 +910,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
 	spin_lock(&delayed_refs->lock);
 
 	add_delayed_ref_head(trans, head_ref, NULL, BTRFS_UPDATE_DELAYED_HEAD,
-			     NULL, NULL, NULL);
+			     NULL, NULL);
 
 	spin_unlock(&delayed_refs->lock);
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index f214b490d80c..ba30adac88a7 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1565,33 +1565,6 @@ int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-int btrfs_qgroup_trace_extent_post(struct btrfs_fs_info *fs_info,
-				   struct btrfs_qgroup_extent_record *qrecord)
-{
-	struct ulist *old_root;
-	u64 bytenr = qrecord->bytenr;
-	int ret;
-
-	ret = btrfs_find_all_roots(NULL, fs_info, bytenr, 0, &old_root, false);
-	if (ret < 0) {
-		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
-		btrfs_warn(fs_info,
-"error accounting new delayed refs extent (err code: %d), quota inconsistent",
-			ret);
-		return 0;
-	}
-
-	/*
-	 * Here we don't need to get the lock of
-	 * trans->transaction->delayed_refs, since inserted qrecord won't
-	 * be deleted, only qrecord->node may be modified (new qrecord insert)
-	 *
-	 * So modifying qrecord->old_roots is safe here
-	 */
-	qrecord->old_roots = old_root;
-	return 0;
-}
-
 int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 			      u64 num_bytes, gfp_t gfp_flag)
 {
@@ -1615,11 +1588,9 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 	spin_lock(&delayed_refs->lock);
 	ret = btrfs_qgroup_trace_extent_nolock(fs_info, delayed_refs, record);
 	spin_unlock(&delayed_refs->lock);
-	if (ret > 0) {
+	if (ret > 0)
 		kfree(record);
-		return 0;
-	}
-	return btrfs_qgroup_trace_extent_post(fs_info, record);
+	return 0;
 }
 
 int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
@@ -2569,11 +2540,7 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 		trace_btrfs_qgroup_account_extents(fs_info, record);
 
 		if (!ret) {
-			/*
-			 * Old roots should be searched when inserting qgroup
-			 * extent record
-			 */
-			if (WARN_ON(!record->old_roots)) {
+			if (!record->old_roots) {
 				/* Search commit root to find old_roots */
 				ret = btrfs_find_all_roots(NULL, fs_info,
 						record->bytenr, 0,
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index d4fae53969d4..226956f0bd73 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -174,8 +174,7 @@ struct btrfs_delayed_extent_op;
  * Inform qgroup to trace one dirty extent, its info is recorded in @record.
  * So qgroup can account it at transaction committing time.
  *
- * No lock version, caller must acquire delayed ref lock and allocated memory,
- * then call btrfs_qgroup_trace_extent_post() after exiting lock context.
+ * No lock version, caller must acquire delayed ref lock and allocate memory,
  *
  * Return 0 for success insert
  * Return >0 for existing record, caller can free @record safely.
@@ -186,37 +185,12 @@ int btrfs_qgroup_trace_extent_nolock(
 		struct btrfs_delayed_ref_root *delayed_refs,
 		struct btrfs_qgroup_extent_record *record);
 
-/*
- * Post handler after qgroup_trace_extent_nolock().
- *
- * NOTE: Current qgroup does the expensive backref walk at transaction
- * committing time with TRANS_STATE_COMMIT_DOING, this blocks incoming
- * new transaction.
- * This is designed to allow btrfs_find_all_roots() to get correct new_roots
- * result.
- *
- * However for old_roots there is no need to do backref walk at that time,
- * since we search commit roots to walk backref and result will always be
- * correct.
- *
- * Due to the nature of no lock version, we can't do backref there.
- * So we must call btrfs_qgroup_trace_extent_post() after exiting
- * spinlock context.
- *
- * TODO: If we can fix and prove btrfs_find_all_roots() can get correct result
- * using current root, then we can move all expensive backref walk out of
- * transaction committing, but not now as qgroup accounting will be wrong again.
- */
-int btrfs_qgroup_trace_extent_post(struct btrfs_fs_info *fs_info,
-				   struct btrfs_qgroup_extent_record *qrecord);
-
 /*
  * Inform qgroup to trace one dirty extent, specified by @bytenr and
  * @num_bytes.
  * So qgroup can account it at commit trans time.
  *
- * Better encapsulated version, with memory allocation and backref walk for
- * commit roots.
+ * Better encapsulated version, with memory allocation
  * So this can sleep.
  *
  * Return 0 if the operation is done.
-- 
2.20.1



* [PATCH v4 3/7] btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots()
  2019-01-15  8:15 [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Qu Wenruo
  2019-01-15  8:15 ` [PATCH v4 1/7] btrfs: qgroup: Move reserved data account from btrfs_delayed_ref_head to btrfs_qgroup_extent_record Qu Wenruo
  2019-01-15  8:15 ` [PATCH v4 2/7] btrfs: qgroup: Don't trigger backref walk at delayed ref insert time Qu Wenruo
@ 2019-01-15  8:16 ` Qu Wenruo
  2019-01-22 16:32   ` David Sterba
  2019-01-15  8:16 ` [PATCH v4 4/7] btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swap() Qu Wenruo
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2019-01-15  8:16 UTC (permalink / raw)
  To: linux-btrfs

Relocation code will drop btrfs_root::reloc_root as soon as
merge_reloc_root() finishes.

However later qgroup code will need to access btrfs_root::reloc_root
after merge_reloc_root() for delayed subtree rescan.

So alter the timing of resetting btrfs_root::reloc_root, making it happen
after the transaction commit.

With this patch, we introduce a new btrfs_root::state flag,
BTRFS_ROOT_DEAD_RELOC_TREE, to inform users of btrfs_root::reloc_root that
although btrfs_root::reloc_root is still non-NULL, it is not used any
more.

The lifespan of btrfs_root::reloc_root will become:
          Old behavior            |              New
------------------------------------------------------------------------
btrfs_init_reloc_root()      ---  | btrfs_init_reloc_root()      ---
  set reloc_root              |   |   set reloc_root              |
                              |   |                               |
                              |   |                               |
merge_reloc_root()            |   | merge_reloc_root()            |
|- btrfs_update_reloc_root() ---  | |- btrfs_update_reloc_root() -+-
     clear btrfs_root::reloc_root |      set ROOT_DEAD_RELOC_TREE |
                                  |      record root into dirty   |
                                  |      roots rbtree             |
                                  |                               |
                                  | relocate_block_group() or     |
                                  | btrfs_recover_relocation()    |
                                  | | After transaction commit    |
                                  | |- clean_dirty_subvs()       ---
                                  |     clear btrfs_root::reloc_root

While BTRFS_ROOT_DEAD_RELOC_TREE is set, the only user of
btrfs_root::reloc_root should be qgroup.

To cooperate with this, the btrfs_drop_snapshot() call on the reloc tree
is also delayed until clean_dirty_subvs().

This patch will increase the size of btrfs_root by 16 bytes (one struct
list_head, i.e. two pointers on 64-bit).

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h      | 15 ++++++++
 fs/btrfs/disk-io.c    |  1 +
 fs/btrfs/relocation.c | 85 ++++++++++++++++++++++++++++++++++---------
 3 files changed, 84 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0a68cf7032f5..2c374dfb6aec 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1195,6 +1195,13 @@ enum {
 	BTRFS_ROOT_MULTI_LOG_TASKS,
 	BTRFS_ROOT_DIRTY,
 	BTRFS_ROOT_DELETING,
+
+	/*
+	 * Reloc tree is orphan, only kept here for qgroup delayed subtree scan
+	 *
+	 * Set for the subvolume tree owning the reloc tree.
+	 */
+	BTRFS_ROOT_DEAD_RELOC_TREE,
 };
 
 /*
@@ -1307,6 +1314,14 @@ struct btrfs_root {
 	struct list_head ordered_root;
 	u64 nr_ordered_extents;
 
+	/*
+	 * Not empty if this subvolume root has gone through tree block swap
+	 * (relocation)
+	 *
+	 * Will be used by reloc_control::dirty_subv_roots.
+	 */
+	struct list_head reloc_dirty_list;
+
 	/*
 	 * Number of currently running SEND ioctls to prevent
 	 * manipulation with the read-only status via SUBVOL_SETFLAGS
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8da2f380d3c0..bfefa1de0455 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1175,6 +1175,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	INIT_LIST_HEAD(&root->delalloc_root);
 	INIT_LIST_HEAD(&root->ordered_extents);
 	INIT_LIST_HEAD(&root->ordered_root);
+	INIT_LIST_HEAD(&root->reloc_dirty_list);
 	INIT_LIST_HEAD(&root->logged_list[0]);
 	INIT_LIST_HEAD(&root->logged_list[1]);
 	spin_lock_init(&root->inode_lock);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 272b287f8cf0..1d5cfceb46c1 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -162,6 +162,8 @@ struct reloc_control {
 	struct mapping_tree reloc_root_tree;
 	/* list of reloc trees */
 	struct list_head reloc_roots;
+	/* list of subvolume trees who get relocated */
+	struct list_head dirty_subv_roots;
 	/* size of metadata reservation for merging reloc trees */
 	u64 merging_rsv_size;
 	/* size of relocated tree nodes */
@@ -1467,15 +1469,17 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
 	struct btrfs_root_item *root_item;
 	int ret;
 
-	if (!root->reloc_root)
+	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state) ||
+	    !root->reloc_root)
 		goto out;
 
 	reloc_root = root->reloc_root;
 	root_item = &reloc_root->root_item;
 
+	/* root->reloc_root will stay until current relocation finished */
 	if (fs_info->reloc_ctl->merge_reloc_tree &&
 	    btrfs_root_refs(root_item) == 0) {
-		root->reloc_root = NULL;
+		set_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
 		__del_reloc_root(reloc_root);
 	}
 
@@ -2120,6 +2124,58 @@ static int find_next_key(struct btrfs_path *path, int level,
 	return 1;
 }
 
+/*
+ * Helper to insert current subvolume root into reloc_control::dirty_subv_roots
+ */
+static void insert_dirty_subv(struct btrfs_trans_handle *trans,
+			      struct reloc_control *rc, struct btrfs_root *root)
+{
+	struct btrfs_root *reloc_root = root->reloc_root;
+	struct btrfs_root_item *reloc_root_item;
+	u64 root_objectid = root->root_key.objectid;
+
+	/* @root must be a file tree root with a valid reloc tree */
+	ASSERT(root_objectid != BTRFS_TREE_RELOC_OBJECTID);
+	ASSERT(reloc_root);
+
+	reloc_root_item = &reloc_root->root_item;
+	memset(&reloc_root_item->drop_progress, 0,
+		sizeof(reloc_root_item->drop_progress));
+	reloc_root_item->drop_level = 0;
+	btrfs_set_root_refs(reloc_root_item, 0);
+	btrfs_update_reloc_root(trans, root);
+
+	if (list_empty(&root->reloc_dirty_list)) {
+		btrfs_grab_fs_root(root);
+		list_add_tail(&root->reloc_dirty_list, &rc->dirty_subv_roots);
+	}
+	return;
+}
+
+static int clean_dirty_subvs(struct reloc_control *rc)
+{
+	struct btrfs_root *root;
+	struct btrfs_root *next;
+	int err = 0;
+	int ret;
+
+	list_for_each_entry_safe(root, next, &rc->dirty_subv_roots,
+				 reloc_dirty_list) {
+		struct btrfs_root *reloc_root = root->reloc_root;
+
+		clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
+		list_del_init(&root->reloc_dirty_list);
+		root->reloc_root = NULL;
+		if (reloc_root) {
+			ret = btrfs_drop_snapshot(reloc_root, NULL, 0, 1);
+			if (ret < 0 && !err)
+				err = ret;
+		}
+		btrfs_put_fs_root(root);
+	}
+	return err;
+}
+
 /*
  * merge the relocated tree blocks in reloc tree with corresponding
  * fs tree.
@@ -2259,13 +2315,8 @@ static noinline_for_stack int merge_reloc_root(struct reloc_control *rc,
 out:
 	btrfs_free_path(path);
 
-	if (err == 0) {
-		memset(&root_item->drop_progress, 0,
-		       sizeof(root_item->drop_progress));
-		root_item->drop_level = 0;
-		btrfs_set_root_refs(root_item, 0);
-		btrfs_update_reloc_root(trans, root);
-	}
+	if (err == 0)
+		insert_dirty_subv(trans, rc, root);
 
 	if (trans)
 		btrfs_end_transaction_throttle(trans);
@@ -2410,14 +2461,6 @@ void merge_reloc_roots(struct reloc_control *rc)
 		} else {
 			list_del_init(&reloc_root->root_list);
 		}
-
-		ret = btrfs_drop_snapshot(reloc_root, rc->block_rsv, 0, 1);
-		if (ret < 0) {
-			if (list_empty(&reloc_root->root_list))
-				list_add_tail(&reloc_root->root_list,
-					      &reloc_roots);
-			goto out;
-		}
 	}
 
 	if (found) {
@@ -4079,6 +4122,9 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 		goto out_free;
 	}
 	btrfs_commit_transaction(trans);
+	ret = clean_dirty_subvs(rc);
+	if (ret < 0 && !err)
+		err = ret;
 out_free:
 	btrfs_free_block_rsv(fs_info, rc->block_rsv);
 	btrfs_free_path(path);
@@ -4173,6 +4219,7 @@ static struct reloc_control *alloc_reloc_control(void)
 		return NULL;
 
 	INIT_LIST_HEAD(&rc->reloc_roots);
+	INIT_LIST_HEAD(&rc->dirty_subv_roots);
 	backref_cache_init(&rc->backref_cache);
 	mapping_tree_init(&rc->reloc_root_tree);
 	extent_io_tree_init(&rc->processed_blocks, NULL);
@@ -4468,6 +4515,10 @@ int btrfs_recover_relocation(struct btrfs_root *root)
 		goto out_free;
 	}
 	err = btrfs_commit_transaction(trans);
+
+	ret = clean_dirty_subvs(rc);
+	if (ret < 0 && !err)
+		err = ret;
 out_free:
 	kfree(rc);
 out:
-- 
2.20.1



* [PATCH v4 4/7] btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swap()
  2019-01-15  8:15 [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Qu Wenruo
                   ` (2 preceding siblings ...)
  2019-01-15  8:16 ` [PATCH v4 3/7] btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots() Qu Wenruo
@ 2019-01-15  8:16 ` Qu Wenruo
  2019-01-22 16:38   ` David Sterba
  2019-01-15  8:16 ` [PATCH v4 5/7] btrfs: qgroup: Introduce per-root swapped blocks infrastructure Qu Wenruo
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2019-01-15  8:16 UTC (permalink / raw)
  To: linux-btrfs

Refactor btrfs_qgroup_trace_subtree_swap() into
qgroup_trace_subtree_swap(), which only needs two extent buffers plus a
couple of parameters (last_snapshot and trace_leaf) to control the
behavior.

This provides the basis for later delayed subtree scan work.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c | 78 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 57 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index ba30adac88a7..8fe6ebe9aef8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1994,6 +1994,60 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
 	return ret;
 }
 
+static int qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
+				struct extent_buffer *src_eb,
+				struct extent_buffer *dst_eb,
+				u64 last_snapshot, bool trace_leaf)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_path *dst_path = NULL;
+	int level;
+	int ret;
+
+	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+		return 0;
+
+	/* Wrong parameter order */
+	if (btrfs_header_generation(src_eb) > btrfs_header_generation(dst_eb)) {
+		btrfs_err_rl(fs_info,
+		"%s: bad parameter order, src_gen=%llu dst_gen=%llu", __func__,
+			     btrfs_header_generation(src_eb),
+			     btrfs_header_generation(dst_eb));
+		return -EUCLEAN;
+	}
+
+	if (!extent_buffer_uptodate(src_eb) ||
+	    !extent_buffer_uptodate(dst_eb)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	level = btrfs_header_level(dst_eb);
+	dst_path = btrfs_alloc_path();
+	if (!dst_path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* For dst_path */
+	extent_buffer_get(dst_eb);
+	dst_path->nodes[level] = dst_eb;
+	dst_path->slots[level] = 0;
+	dst_path->locks[level] = 0;
+
+	/* Do the generation aware breadth-first search */
+	ret = qgroup_trace_new_subtree_blocks(trans, src_eb, dst_path, level,
+					      level, last_snapshot, trace_leaf);
+	if (ret < 0)
+		goto out;
+	ret = 0;
+
+out:
+	btrfs_free_path(dst_path);
+	if (ret < 0)
+		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
+	return ret;
+}
+
 /*
  * Inform qgroup to trace subtree swap used in balance.
  *
@@ -2019,14 +2073,12 @@ int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
 				u64 last_snapshot)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_path *dst_path = NULL;
 	struct btrfs_key first_key;
 	struct extent_buffer *src_eb = NULL;
 	struct extent_buffer *dst_eb = NULL;
 	bool trace_leaf = false;
 	u64 child_gen;
 	u64 child_bytenr;
-	int level;
 	int ret;
 
 	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
@@ -2077,22 +2129,9 @@ int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
-	level = btrfs_header_level(dst_eb);
-	dst_path = btrfs_alloc_path();
-	if (!dst_path) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	/* For dst_path */
-	extent_buffer_get(dst_eb);
-	dst_path->nodes[level] = dst_eb;
-	dst_path->slots[level] = 0;
-	dst_path->locks[level] = 0;
-
-	/* Do the generation-aware breadth-first search */
-	ret = qgroup_trace_new_subtree_blocks(trans, src_eb, dst_path, level,
-					      level, last_snapshot, trace_leaf);
+	/* Do the generation aware breadth-first search */
+	ret = qgroup_trace_subtree_swap(trans, src_eb, dst_eb, last_snapshot,
+					trace_leaf);
 	if (ret < 0)
 		goto out;
 	ret = 0;
@@ -2100,9 +2139,6 @@ int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
 out:
 	free_extent_buffer(src_eb);
 	free_extent_buffer(dst_eb);
-	btrfs_free_path(dst_path);
-	if (ret < 0)
-		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
 	return ret;
 }
 
-- 
2.20.1



* [PATCH v4 5/7] btrfs: qgroup: Introduce per-root swapped blocks infrastructure
  2019-01-15  8:15 [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Qu Wenruo
                   ` (3 preceding siblings ...)
  2019-01-15  8:16 ` [PATCH v4 4/7] btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swap() Qu Wenruo
@ 2019-01-15  8:16 ` Qu Wenruo
  2019-01-22 16:55   ` David Sterba
  2019-01-15  8:16 ` [PATCH v4 6/7] btrfs: qgroup: Use delayed subtree rescan for balance Qu Wenruo
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2019-01-15  8:16 UTC (permalink / raw)
  To: linux-btrfs

To allow delayed subtree swap rescan, btrfs needs to record per-root
info about which tree blocks get swapped.

So this patch introduces a per-root btrfs_qgroup_swapped_blocks structure,
which records which tree blocks got swapped.

The designed workflow will be:

1) Record the subtree root block that gets swapped.

   During subtree swap:
   O = Old tree blocks
   N = New tree blocks
         reloc tree                         subvolume tree X
            Root                               Root
           /    \                             /    \
         NA     OB                          OA      OB
       /  |     |  \                      /  |      |  \
     NC  ND     OE  OF                   OC  OD     OE  OF

  In this case, NA and OA are going to be swapped; record (NA, OA) into
  subvolume tree X.

2) After subtree swap.
         reloc tree                         subvolume tree X
            Root                               Root
           /    \                             /    \
         OA     OB                          NA      OB
       /  |     |  \                      /  |      |  \
     OC  OD     OE  OF                   NC  ND     OE  OF

3a) CoW happens for OB
    If we are going to CoW tree block OB, we check OB's bytenr against
    tree X's swapped_blocks structure.
    It doesn't match any record, so nothing will happen.

3b) CoW happens for NA
    Check NA's bytenr against tree X's swapped_blocks, and get a hit.
    Then we do a subtree scan on both subtrees OA and NA.
    This results in 6 tree blocks being scanned (OA, OC, OD, NA, NC, ND).

    Then, no matter what we do to subvolume tree X, the qgroup numbers will
    still be correct.
    NA's record then gets removed from X's swapped_blocks.

4)  Transaction commit
    Any remaining record in X's swapped_blocks gets removed; since there was
    no modification to the swapped subtrees, there is no need to trigger a
    heavy qgroup subtree rescan for them.

This will introduce 128 bytes of overhead for each btrfs_root even if
qgroup is not enabled.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h       |  14 +++++
 fs/btrfs/disk-io.c     |   1 +
 fs/btrfs/qgroup.c      | 130 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.h      |  99 +++++++++++++++++++++++++++++++
 fs/btrfs/relocation.c  |   7 +++
 fs/btrfs/transaction.c |   1 +
 6 files changed, 252 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c374dfb6aec..b93ca931af08 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1204,6 +1204,17 @@ enum {
 	BTRFS_ROOT_DEAD_RELOC_TREE,
 };
 
+/*
+ * Record swapped tree blocks of a file/subvolume tree for delayed subtree
+ * trace code. For detail check comment in fs/btrfs/qgroup.c.
+ */
+struct btrfs_qgroup_swapped_blocks {
+	spinlock_t lock;
+	struct rb_root blocks[BTRFS_MAX_LEVEL];
+	/* RM_EMPTY_ROOT() of above blocks[] */
+	bool swapped;
+};
+
 /*
  * in ram representation of the tree.  extent_root is used for all allocations
  * and for the extent tree extent_root root.
@@ -1339,6 +1350,9 @@ struct btrfs_root {
 	/* Number of active swapfiles */
 	atomic_t nr_swapfiles;
 
+	/* Record pairs of swapped blocks for qgroup */
+	struct btrfs_qgroup_swapped_blocks swapped_blocks;
+
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 	u64 alloc_bytenr;
 #endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bfefa1de0455..31b2facdfc1e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1219,6 +1219,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	root->anon_dev = 0;
 
 	spin_lock_init(&root->root_item_lock);
+	btrfs_qgroup_init_swapped_blocks(&root->swapped_blocks);
 }
 
 static struct btrfs_root *btrfs_alloc_root(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 8fe6ebe9aef8..001830328960 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3797,3 +3797,133 @@ void btrfs_qgroup_check_reserved_leak(struct inode *inode)
 	}
 	extent_changeset_release(&changeset);
 }
+
+/*
+ * Delete all swapped blocks record of @root.
+ * Every record here means we skipped a full subtree scan for qgroup.
+ *
+ * Get called when commit one transaction.
+ */
+void btrfs_qgroup_clean_swapped_blocks(struct btrfs_root *root)
+{
+	struct btrfs_qgroup_swapped_blocks *swapped_blocks;
+	int i;
+
+	swapped_blocks = &root->swapped_blocks;
+
+	spin_lock(&swapped_blocks->lock);
+	if (!swapped_blocks->swapped)
+		goto out;
+	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
+		struct rb_root *cur_root = &swapped_blocks->blocks[i];
+		struct btrfs_qgroup_swapped_block *entry;
+		struct btrfs_qgroup_swapped_block *next;
+
+		rbtree_postorder_for_each_entry_safe(entry, next, cur_root,
+						     node)
+			kfree(entry);
+		swapped_blocks->blocks[i] = RB_ROOT;
+	}
+	swapped_blocks->swapped = false;
+out:
+	spin_unlock(&swapped_blocks->lock);
+}
+
+/*
+ * Adding subtree roots record into @subv_root.
+ *
+ * @subv_root:		tree root of the subvolume tree get swapped
+ * @bg:			block group under balance
+ * @subv_parent/slot:	pointer to the subtree root in file tree
+ * @reloc_parent/slot:	pointer to the subtree root in reloc tree
+ *			BOTH POINTERS ARE BEFORE TREE SWAP
+ * @last_snapshot:	last snapshot generation of the file tree
+ */
+int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
+		struct btrfs_root *subv_root,
+		struct btrfs_block_group_cache *bg,
+		struct extent_buffer *subv_parent, int subv_slot,
+		struct extent_buffer *reloc_parent, int reloc_slot,
+		u64 last_snapshot)
+{
+	int level = btrfs_header_level(subv_parent) - 1;
+	struct btrfs_qgroup_swapped_blocks *blocks = &subv_root->swapped_blocks;
+	struct btrfs_fs_info *fs_info = subv_root->fs_info;
+	struct btrfs_qgroup_swapped_block *block;
+	struct rb_node **p = &blocks->blocks[level].rb_node;
+	struct rb_node *parent = NULL;
+	int ret = 0;
+
+	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+		return 0;
+
+	if (btrfs_node_ptr_generation(subv_parent, subv_slot) >
+		btrfs_node_ptr_generation(reloc_parent, reloc_slot)) {
+		btrfs_err_rl(fs_info,
+		"%s: bad parameter order, subv_gen=%llu reloc_gen=%llu",
+			__func__,
+			btrfs_node_ptr_generation(subv_parent, subv_slot),
+			btrfs_node_ptr_generation(reloc_parent, reloc_slot));
+		return -EUCLEAN;
+	}
+
+	block = kmalloc(sizeof(*block), GFP_NOFS);
+	if (!block) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * @reloc_parent/slot is still *BEFORE* swap, while @block is going to
+	 * record the bytenr *AFTER* swap, so we do the swap here.
+	 */
+	block->subv_bytenr = btrfs_node_blockptr(reloc_parent, reloc_slot);
+	block->subv_generation = btrfs_node_ptr_generation(reloc_parent,
+							   reloc_slot);
+	block->reloc_bytenr = btrfs_node_blockptr(subv_parent, subv_slot);
+	block->reloc_generation = btrfs_node_ptr_generation(subv_parent,
+							    subv_slot);
+	block->last_snapshot = last_snapshot;
+	block->level = level;
+	if (bg->flags & BTRFS_BLOCK_GROUP_DATA)
+		block->trace_leaf = true;
+	else
+		block->trace_leaf = false;
+	btrfs_node_key_to_cpu(reloc_parent, &block->first_key, reloc_slot);
+
+	/* Insert @block into @blocks */
+	spin_lock(&blocks->lock);
+	while (*p) {
+		struct btrfs_qgroup_swapped_block *entry;
+
+		parent = *p;
+		entry = rb_entry(parent, struct btrfs_qgroup_swapped_block,
+				 node);
+
+		if (entry->subv_bytenr < block->subv_bytenr)
+			p = &(*p)->rb_left;
+		else if (entry->subv_bytenr > block->subv_bytenr)
+			p = &(*p)->rb_right;
+		else {
+			if (entry->subv_generation != block->subv_generation ||
+			    entry->reloc_bytenr != block->reloc_bytenr ||
+			    entry->reloc_generation !=
+			    block->reloc_generation) {
+				WARN_ON_ONCE(1);
+				ret = -EEXIST;
+			}
+			kfree(block);
+			goto out_unlock;
+		}
+	}
+	rb_link_node(&block->node, parent, p);
+	rb_insert_color(&block->node, &blocks->blocks[level]);
+	blocks->swapped = true;
+out_unlock:
+	spin_unlock(&blocks->lock);
+out:
+	if (ret < 0)
+		fs_info->qgroup_flags |=
+			BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
+	return ret;
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 226956f0bd73..93d7f0998f03 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -6,6 +6,8 @@
 #ifndef BTRFS_QGROUP_H
 #define BTRFS_QGROUP_H
 
+#include <linux/spinlock.h>
+#include <linux/rbtree.h>
 #include "ulist.h"
 #include "delayed-ref.h"
 
@@ -37,6 +39,66 @@
  *    Normally at qgroup rescan and transaction commit time.
  */
 
+/*
+ * Special performance hack for balance.
+ *
+ * For balance, we need to swap subtree of file and reloc tree.
+ * In theory, we need to trace all subtree blocks of both file and reloc tree,
+ * since their owner has changed during such swap.
+ *
+ * However since balance has ensured that both subtrees are containing the
+ * same contents and have the same tree structures, such swap won't cause
+ * qgroup number change.
+ *
+ * But there is a race window between subtree swap and transaction commit,
+ * during that window, if we increase/decrease tree level or merge/split tree
+ * blocks, we still needs to trace original subtrees.
+ *
+ * So for balance, we use a delayed subtree trace, whose workflow is:
+ *
+ * 1) Record the subtree root block get swapped.
+ *
+ *    During subtree swap:
+ *    O = Old tree blocks
+ *    N = New tree blocks
+ *          reloc tree                         file tree X
+ *             Root                               Root
+ *            /    \                             /    \
+ *          NA     OB                          OA      OB
+ *        /  |     |  \                      /  |      |  \
+ *      NC  ND     OE  OF                   OC  OD     OE  OF
+ *
+ *   In these case, NA and OA is going to be swapped, record (NA, OA) into
+ *   file tree X.
+ *
+ * 2) After subtree swap.
+ *          reloc tree                         file tree X
+ *             Root                               Root
+ *            /    \                             /    \
+ *          OA     OB                          NA      OB
+ *        /  |     |  \                      /  |      |  \
+ *      OC  OD     OE  OF                   NC  ND     OE  OF
+ *
+ * 3a) CoW happens for OB
+ *     If we are going to CoW tree block OB, we check OB's bytenr against
+ *     tree X's swapped_blocks structure.
+ *     It doesn't fit any one, nothing will happen.
+ *
+ * 3b) CoW happens for NA
+ *     Check NA's bytenr against tree X's swapped_blocks, and get a hit.
+ *     Then we do subtree scan on both subtree OA and NA.
+ *     Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
+ *
+ *     Then no matter what we do to file tree X, qgroup numbers will
+ *     still be correct.
+ *     Then NA's record get removed from X's swapped_blocks.
+ *
+ * 4)  Transaction commit
+ *     Any record in X's swapped_blocks get removed, since there is no
+ *     modification to swapped subtrees, no need to trigger heavy qgroup
+ *     subtree rescan for them.
+ */
+
 /*
  * Record a dirty extent, and info qgroup to update quota on it
  * TODO: Use kmem cache to alloc it.
@@ -59,6 +121,24 @@ struct btrfs_qgroup_extent_record {
 	struct ulist *old_roots;
 };
 
+struct btrfs_qgroup_swapped_block {
+	struct rb_node node;
+
+	bool trace_leaf;
+	int level;
+
+	/* bytenr/generation of the tree block in subvolume tree after swap */
+	u64 subv_bytenr;
+	u64 subv_generation;
+
+	/* bytenr/generation of the tree block in reloc tree after swap */
+	u64 reloc_bytenr;
+	u64 reloc_generation;
+
+	u64 last_snapshot;
+	struct btrfs_key first_key;
+};
+
 /*
  * Qgroup reservation types:
  *
@@ -301,4 +381,23 @@ void btrfs_qgroup_convert_reserved_meta(struct btrfs_root *root, int num_bytes);
 
 void btrfs_qgroup_check_reserved_leak(struct inode *inode);
 
+/* btrfs_qgroup_swapped_blocks related functions */
+static inline void btrfs_qgroup_init_swapped_blocks(
+		struct btrfs_qgroup_swapped_blocks *swapped_blocks)
+{
+	int i;
+
+	spin_lock_init(&swapped_blocks->lock);
+	for (i = 0; i < BTRFS_MAX_LEVEL; i++)
+		swapped_blocks->blocks[i] = RB_ROOT;
+	swapped_blocks->swapped = false;
+}
+
+void btrfs_qgroup_clean_swapped_blocks(struct btrfs_root *root);
+int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
+		struct btrfs_root *subv_root,
+		struct btrfs_block_group_cache *bg,
+		struct extent_buffer *subv_parent, int subv_slot,
+		struct extent_buffer *reloc_parent, int reloc_slot,
+		u64 last_snapshot);
 #endif
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 1d5cfceb46c1..65f09b53a67d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1898,6 +1898,13 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 		if (ret < 0)
 			break;
 
+		btrfs_node_key_to_cpu(parent, &first_key, slot);
+		ret = btrfs_qgroup_add_swapped_blocks(trans, dest,
+				rc->block_group, parent, slot,
+				path->nodes[level], path->slots[level],
+				last_snapshot);
+		if (ret < 0)
+			break;
 		/*
 		 * swap blocks in fs tree and reloc tree.
 		 */
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 127fa1535f58..22b0dacae003 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -122,6 +122,7 @@ static noinline void switch_commit_roots(struct btrfs_transaction *trans)
 		if (is_fstree(root->root_key.objectid))
 			btrfs_unpin_free_ino(root);
 		clear_btree_io_tree(&root->dirty_log_pages);
+		btrfs_qgroup_clean_swapped_blocks(root);
 	}
 
 	/* We can free old roots now. */
-- 
2.20.1



* [PATCH v4 6/7] btrfs: qgroup: Use delayed subtree rescan for balance
  2019-01-15  8:15 [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Qu Wenruo
                   ` (4 preceding siblings ...)
  2019-01-15  8:16 ` [PATCH v4 5/7] btrfs: qgroup: Introduce per-root swapped blocks infrastructure Qu Wenruo
@ 2019-01-15  8:16 ` Qu Wenruo
  2019-01-22 17:05   ` David Sterba
  2019-01-15  8:16 ` [PATCH v4 7/7] btrfs: qgroup: Cleanup old subtree swap code Qu Wenruo
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2019-01-15  8:16 UTC (permalink / raw)
  To: linux-btrfs

Before this patch, the qgroup code traces the whole subtree of the file and
reloc trees unconditionally.

This keeps qgroup numbers consistent, but it can cause tons of unnecessary
extent tracing, which causes a lot of overhead.

However, for the subtree swap done by balance, since both subtrees contain
the same contents and tree structures, swapping them alone won't change
qgroup numbers.

It's only modifications in the race window between the subtree swap and the
transaction commit that could change qgroup numbers.

This patch delays the qgroup subtree scan until CoW happens for the swapped
subtree root.

So if there are no other operations on the fs, balance won't cause extra
qgroup overhead (best case scenario).
And depending on the workload, most of the subtree scans can still be
avoided.

Only in the worst case scenario will it fall back to the old subtree swap
overhead (scanning all swapped subtrees).

[[Benchmark]]
Hardware:
	VM 4G vRAM, 8 vCPUs,
	disk is using 'unsafe' cache mode,
	backing device is SAMSUNG 850 evo SSD.
	Host has 16G ram.

Mkfs parameter:
	--nodesize 4K (To bump up tree size)

Initial subvolume contents:
	4G data copied from /usr and /lib.
	(With enough regular small files)

Snapshots:
	16 snapshots of the original subvolume.
	each snapshot has 3 random files modified.

balance parameter:
	-m

So the content should be pretty similar to a real world root fs layout.

And after file system population, there is no other activity, so it
should be the best case scenario.

                     | v4.20-rc1            | w/ patchset    | diff
-----------------------------------------------------------------------
relocated extents    | 22615                | 22457          | -0.1%
qgroup dirty extents | 163457               | 121606         | -25.6%
time (sys)           | 22.884s              | 18.842s        | -17.6%
time (real)          | 27.724s              | 22.884s        | -17.5%

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.c      |  8 ++++
 fs/btrfs/qgroup.c     | 88 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.h     |  2 +
 fs/btrfs/relocation.c | 14 +++----
 4 files changed, 103 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d92462fe66c8..ed28aa7c5f5c 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -13,6 +13,7 @@
 #include "print-tree.h"
 #include "locking.h"
 #include "volumes.h"
+#include "qgroup.h"
 
 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
 		      *root, struct btrfs_path *path, int level);
@@ -1465,6 +1466,13 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
 		btrfs_set_lock_blocking(parent);
 	btrfs_set_lock_blocking(buf);
 
+	/*
+	 * Before CoWing this block for later modification, check if it's
+	 * the subtree root and do the delayed subtree trace if needed.
+	 *
+	 * Also We don't care about the error, as it's handled internally.
+	 */
+	btrfs_qgroup_trace_subtree_after_cow(trans, root, buf);
 	ret = __btrfs_cow_block(trans, root, buf, parent,
 				 parent_slot, cow_ret, search_start, 0);
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 001830328960..7aaf2ee9c244 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3927,3 +3927,91 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
 			BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
 	return ret;
 }
+
+/*
+ * Check if the tree block is a subtree root, and if so do the needed
+ * delayed subtree trace for qgroup.
+ *
+ * This is called during btrfs_cow_block().
+ */
+int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
+					 struct btrfs_root *root,
+					 struct extent_buffer *subv_eb)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_qgroup_swapped_blocks *blocks = &root->swapped_blocks;
+	struct btrfs_qgroup_swapped_block *block;
+	struct extent_buffer *reloc_eb = NULL;
+	struct rb_node *n;
+	bool found = false;
+	bool swapped = false;
+	int level = btrfs_header_level(subv_eb);
+	int ret = 0;
+	int i;
+
+	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+		return 0;
+	if (!is_fstree(root->root_key.objectid) || !root->reloc_root)
+		return 0;
+
+	spin_lock(&blocks->lock);
+	if (!blocks->swapped) {
+		spin_unlock(&blocks->lock);
+		goto out;
+	}
+	n = blocks->blocks[level].rb_node;
+
+	while (n) {
+		block = rb_entry(n, struct btrfs_qgroup_swapped_block, node);
+		if (block->subv_bytenr < subv_eb->start)
+			n = n->rb_left;
+		else if (block->subv_bytenr > subv_eb->start)
+			n = n->rb_right;
+		else {
+			found = true;
+			break;
+		}
+	}
+	if (!found) {
+		spin_unlock(&blocks->lock);
+		goto out;
+	}
+	/* Found one, remove it from @blocks first and update blocks->swapped */
+	rb_erase(&block->node, &blocks->blocks[level]);
+	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
+		if (RB_EMPTY_ROOT(&blocks->blocks[i])) {
+			swapped = true;
+			break;
+		}
+	}
+	blocks->swapped = swapped;
+	spin_unlock(&blocks->lock);
+
+	/* Read out reloc subtree root */
+	reloc_eb = read_tree_block(fs_info, block->reloc_bytenr,
+				   block->reloc_generation, block->level,
+				   &block->first_key);
+	if (IS_ERR(reloc_eb)) {
+		ret = PTR_ERR(reloc_eb);
+		reloc_eb = NULL;
+		goto free_out;
+	}
+	if (!extent_buffer_uptodate(reloc_eb)) {
+		ret = -EIO;
+		goto free_out;
+	}
+
+	ret = qgroup_trace_subtree_swap(trans, reloc_eb, subv_eb,
+			block->last_snapshot, block->trace_leaf);
+free_out:
+	kfree(block);
+	free_extent_buffer(reloc_eb);
+out:
+	if (ret < 0) {
+		btrfs_err_rl(fs_info,
+			     "failed to account subtree at bytenr %llu: %d",
+			     subv_eb->start, ret);
+		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
+	}
+	return ret;
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 93d7f0998f03..9c69f20f7c0d 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -400,4 +400,6 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
 		struct extent_buffer *subv_parent, int subv_slot,
 		struct extent_buffer *reloc_parent, int reloc_slot,
 		u64 last_snapshot);
+int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
+		struct btrfs_root *root, struct extent_buffer *eb);
 #endif
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 65f09b53a67d..32dfe240826d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1889,16 +1889,12 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 		 *    If not traced, we will leak data numbers
 		 * 2) Fs subtree
 		 *    If not traced, we will double count old data
-		 *    and tree block numbers, if current trans doesn't free
-		 *    data reloc tree inode.
+		 *
+		 * We don't scan the subtree right now, but only record
+		 * the swapped tree blocks.
+		 * The real subtree rescan is delayed until we have new
+		 * CoW on the subtree root node before transaction commit.
 		 */
-		ret = btrfs_qgroup_trace_subtree_swap(trans, rc->block_group,
-				parent, slot, path->nodes[level],
-				path->slots[level], last_snapshot);
-		if (ret < 0)
-			break;
-
-		btrfs_node_key_to_cpu(parent, &first_key, slot);
 		ret = btrfs_qgroup_add_swapped_blocks(trans, dest,
 				rc->block_group, parent, slot,
 				path->nodes[level], path->slots[level],
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v4 7/7] btrfs: qgroup: Cleanup old subtree swap code
  2019-01-15  8:15 [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Qu Wenruo
                   ` (5 preceding siblings ...)
  2019-01-15  8:16 ` [PATCH v4 6/7] btrfs: qgroup: Use delayed subtree rescan for balance Qu Wenruo
@ 2019-01-15  8:16 ` Qu Wenruo
  2019-01-15 17:26 ` [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Josef Bacik
  2019-01-22 16:28 ` David Sterba
  8 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2019-01-15  8:16 UTC (permalink / raw)
  To: linux-btrfs

Since it has been replaced by the new delayed subtree swap code, remove
the original code.

The cleanup is small since most of its core functionality is still used by
the delayed subtree swap tracing.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c | 94 -----------------------------------------------
 fs/btrfs/qgroup.h |  6 ---
 2 files changed, 100 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 7aaf2ee9c244..284df1101a5c 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2048,100 +2048,6 @@ static int qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-/*
- * Inform qgroup to trace subtree swap used in balance.
- *
- * Unlike btrfs_qgroup_trace_subtree(), this function will only trace
- * new tree blocks whose generation is equal to (or larger than) @last_snapshot.
- *
- * Will go down the tree block pointed by @dst_eb (pointed by @dst_parent and
- * @dst_slot), and find any tree blocks whose generation is at @last_snapshot,
- * and then go down @src_eb (pointed by @src_parent and @src_slot) to find
- * the counterpart of the tree block, then mark both tree blocks as qgroup dirty,
- * and skip all tree blocks whose generation is smaller than last_snapshot.
- *
- * This would skip tons of tree blocks of original btrfs_qgroup_trace_subtree(),
- * which could be the cause of very slow balance if the file tree is large.
- *
- * @src_parent, @src_slot: pointer to src (file tree) eb.
- * @dst_parent, @dst_slot: pointer to dst (reloc tree) eb.
- */
-int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
-				struct btrfs_block_group_cache *bg_cache,
-				struct extent_buffer *src_parent, int src_slot,
-				struct extent_buffer *dst_parent, int dst_slot,
-				u64 last_snapshot)
-{
-	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_key first_key;
-	struct extent_buffer *src_eb = NULL;
-	struct extent_buffer *dst_eb = NULL;
-	bool trace_leaf = false;
-	u64 child_gen;
-	u64 child_bytenr;
-	int ret;
-
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
-		return 0;
-
-	/* Check parameter order */
-	if (btrfs_node_ptr_generation(src_parent, src_slot) >
-	    btrfs_node_ptr_generation(dst_parent, dst_slot)) {
-		btrfs_err_rl(fs_info,
-		"%s: bad parameter order, src_gen=%llu dst_gen=%llu", __func__,
-			btrfs_node_ptr_generation(src_parent, src_slot),
-			btrfs_node_ptr_generation(dst_parent, dst_slot));
-		return -EUCLEAN;
-	}
-
-	/*
-	 * Only trace leaf if we're relocating data block groups, this could
-	 * reduce tons of data extents tracing for meta/sys bg relocation.
-	 */
-	if (bg_cache->flags & BTRFS_BLOCK_GROUP_DATA)
-		trace_leaf = true;
-	/* Read out real @src_eb, pointed by @src_parent and @src_slot */
-	child_bytenr = btrfs_node_blockptr(src_parent, src_slot);
-	child_gen = btrfs_node_ptr_generation(src_parent, src_slot);
-	btrfs_node_key_to_cpu(src_parent, &first_key, src_slot);
-
-	src_eb = read_tree_block(fs_info, child_bytenr, child_gen,
-			btrfs_header_level(src_parent) - 1, &first_key);
-	if (IS_ERR(src_eb)) {
-		ret = PTR_ERR(src_eb);
-		goto out;
-	}
-
-	/* Read out real @dst_eb, pointed by @src_parent and @src_slot */
-	child_bytenr = btrfs_node_blockptr(dst_parent, dst_slot);
-	child_gen = btrfs_node_ptr_generation(dst_parent, dst_slot);
-	btrfs_node_key_to_cpu(dst_parent, &first_key, dst_slot);
-
-	dst_eb = read_tree_block(fs_info, child_bytenr, child_gen,
-			btrfs_header_level(dst_parent) - 1, &first_key);
-	if (IS_ERR(dst_eb)) {
-		ret = PTR_ERR(dst_eb);
-		goto out;
-	}
-
-	if (!extent_buffer_uptodate(src_eb) || !extent_buffer_uptodate(dst_eb)) {
-		ret = -EINVAL;
-		goto out;
-	}
-
-	/* Do the generation aware breadth-first search */
-	ret = qgroup_trace_subtree_swap(trans, src_eb, dst_eb, last_snapshot,
-					trace_leaf);
-	if (ret < 0)
-		goto out;
-	ret = 0;
-
-out:
-	free_extent_buffer(src_eb);
-	free_extent_buffer(dst_eb);
-	return ret;
-}
-
 int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			       struct extent_buffer *root_eb,
 			       u64 root_gen, int root_level)
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 9c69f20f7c0d..ff96f5678de5 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -301,12 +301,6 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
 int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			       struct extent_buffer *root_eb,
 			       u64 root_gen, int root_level);
-
-int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
-				struct btrfs_block_group_cache *bg_cache,
-				struct extent_buffer *src_parent, int src_slot,
-				struct extent_buffer *dst_parent, int dst_slot,
-				u64 last_snapshot);
 int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 				u64 num_bytes, struct ulist *old_roots,
 				struct ulist *new_roots);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead
  2019-01-15  8:15 [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Qu Wenruo
                   ` (6 preceding siblings ...)
  2019-01-15  8:16 ` [PATCH v4 7/7] btrfs: qgroup: Cleanup old subtree swap code Qu Wenruo
@ 2019-01-15 17:26 ` Josef Bacik
  2019-01-16  0:31   ` Qu Wenruo
  2019-01-22 16:28 ` David Sterba
  8 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2019-01-15 17:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Jan 15, 2019 at 04:15:57PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree
> 
> Which is based on v5.0-rc1.
> 

I've been testing these patches hoping to get rid of the qgroup deadlock that
these patches are supposed to fix, but instead they make the box completely
unusable with 100% cpu usage for minutes at a time at every transaction commit.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead
  2019-01-15 17:26 ` [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Josef Bacik
@ 2019-01-16  0:31   ` Qu Wenruo
  2019-01-16  1:07     ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2019-01-16  0:31 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs





On 2019/1/16 上午1:26, Josef Bacik wrote:
> On Tue, Jan 15, 2019 at 04:15:57PM +0800, Qu Wenruo wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree
>>
>> Which is based on v5.0-rc1.
>>
> 
> I've been testing these patches hoping to get rid of the qgroup deadlock that
> these patches are supposed to fix, but instead they make the box completely
> unusable with 100% cpu usage for minutes at a time at every transaction commit.

I'm afraid it's related to the v5.0-rc1 base, not the patchset itself.

Just try to balance metadata with 16 snapshots, you'll see btrfs bumping
its generation like crazy, no matter if quota is enabled or not.

And since btrfs is committing transactions like crazy, no wonder it will
do tons of qgroup accounting.

My bisect leads to commit 64403612b73a94bc7b02cf8ca126e3b8ced6e921
btrfs: rework btrfs_check_space_for_delayed_refs.

Thanks,
Qu

> Thanks,
> 
> Josef
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead
  2019-01-16  0:31   ` Qu Wenruo
@ 2019-01-16  1:07     ` Qu Wenruo
  2019-01-16  1:15       ` Josef Bacik
  0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2019-01-16  1:07 UTC (permalink / raw)
  To: Qu Wenruo, Josef Bacik; +Cc: linux-btrfs





On 2019/1/16 上午8:31, Qu Wenruo wrote:
> 
> 
> On 2019/1/16 上午1:26, Josef Bacik wrote:
>> On Tue, Jan 15, 2019 at 04:15:57PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree
>>>
>>> Which is based on v5.0-rc1.
>>>
>>
>> I've been testing these patches hoping to get rid of the qgroup deadlock that
>> these patches are supposed to fix, but instead they make the box completely
>> unusable with 100% cpu usage for minutes at a time at every transaction commit.
> 
> I'm afraid it's related to the v5.0-rc1 base, not the patchset itself.
> 
> Just try to balance metadata with 16 snapshots, you'll see btrfs bumping
> its generation like crazy, no matter if quota is enabled or not.
> 
> And since btrfs is committing transaction like crazy, no wonder it will
> do tons of qgroup accounting.
> 
> My bisect leads to commit 64403612b73a94bc7b02cf8ca126e3b8ced6e921
> btrfs: rework btrfs_check_space_for_delayed_refs.

Furthermore, you could try this RFC test case to see the problem.
https://patchwork.kernel.org/patch/10761715/

It would only take around 100s for v4.20 but over 500s for v5.0-rc1, with
tons of unnecessary transactions committed for nothing, no quota enabled.

So I'm afraid that commit is blocking my qgroup patchset.

Thanks,
Qu


> 
> Thanks,
> Qu
> 
>> Thanks,
>>
>> Josef
>>
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead
  2019-01-16  1:07     ` Qu Wenruo
@ 2019-01-16  1:15       ` Josef Bacik
  2019-01-16  1:29         ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2019-01-16  1:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Josef Bacik, linux-btrfs

On Wed, Jan 16, 2019 at 09:07:26AM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/1/16 上午8:31, Qu Wenruo wrote:
> > 
> > 
> > On 2019/1/16 上午1:26, Josef Bacik wrote:
> >> On Tue, Jan 15, 2019 at 04:15:57PM +0800, Qu Wenruo wrote:
> >>> This patchset can be fetched from github:
> >>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree
> >>>
> >>> Which is based on v5.0-rc1.
> >>>
> >>
> >> I've been testing these patches hoping to get rid of the qgroup deadlock that
> >> these patches are supposed to fix, but instead they make the box completely
> >> unusable with 100% cpu usage for minutes at a time at every transaction commit.
> > 
> > I'm afraid it's related to the v5.0-rc1 base, not the patchset itself.
> > 
> > Just try to balance metadata with 16 snapshots, you'll see btrfs bumping
> > its generation like crazy, no matter if quota is enabled or not.
> > 
> > And since btrfs is committing transaction like crazy, no wonder it will
> > do tons of qgroup accounting.
> > 
> > My bisect leads to commit 64403612b73a94bc7b02cf8ca126e3b8ced6e921
> > btrfs: rework btrfs_check_space_for_delayed_refs.
> 
> Furthermore, you could try this RFC test case to see the problem.
> https://patchwork.kernel.org/patch/10761715/
> 
> It would only take around 100s for v4.20 but over 500 for v5.0-rc1 with
> tons of unnecessary transaction committed for nothing, no quota enabled.
> 
> So I'm afraid that commit is blocking my qgroup patchset.
> 

I've fixed the balance problem, it took 2 seconds to figure out, I'm just
waiting on xfstests to finish running.

And your patch making things worse has nothing to do with that problem.  Our
test doesn't run balance, so the issue you reported has nothing to do with the
fact that your patch makes our boxes unusable with qgroups on.

The problem is that with your deadlock avoidance patch we're now making a lot
more dirty extents to run in the critical section of the transaction commit.
Also, because we're no longer pre-fetching the old roots, we're doing both the
old roots and new roots lookups inside the critical section, so now each dirty
extent takes 2x as long.  With my basic test it was taking 5 minutes to do the
qgroup accounting, which essentially keeps the box from booting.

Without your patch it's still awful because btrfs-cleaner just sits there at
100% while deleting snapshots, but at least it's not making the whole system
stop running while it does all that work in the transaction commit.

And if you had done the proper root cause analysis you would have noticed that
we're taking tree locks in the find_parent_nodes() case when we're searching the
commit root, something we shouldn't be doing.  So all that really needs to be
done is to check if (!path->skip_locking) btrfs_tree_read_lock(eb); in those
cases and the deadlock goes away.  Because no matter what we shouldn't be taking
locks when we're not given a trans in the backref lookup code.  So the fact that
we were doing just that and thus deadlocking should have been a red flag.
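
In backref.c the idea is roughly the following (just a sketch of the shape,
not the tested patch; the exact blocking-lock handling still needs to be
checked against what I send):

		if (!path->skip_locking)
			btrfs_tree_read_lock(eb);
		/* process the leaf (find_extent_in_eb()) as before */
		if (!path->skip_locking)
			btrfs_tree_read_unlock(eb);
		free_extent_buffer(eb);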

I will be sending these patches in the morning, once all of the various testing
that should take place on patches is complete.  The balance patches you have for
qgroups don't appear to be a problem, but that deadlock one is bogus and needs
to be dropped.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead
  2019-01-16  1:15       ` Josef Bacik
@ 2019-01-16  1:29         ` Qu Wenruo
  2019-01-16  1:34           ` Josef Bacik
  0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2019-01-16  1:29 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Qu Wenruo, linux-btrfs





On 2019/1/16 上午9:15, Josef Bacik wrote:
> On Wed, Jan 16, 2019 at 09:07:26AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2019/1/16 上午8:31, Qu Wenruo wrote:
>>>
>>>
>>> On 2019/1/16 上午1:26, Josef Bacik wrote:
>>>> On Tue, Jan 15, 2019 at 04:15:57PM +0800, Qu Wenruo wrote:
>>>>> This patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree
>>>>>
>>>>> Which is based on v5.0-rc1.
>>>>>
>>>>
>>>> I've been testing these patches hoping to get rid of the qgroup deadlock that
>>>> these patches are supposed to fix, but instead they make the box completely
>>>> unusable with 100% cpu usage for minutes at a time at every transaction commit.
>>>
>>> I'm afraid it's related to the v5.0-rc1 base, not the patchset itself.
>>>
>>> Just try to balance metadata with 16 snapshots, you'll see btrfs bumping
>>> its generation like crazy, no matter if quota is enabled or not.
>>>
>>> And since btrfs is committing transaction like crazy, no wonder it will
>>> do tons of qgroup accounting.
>>>
>>> My bisect leads to commit 64403612b73a94bc7b02cf8ca126e3b8ced6e921
>>> btrfs: rework btrfs_check_space_for_delayed_refs.
>>
>> Furthermore, you could try this RFC test case to see the problem.
>> https://patchwork.kernel.org/patch/10761715/
>>
>> It would only take around 100s for v4.20 but over 500 for v5.0-rc1 with
>> tons of unnecessary transaction committed for nothing, no quota enabled.
>>
>> So I'm afraid that commit is blocking my qgroup patchset.
>>
> 
> I've fixed the balance problem, it took 2 seconds to figure out, I'm just
> waiting on xfstests to finish running.
> 
> And your patch making things worse has nothing to do with that problem.  Our
> test doesn't run balance, so the issue you reported has nothing to do with the
> fact that your patch makes our boxes unusable with qgroups on.
> 
> The problem is with your deadlock avoidence patch we're now making a lot more
> dirty extents to run in the critical section of the transaction commit.  Also
> because we're no longer pre-fetching the old roots, we're doing the old roots
> and new roots lookup inside the critical section, so now each dirty extent takes
> 2x as long.  With my basic test it was taking 5 minutes to do the qgroup
> accounting, which keeps the box from booting essentially.
> 
> Without your patch it's still awful because btrfs-cleaner just sits there at
> 100% while deleting snapshots, but at least it's not making the whole system
> stop running while it does all that work in the transaction commit.
> 
> And if you had done the proper root cause analysis you would have noticed that
> we're taking tree locks in the find_parent_nodes() case when we're searching the
> commit root, something we shouldn't be doing.  So all that really needs to be
> done is to check if (!path->skip_locking) btrfs_tree_read_lock(eb); in those
> cases and the deadlock goes away.  Because no matter what we shouldn't be taking
> locks when we're not given a trans in the backref lookup code.

That indeed looks much better than my current solution.

Although I'm not 100% sure for cases like tree blocks shared between
commit and current root (tree block not modified yet).

I'll definitely invest more time to try to fix this bug.

Thanks,
Qu

>  So the fact that
> we were doing just that and thus deadlocking should have been a red flag.
> 
> I will be sending these patches in the morning, once all of the various testing
> that should take place on patches is complete.  The balance patches you have for
> qgroups don't appear to be a problem, but that deadlock one is bogus and needs
> to be dropped.  Thanks,
> 
> Josef
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead
  2019-01-16  1:29         ` Qu Wenruo
@ 2019-01-16  1:34           ` Josef Bacik
  2019-01-16  1:36             ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2019-01-16  1:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Josef Bacik, Qu Wenruo, linux-btrfs

On Wed, Jan 16, 2019 at 09:29:29AM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/1/16 上午9:15, Josef Bacik wrote:
> > On Wed, Jan 16, 2019 at 09:07:26AM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2019/1/16 上午8:31, Qu Wenruo wrote:
> >>>
> >>>
> >>> On 2019/1/16 上午1:26, Josef Bacik wrote:
> >>>> On Tue, Jan 15, 2019 at 04:15:57PM +0800, Qu Wenruo wrote:
> >>>>> This patchset can be fetched from github:
> >>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree
> >>>>>
> >>>>> Which is based on v5.0-rc1.
> >>>>>
> >>>>
> >>>> I've been testing these patches hoping to get rid of the qgroup deadlock that
> >>>> these patches are supposed to fix, but instead they make the box completely
> >>>> unusable with 100% cpu usage for minutes at a time at every transaction commit.
> >>>
> >>> I'm afraid it's related to the v5.0-rc1 base, not the patchset itself.
> >>>
> >>> Just try to balance metadata with 16 snapshots, you'll see btrfs bumping
> >>> its generation like crazy, no matter if quota is enabled or not.
> >>>
> >>> And since btrfs is committing transaction like crazy, no wonder it will
> >>> do tons of qgroup accounting.
> >>>
> >>> My bisect leads to commit 64403612b73a94bc7b02cf8ca126e3b8ced6e921
> >>> btrfs: rework btrfs_check_space_for_delayed_refs.
> >>
> >> Furthermore, you could try this RFC test case to see the problem.
> >> https://patchwork.kernel.org/patch/10761715/
> >>
> >> It would only take around 100s for v4.20 but over 500 for v5.0-rc1 with
> >> tons of unnecessary transaction committed for nothing, no quota enabled.
> >>
> >> So I'm afraid that commit is blocking my qgroup patchset.
> >>
> > 
> > I've fixed the balance problem, it took 2 seconds to figure out, I'm just
> > waiting on xfstests to finish running.
> > 
> > And your patch making things worse has nothing to do with that problem.  Our
> > test doesn't run balance, so the issue you reported has nothing to do with the
> > fact that your patch makes our boxes unusable with qgroups on.
> > 
> > The problem is with your deadlock avoidence patch we're now making a lot more
> > dirty extents to run in the critical section of the transaction commit.  Also
> > because we're no longer pre-fetching the old roots, we're doing the old roots
> > and new roots lookup inside the critical section, so now each dirty extent takes
> > 2x as long.  With my basic test it was taking 5 minutes to do the qgroup
> > accounting, which keeps the box from booting essentially.
> > 
> > Without your patch it's still awful because btrfs-cleaner just sits there at
> > 100% while deleting snapshots, but at least it's not making the whole system
> > stop running while it does all that work in the transaction commit.
> > 
> > And if you had done the proper root cause analysis you would have noticed that
> > we're taking tree locks in the find_parent_nodes() case when we're searching the
> > commit root, something we shouldn't be doing.  So all that really needs to be
> > done is to check if (!path->skip_locking) btrfs_tree_read_lock(eb); in those
> > cases and the deadlock goes away.  Because no matter what we shouldn't be taking
> > locks when we're not given a trans in the backref lookup code.
> 
> That indeed looks much better than my current solution.
> 
> Although I'm not 100% sure for cases like tree blocks shared between
> commit and current root (tree block not modified yet).

Doesn't matter, the block can't be modified so we don't need the locking.  If
the current root needs to change it then it CoWs it and messes with the new eb,
not the one attached to the commit root.

> 
> I'll definitely invest more time to try to fix this bug.
> 

Don't worry about it, I'm already running the patch through my tests, I'll send
them in the morning when the tests are finished.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead
  2019-01-16  1:34           ` Josef Bacik
@ 2019-01-16  1:36             ` Qu Wenruo
  0 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2019-01-16  1:36 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo; +Cc: linux-btrfs





On 2019/1/16 上午9:34, Josef Bacik wrote:
> On Wed, Jan 16, 2019 at 09:29:29AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2019/1/16 上午9:15, Josef Bacik wrote:
>>> On Wed, Jan 16, 2019 at 09:07:26AM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2019/1/16 上午8:31, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2019/1/16 上午1:26, Josef Bacik wrote:
>>>>>> On Tue, Jan 15, 2019 at 04:15:57PM +0800, Qu Wenruo wrote:
>>>>>>> This patchset can be fetched from github:
>>>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree
>>>>>>>
>>>>>>> Which is based on v5.0-rc1.
>>>>>>>
>>>>>>
>>>>>> I've been testing these patches hoping to get rid of the qgroup deadlock that
>>>>>> these patches are supposed to fix, but instead they make the box completely
>>>>>> unusable with 100% cpu usage for minutes at a time at every transaction commit.
>>>>>
>>>>> I'm afraid it's related to the v5.0-rc1 base, not the patchset itself.
>>>>>
>>>>> Just try to balance metadata with 16 snapshots, you'll see btrfs bumping
>>>>> its generation like crazy, no matter if quota is enabled or not.
>>>>>
>>>>> And since btrfs is committing transaction like crazy, no wonder it will
>>>>> do tons of qgroup accounting.
>>>>>
>>>>> My bisect leads to commit 64403612b73a94bc7b02cf8ca126e3b8ced6e921
>>>>> btrfs: rework btrfs_check_space_for_delayed_refs.
>>>>
>>>> Furthermore, you could try this RFC test case to see the problem.
>>>> https://patchwork.kernel.org/patch/10761715/
>>>>
>>>> It would only take around 100s for v4.20 but over 500 for v5.0-rc1 with
>>>> tons of unnecessary transaction committed for nothing, no quota enabled.
>>>>
>>>> So I'm afraid that commit is blocking my qgroup patchset.
>>>>
>>>
>>> I've fixed the balance problem, it took 2 seconds to figure out, I'm just
>>> waiting on xfstests to finish running.
>>>
>>> And your patch making things worse has nothing to do with that problem.  Our
>>> test doesn't run balance, so the issue you reported has nothing to do with the
>>> fact that your patch makes our boxes unusable with qgroups on.
>>>
>>> The problem is with your deadlock avoidence patch we're now making a lot more
>>> dirty extents to run in the critical section of the transaction commit.  Also
>>> because we're no longer pre-fetching the old roots, we're doing the old roots
>>> and new roots lookup inside the critical section, so now each dirty extent takes
>>> 2x as long.  With my basic test it was taking 5 minutes to do the qgroup
>>> accounting, which keeps the box from booting essentially.
>>>
>>> Without your patch it's still awful because btrfs-cleaner just sits there at
>>> 100% while deleting snapshots, but at least it's not making the whole system
>>> stop running while it does all that work in the transaction commit.
>>>
>>> And if you had done the proper root cause analysis you would have noticed that
>>> we're taking tree locks in the find_parent_nodes() case when we're searching the
>>> commit root, something we shouldn't be doing.  So all that really needs to be
>>> done is to check if (!path->skip_locking) btrfs_tree_read_lock(eb); in those
>>> cases and the deadlock goes away.  Because no matter what we shouldn't be taking
>>> locks when we're not given a trans in the backref lookup code.
>>
>> That indeed looks much better than my current solution.
>>
>> Although I'm not 100% sure for cases like tree blocks shared between
>> commit and current root (tree block not modified yet).
> 
> Doesn't matter, the block can't be modified so we don't need the locking,  If
> the current root needs to change it then it cow's it and messes with the new eb,
> not the one attached to the commit root.
> 
>>
>> I'll definitely invest more time to try to fix this bug.
>>
> 
> Don't worry about it, I'm already running the patch through my tests, I'll send
> them in the morning when the tests are finished.  Thanks,

I couldn't be happier to drop that deadlock fix patch.

Thanks,
Qu

> 
> Josef
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead
  2019-01-15  8:15 [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Qu Wenruo
                   ` (7 preceding siblings ...)
  2019-01-15 17:26 ` [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Josef Bacik
@ 2019-01-22 16:28 ` David Sterba
  2019-01-23  5:43   ` Qu Wenruo
  8 siblings, 1 reply; 24+ messages in thread
From: David Sterba @ 2019-01-22 16:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Jan 15, 2019 at 04:15:57PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree
> 
> Which is based on v5.0-rc1.
> 
> This patch address the heavy load subtree scan, but delaying it until
> we're going to modify the swapped tree block.

I'd like to merge this patchset; the overall logic is fine, but I've
found lots of coding style issues, too many to fix just by myself.

> v4:
> - Renaming members from "file_*" to "subv_*".
>   Members like "file_bytenr" is pretty confusing, renaming it to
>   "subv_bytenr" avoid the confusion.

The naming change is incomplete, there are many more references to 'file
tree' in comments and changelogs, the naming should be unified
everywhere. Also I'd prefer to use 'subvol' though it's a bit longer
than 'subv', the former seems to be the most common form.

I'll comment the patches with the rest.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 3/7] btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots()
  2019-01-15  8:16 ` [PATCH v4 3/7] btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots() Qu Wenruo
@ 2019-01-22 16:32   ` David Sterba
  2019-01-23  6:01     ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: David Sterba @ 2019-01-22 16:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Jan 15, 2019 at 04:16:00PM +0800, Qu Wenruo wrote:
> And to co-operate this, also delayed btrfs_drop_snapshot() call on reloc
> tree, btrfs_drop_snapshot() call will also be delayed to
> clean_dirty_subvs().

Can you please rephrase this paragraph?

> This patch will increase the size of btrfs_root by 16 bytes.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

> +static int clean_dirty_subvs(struct reloc_control *rc)
> +{
> +	struct btrfs_root *root;
> +	struct btrfs_root *next;
> +	int err = 0;
> +	int ret;
> +
> +	list_for_each_entry_safe(root, next, &rc->dirty_subv_roots,
> +				 reloc_dirty_list) {
> +		struct btrfs_root *reloc_root = root->reloc_root;
> +
> +		clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
> +		list_del_init(&root->reloc_dirty_list);
> +		root->reloc_root = NULL;
> +		if (reloc_root) {
> +			ret = btrfs_drop_snapshot(reloc_root, NULL, 0, 1);
> +			if (ret < 0 && !err)
> +				err = ret;
> +		}
> +		btrfs_put_fs_root(root);
> +	}
> +	return err;

Please don't use the err/ret style; use 'ret' to match the function return
type and ret2 etc. for the temporary return values.

> +}

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 4/7] btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swap()
  2019-01-15  8:16 ` [PATCH v4 4/7] btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swap() Qu Wenruo
@ 2019-01-22 16:38   ` David Sterba
  0 siblings, 0 replies; 24+ messages in thread
From: David Sterba @ 2019-01-22 16:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Jan 15, 2019 at 04:16:01PM +0800, Qu Wenruo wrote:
> Refactor btrfs_qgroup_trace_subtree_swap() into
> qgroup_trace_subtree_swap(), which only needs two extent buffer and some
> other bool to control the behavior.
> 
> This provides the basis for later delayed subtree scan work.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/qgroup.c | 78 ++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 57 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index ba30adac88a7..8fe6ebe9aef8 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1994,6 +1994,60 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
>  	return ret;
>  }
>  
> +static int qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
> +				struct extent_buffer *src_eb,
> +				struct extent_buffer *dst_eb,
> +				u64 last_snapshot, bool trace_leaf)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_path *dst_path = NULL;
> +	int level;
> +	int ret;
> +
> +	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +		return 0;
> +
> +	/* Wrong parameter order */
> +	if (btrfs_header_generation(src_eb) > btrfs_header_generation(dst_eb)) {
> +		btrfs_err_rl(fs_info,
> +		"%s: bad parameter order, src_gen=%llu dst_gen=%llu", __func__,
> +			     btrfs_header_generation(src_eb),
> +			     btrfs_header_generation(dst_eb));
> +		return -EUCLEAN;
> +	}
> +
> +	if (!extent_buffer_uptodate(src_eb) ||
> +	    !extent_buffer_uptodate(dst_eb)) {
> +		ret = -EINVAL;

Why is this EINVAL and not EIO or EUCLEAN? There's a similar pattern in
btrfs_qgroup_trace_subtree_swap that also does not look correct.

> +		goto out;
> +	}
> +
> +	level = btrfs_header_level(dst_eb);
> +	dst_path = btrfs_alloc_path();
> +	if (!dst_path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	/* For dst_path */
> +	extent_buffer_get(dst_eb);
> +	dst_path->nodes[level] = dst_eb;
> +	dst_path->slots[level] = 0;
> +	dst_path->locks[level] = 0;
> +
> +	/* Do the generation aware breadth-first search */
> +	ret = qgroup_trace_new_subtree_blocks(trans, src_eb, dst_path, level,
> +					      level, last_snapshot, trace_leaf);
> +	if (ret < 0)
> +		goto out;
> +	ret = 0;
> +
> +out:
> +	btrfs_free_path(dst_path);
> +	if (ret < 0)
> +		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 5/7] btrfs: qgroup: Introduce per-root swapped blocks infrastructure
  2019-01-15  8:16 ` [PATCH v4 5/7] btrfs: qgroup: Introduce per-root swapped blocks infrastructure Qu Wenruo
@ 2019-01-22 16:55   ` David Sterba
  2019-01-22 23:07     ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: David Sterba @ 2019-01-22 16:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Jan 15, 2019 at 04:16:02PM +0800, Qu Wenruo wrote:
> +/*
> + * Record swapped tree blocks of a file/subvolume tree for delayed subtree
> + * trace code. For detail check comment in fs/btrfs/qgroup.c.
> + */
> +struct btrfs_qgroup_swapped_blocks {
> +	spinlock_t lock;
> +	struct rb_root blocks[BTRFS_MAX_LEVEL];
> +	/* RM_EMPTY_ROOT() of above blocks[] */
> +	bool swapped;

Reordering 'swapped' after the lock will save at least 8 bytes of the
structure size due to padding.
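
Ie. something like this (untested, just to show the intended layout):

struct btrfs_qgroup_swapped_blocks {
	spinlock_t lock;
	/* Is there any entry in blocks[] below? */
	bool swapped;
	struct rb_root blocks[BTRFS_MAX_LEVEL];
};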

> +};
> +
>  /*
>   * in ram representation of the tree.  extent_root is used for all allocations
>   * and for the extent tree extent_root root.
> @@ -1339,6 +1350,9 @@ struct btrfs_root {
>  	/* Number of active swapfiles */
>  	atomic_t nr_swapfiles;
>  
> +	/* Record pairs of swapped blocks for qgroup */
> +	struct btrfs_qgroup_swapped_blocks swapped_blocks;
> +
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>  	u64 alloc_bytenr;
>  #endif
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index bfefa1de0455..31b2facdfc1e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1219,6 +1219,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>  	root->anon_dev = 0;
>  
>  	spin_lock_init(&root->root_item_lock);
> +	btrfs_qgroup_init_swapped_blocks(&root->swapped_blocks);
>  }
>  
>  static struct btrfs_root *btrfs_alloc_root(struct btrfs_fs_info *fs_info,
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 8fe6ebe9aef8..001830328960 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -3797,3 +3797,133 @@ void btrfs_qgroup_check_reserved_leak(struct inode *inode)
>  	}
>  	extent_changeset_release(&changeset);
>  }
> +
> +/*
> + * Delete all swapped blocks record of @root.
> + * Every record here means we skipped a full subtree scan for qgroup.
> + *
> + * Get called when commit one transaction.
> + */
> +void btrfs_qgroup_clean_swapped_blocks(struct btrfs_root *root)
> +{
> +	struct btrfs_qgroup_swapped_blocks *swapped_blocks;
> +	int i;
> +
> +	swapped_blocks = &root->swapped_blocks;
> +
> +	spin_lock(&swapped_blocks->lock);
> +	if (!swapped_blocks->swapped)
> +		goto out;
> +	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
> +		struct rb_root *cur_root = &swapped_blocks->blocks[i];
> +		struct btrfs_qgroup_swapped_block *entry;
> +		struct btrfs_qgroup_swapped_block *next;
> +
> +		rbtree_postorder_for_each_entry_safe(entry, next, cur_root,
> +						     node)
> +			kfree(entry);
> +		swapped_blocks->blocks[i] = RB_ROOT;
> +	}
> +	swapped_blocks->swapped = false;
> +out:
> +	spin_unlock(&swapped_blocks->lock);
> +}
> +
> +/*
> + * Adding subtree roots record into @subv_root.
> + *
> + * @subv_root:		tree root of the subvolume tree get swapped
> + * @bg:			block group under balance
> + * @subv_parent/slot:	pointer to the subtree root in file tree
> + * @reloc_parent/slot:	pointer to the subtree root in reloc tree
> + *			BOTH POINTERS ARE BEFORE TREE SWAP
> + * @last_snapshot:	last snapshot generation of the file tree
> + */
> +int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
> +		struct btrfs_root *subv_root,
> +		struct btrfs_block_group_cache *bg,
> +		struct extent_buffer *subv_parent, int subv_slot,
> +		struct extent_buffer *reloc_parent, int reloc_slot,
> +		u64 last_snapshot)
> +{
> +	int level = btrfs_header_level(subv_parent) - 1;
> +	struct btrfs_qgroup_swapped_blocks *blocks = &subv_root->swapped_blocks;
> +	struct btrfs_fs_info *fs_info = subv_root->fs_info;
> +	struct btrfs_qgroup_swapped_block *block;
> +	struct rb_node **p = &blocks->blocks[level].rb_node;

Please rename 'p' to something that's not a single letter and move the
initialization before the block that uses it.

> +	struct rb_node *parent = NULL;
> +	int ret = 0;
> +
> +	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +		return 0;
> +
> +	if (btrfs_node_ptr_generation(subv_parent, subv_slot) >
> +		btrfs_node_ptr_generation(reloc_parent, reloc_slot)) {
> +		btrfs_err_rl(fs_info,
> +		"%s: bad parameter order, subv_gen=%llu reloc_gen=%llu",
> +			__func__,
> +			btrfs_node_ptr_generation(subv_parent, subv_slot),
> +			btrfs_node_ptr_generation(reloc_parent, reloc_slot));
> +		return -EUCLEAN;
> +	}
> +
> +	block = kmalloc(sizeof(*block), GFP_NOFS);
> +	if (!block) {
> +		ret = -ENOMEM;

This goes to 'out' that marks quota as inconsistent, is this
intentional? Ie. if this function does not succeed for whatever reason,
the quotas are indeed inconsistent.

> +		goto out;
> +	}
> +
> +	/*
> +	 * @reloc_parent/slot is still *BEFORE* swap, while @block is going to
> +	 * record the bytenr *AFTER* swap, so we do the swap here.

You don't need to *EMPHASIZE* that much

> +	 */
> +	block->subv_bytenr = btrfs_node_blockptr(reloc_parent, reloc_slot);
> +	block->subv_generation = btrfs_node_ptr_generation(reloc_parent,
> +							   reloc_slot);
> +	block->reloc_bytenr = btrfs_node_blockptr(subv_parent, subv_slot);
> +	block->reloc_generation = btrfs_node_ptr_generation(subv_parent,
> +							    subv_slot);
> +	block->last_snapshot = last_snapshot;
> +	block->level = level;
> +	if (bg->flags & BTRFS_BLOCK_GROUP_DATA)
> +		block->trace_leaf = true;
> +	else
> +		block->trace_leaf = false;
> +	btrfs_node_key_to_cpu(reloc_parent, &block->first_key, reloc_slot);
> +
> +	/* Insert @block into @blocks */
> +	spin_lock(&blocks->lock);

The initialization of p would go here

> +	while (*p) {
> +		struct btrfs_qgroup_swapped_block *entry;
> +
> +		parent = *p;
> +		entry = rb_entry(parent, struct btrfs_qgroup_swapped_block,
> +				 node);
> +
> +		if (entry->subv_bytenr < block->subv_bytenr)
> +			p = &(*p)->rb_left;
> +		else if (entry->subv_bytenr > block->subv_bytenr)
> +			p = &(*p)->rb_right;
> +		else {

I've pointed out missing { } around if/else if/else statements in patch
1, it would be good if you fixed that in all the patches.

> +			if (entry->subv_generation != block->subv_generation ||
> +			    entry->reloc_bytenr != block->reloc_bytenr ||
> +			    entry->reloc_generation !=
> +			    block->reloc_generation) {
> +				WARN_ON_ONCE(1);

This would need some explanation of why you want to see the warning. _ONCE
is for the whole lifetime of the module, but I think you'd be interested in
this on a per-filesystem basis. If this is meant for developers it should go
under DEBUG, but otherwise I don't see what this warning would bring to
the users.

> +				ret = -EEXIST;
> +			}
> +			kfree(block);
> +			goto out_unlock;
> +		}
> +	}
> +	rb_link_node(&block->node, parent, p);
> +	rb_insert_color(&block->node, &blocks->blocks[level]);
> +	blocks->swapped = true;
> +out_unlock:
> +	spin_unlock(&blocks->lock);
> +out:
> +	if (ret < 0)
> +		fs_info->qgroup_flags |=
> +			BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
> +	return ret;
> +}
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 226956f0bd73..93d7f0998f03 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -6,6 +6,8 @@
>  #ifndef BTRFS_QGROUP_H
>  #define BTRFS_QGROUP_H
>  
> +#include <linux/spinlock.h>
> +#include <linux/rbtree.h>
>  #include "ulist.h"
>  #include "delayed-ref.h"
>  
> @@ -37,6 +39,66 @@
>   *    Normally at qgroup rescan and transaction commit time.
>   */
>  
> +/*
> + * Special performance hack for balance.

Is it really a hack? Hack sounds cool, but isn't it just an optimization?

> + *
> + * For balance, we need to swap subtree of file and reloc tree.
> + * In theory, we need to trace all subtree blocks of both file and reloc tree,
> + * since their owner has changed during such swap.
> + *
> + * However since balance has ensured that both subtrees are containing the
> + * same contents and have the same tree structures, such swap won't cause
> + * qgroup number change.
> + *
> + * But there is a race window between subtree swap and transaction commit,
> + * during that window, if we increase/decrease tree level or merge/split tree
> + * blocks, we still needs to trace original subtrees.
> + *
> + * So for balance, we use a delayed subtree trace, whose workflow is:
> + *
> + * 1) Record the subtree root block get swapped.
> + *
> + *    During subtree swap:
> + *    O = Old tree blocks
> + *    N = New tree blocks
> + *          reloc tree                         file tree X
> + *             Root                               Root
> + *            /    \                             /    \
> + *          NA     OB                          OA      OB
> + *        /  |     |  \                      /  |      |  \
> + *      NC  ND     OE  OF                   OC  OD     OE  OF
> + *
> + *   In these case, NA and OA is going to be swapped, record (NA, OA) into
> + *   file tree X.
> + *
> + * 2) After subtree swap.
> + *          reloc tree                         file tree X
> + *             Root                               Root
> + *            /    \                             /    \
> + *          OA     OB                          NA      OB
> + *        /  |     |  \                      /  |      |  \
> + *      OC  OD     OE  OF                   NC  ND     OE  OF
> + *
> + * 3a) CoW happens for OB
> + *     If we are going to CoW tree block OB, we check OB's bytenr against
> + *     tree X's swapped_blocks structure.
> + *     It doesn't fit any one, nothing will happen.
> + *
> + * 3b) CoW happens for NA
> + *     Check NA's bytenr against tree X's swapped_blocks, and get a hit.
> + *     Then we do subtree scan on both subtree OA and NA.
> + *     Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
> + *
> + *     Then no matter what we do to file tree X, qgroup numbers will
> + *     still be correct.
> + *     Then NA's record get removed from X's swapped_blocks.
> + *
> + * 4)  Transaction commit
> + *     Any record in X's swapped_blocks get removed, since there is no
> + *     modification to swapped subtrees, no need to trigger heavy qgroup
> + *     subtree rescan for them.
> + */
> +
>  /*
>   * Record a dirty extent, and info qgroup to update quota on it
>   * TODO: Use kmem cache to alloc it.
> @@ -59,6 +121,24 @@ struct btrfs_qgroup_extent_record {
>  	struct ulist *old_roots;
>  };
>  
> +struct btrfs_qgroup_swapped_block {
> +	struct rb_node node;
> +
> +	bool trace_leaf;
> +	int level;

Please reorder that so the int goes before the bool, otherwise this adds
unnecessary padding.

> +
> +	/* bytenr/generation of the tree block in subvolume tree after swap */
> +	u64 subv_bytenr;
> +	u64 subv_generation;
> +
> +	/* bytenr/generation of the tree block in reloc tree after swap */
> +	u64 reloc_bytenr;
> +	u64 reloc_generation;
> +
> +	u64 last_snapshot;
> +	struct btrfs_key first_key;
> +};
> +
>  /*
>   * Qgroup reservation types:
>   *
> @@ -301,4 +381,23 @@ void btrfs_qgroup_convert_reserved_meta(struct btrfs_root *root, int num_bytes);
>  
>  void btrfs_qgroup_check_reserved_leak(struct inode *inode);
>  
> +/* btrfs_qgroup_swapped_blocks related functions */
> +static inline void btrfs_qgroup_init_swapped_blocks(

'inline' is not needed for a function that's called infrequently and
only from one location; a normal exported function would be ok.

> +		struct btrfs_qgroup_swapped_blocks *swapped_blocks)
> +{
> +	int i;
> +
> +	spin_lock_init(&swapped_blocks->lock);
> +	for (i = 0; i < BTRFS_MAX_LEVEL; i++)
> +		swapped_blocks->blocks[i] = RB_ROOT;
> +	swapped_blocks->swapped = false;
> +}
> +
> +void btrfs_qgroup_clean_swapped_blocks(struct btrfs_root *root);
> +int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
> +		struct btrfs_root *subv_root,
> +		struct btrfs_block_group_cache *bg,
> +		struct extent_buffer *subv_parent, int subv_slot,
> +		struct extent_buffer *reloc_parent, int reloc_slot,
> +		u64 last_snapshot);

Please keep an empty line before the final #endif

>  #endif

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 6/7] btrfs: qgroup: Use delayed subtree rescan for balance
  2019-01-15  8:16 ` [PATCH v4 6/7] btrfs: qgroup: Use delayed subtree rescan for balance Qu Wenruo
@ 2019-01-22 17:05   ` David Sterba
  0 siblings, 0 replies; 24+ messages in thread
From: David Sterba @ 2019-01-22 17:05 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Jan 15, 2019 at 04:16:03PM +0800, Qu Wenruo wrote:
> +int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
> +					 struct btrfs_root *root,
> +					 struct extent_buffer *subv_eb)
> +{
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct btrfs_qgroup_swapped_blocks *blocks = &root->swapped_blocks;
> +	struct btrfs_qgroup_swapped_block *block;
> +	struct extent_buffer *reloc_eb = NULL;
> +	struct rb_node *n;

Please don't use a single-letter variable here

> +	bool found = false;
> +	bool swapped = false;
> +	int level = btrfs_header_level(subv_eb);
> +	int ret = 0;
> +	int i;
> +
> +	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +		return 0;
> +	if (!is_fstree(root->root_key.objectid) || !root->reloc_root)
> +		return 0;
> +
> +	spin_lock(&blocks->lock);
> +	if (!blocks->swapped) {
> +		spin_unlock(&blocks->lock);
> +		goto out;

I think you can return here directly, as this is a shortcut and there's
no state that would need to be cleaned up (ret is still 0).
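
Ie. something like (untested):

	spin_lock(&blocks->lock);
	if (!blocks->swapped) {
		spin_unlock(&blocks->lock);
		return 0;
	}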

> +	}
> +	n = blocks->blocks[level].rb_node;
> +
> +	while (n) {
> +		block = rb_entry(n, struct btrfs_qgroup_swapped_block, node);
> +		if (block->subv_bytenr < subv_eb->start)
> +			n = n->rb_left;
> +		else if (block->subv_bytenr > subv_eb->start)
> +			n = n->rb_right;
> +		else {
> +			found = true;
> +			break;
> +		}
> +	}
> +	if (!found) {
> +		spin_unlock(&blocks->lock);
> +		goto out;
> +	}
> +	/* Found one, remove it from @blocks first and update blocks->swapped */
> +	rb_erase(&block->node, &blocks->blocks[level]);
> +	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
> +		if (RB_EMPTY_ROOT(&blocks->blocks[i])) {
> +			swapped = true;
> +			break;
> +		}
> +	}
> +	blocks->swapped = swapped;
> +	spin_unlock(&blocks->lock);
> +
> +	/* Read out reloc subtree root */
> +	reloc_eb = read_tree_block(fs_info, block->reloc_bytenr,
> +				   block->reloc_generation, block->level,
> +				   &block->first_key);
> +	if (IS_ERR(reloc_eb)) {
> +		ret = PTR_ERR(subv_eb);
> +		reloc_eb = NULL;
> +		goto free_out;
> +	}
> +	if (!extent_buffer_uptodate(reloc_eb)) {
> +		ret = -EIO;
> +		goto free_out;
> +	}
> +
> +	ret = qgroup_trace_subtree_swap(trans, reloc_eb, subv_eb,
> +			block->last_snapshot, block->trace_leaf);
> +free_out:
> +	kfree(block);
> +	free_extent_buffer(reloc_eb);
> +out:
> +	if (ret < 0) {
> +		btrfs_err_rl(fs_info,
> +			     "failed to account subtree at bytenr %llu: %d",
> +			     subv_eb->start, ret);
> +		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
> +	}
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 5/7] btrfs: qgroup: Introduce per-root swapped blocks infrastructure
  2019-01-22 16:55   ` David Sterba
@ 2019-01-22 23:07     ` Qu Wenruo
  2019-01-24 18:44       ` David Sterba
  0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2019-01-22 23:07 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs





On 2019/1/23 上午12:55, David Sterba wrote:
> On Tue, Jan 15, 2019 at 04:16:02PM +0800, Qu Wenruo wrote:
<snip>
>> +
>> +	block = kmalloc(sizeof(*block), GFP_NOFS);
>> +	if (!block) {
>> +		ret = -ENOMEM;
> 
> This goes to 'out' that marks quota as inconsistent, is this
> intentional?

Yes

> Ie. if this function does not succed for whatever reason,
> the quotas are indeed inconsistent.

If we hit this ENOMEM case, although it's possible that a later CoW doesn't
touch this subtree, it's also possible that a later CoW touches it and makes
the qgroup numbers inconsistent.

So we should mark qgroup inconsistent to cover the possibility of
inconsistent numbers.

[snip]
> 
>> +	while (*p) {
>> +		struct btrfs_qgroup_swapped_block *entry;
>> +
>> +		parent = *p;
>> +		entry = rb_entry(parent, struct btrfs_qgroup_swapped_block,
>> +				 node);
>> +
>> +		if (entry->subv_bytenr < block->subv_bytenr)
>> +			p = &(*p)->rb_left;
>> +		else if (entry->subv_bytenr > block->subv_bytenr)
>> +			p = &(*p)->rb_right;
>> +		else {
> 
> I've pointed out missing { } around if/else if/else statemtents in patch
> 1, would be good if you fix all patches.

Sorry, it's sometimes hard to apply the same comment to older branches.
Normally I'll pay more attention to newly written code, but it's not
that easy to spot this in older code.

> 
>> +			if (entry->subv_generation != block->subv_generation ||
>> +			    entry->reloc_bytenr != block->reloc_bytenr ||
>> +			    entry->reloc_generation !=
>> +			    block->reloc_generation) {
>> +				WARN_ON_ONCE(1);
> 
> This would need some explanation why you want to see the warning.

The case here is, I don't want to flood the whole dmesg with tons of
possible backtraces from this call site, but I still want users to report
such an error.

So I chose _ONCE() to prompt the user to report the error, as I don't
really expect it to happen.

> _ONCE
> is for the whole lifetime of the module, but I think you'd be interested
> in this on a per-filesystem basis. If this is meant for developers it
> should go under DEBUG, but otherwise I don't see what this warning would
> bring to the users.

OK, I'll put it under DEBUG and change it from _ONCE() to a regular WARN_ON().
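
Something like this (an untested sketch of what I have in mind):

			if (entry->subv_generation != block->subv_generation ||
			    entry->reloc_bytenr != block->reloc_bytenr ||
			    entry->reloc_generation != block->reloc_generation) {
#ifdef CONFIG_BTRFS_DEBUG
				/* Conflicting swapped block records should never happen */
				WARN_ON(1);
#endif
				/* existing handling of the conflict stays as-is */
			}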

[snip]
>>  
>> +/*
>> + * Special performance hack for balance.
> 
> Is it really a hack? Hack sounds cool, but isn't it just an optimization?

It's indeed an optimization.
(But 'hack' sounds really cool.)


>>  
>> +struct btrfs_qgroup_swapped_block {
>> +	struct rb_node node;
>> +
>> +	bool trace_leaf;
>> +	int level;
> 
> Please reorder that so the int goes before the bool, otherwise this adds
> unnecessary padding.

Can we make the compiler do this work?

Thanks,
Qu

[snip]



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead
  2019-01-22 16:28 ` David Sterba
@ 2019-01-23  5:43   ` Qu Wenruo
  0 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2019-01-23  5:43 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs





On 2019/1/23 12:28 AM, David Sterba wrote:
> On Tue, Jan 15, 2019 at 04:15:57PM +0800, Qu Wenruo wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree
>>
>> Which is based on v5.0-rc1.
>>
>> This patch address the heavy load subtree scan, but delaying it until
>> we're going to modify the swapped tree block.
> 
> I'd like to merge this patchset; the overall logic is fine, but I've
> found lots of coding style issues, too many for me to fix just by myself.
> 
>> v4:
>> - Renaming members from "file_*" to "subv_*".
>>   A member name like "file_bytenr" is pretty confusing; renaming it to
>>   "subv_bytenr" avoids the confusion.
> 
> The naming change is incomplete: there are many more references to 'file
> tree' in comments and changelogs, and the naming should be unified
> everywhere. Also, I'd prefer 'subvol' even though it's a bit longer
> than 'subv'; it seems to be the most common form.

I have to admit this 'subv' naming is completely my fault.

I tried to keep the new name the same length as the original 'file' so
that no re-formatting would be needed at all.
But it turns out that trick doesn't work at all.

I'll get rid of that bad habit.

Thanks,
Qu

> 
> I'll comment the patches with the rest.
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 3/7] btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots()
  2019-01-22 16:32   ` David Sterba
@ 2019-01-23  6:01     ` Qu Wenruo
  0 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2019-01-23  6:01 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs





On 2019/1/23 12:32 AM, David Sterba wrote:
> On Tue, Jan 15, 2019 at 04:16:00PM +0800, Qu Wenruo wrote:
>> And to co-operate this, also delayed btrfs_drop_snapshot() call on reloc
>> tree, btrfs_drop_snapshot() call will also be delayed to
>> clean_dirty_subvs().
> 
> Can you please rephrase this paragraph?
> 
>> This patch will increase the size of btrfs_root by 16 bytes.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
>> +static int clean_dirty_subvs(struct reloc_control *rc)
>> +{
>> +	struct btrfs_root *root;
>> +	struct btrfs_root *next;
>> +	int err = 0;
>> +	int ret;
>> +
>> +	list_for_each_entry_safe(root, next, &rc->dirty_subv_roots,
>> +				 reloc_dirty_list) {
>> +		struct btrfs_root *reloc_root = root->reloc_root;
>> +
>> +		clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
>> +		list_del_init(&root->reloc_dirty_list);
>> +		root->reloc_root = NULL;
>> +		if (reloc_root) {
>> +			ret = btrfs_drop_snapshot(reloc_root, NULL, 0, 1);
>> +			if (ret < 0 && !err)
>> +				err = ret;
>> +		}
>> +		btrfs_put_fs_root(root);
>> +	}
>> +	return err;
> 
> Please don't use the err/ret style; use 'ret' so it matches the function
> return type, and ret2 etc. for the temporary return values.

For this policy, the primary objective is to avoid confusion between
err and ret, right?

Then I'd prefer tmp_ret over ret2, although that's just personal taste.
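
Either way, the helper would then look roughly like this (an untested
sketch using the ret/ret2 convention):

static int clean_dirty_subvs(struct reloc_control *rc)
{
	struct btrfs_root *root;
	struct btrfs_root *next;
	int ret = 0;
	int ret2;

	list_for_each_entry_safe(root, next, &rc->dirty_subv_roots,
				 reloc_dirty_list) {
		struct btrfs_root *reloc_root = root->reloc_root;

		clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
		list_del_init(&root->reloc_dirty_list);
		root->reloc_root = NULL;
		if (reloc_root) {
			ret2 = btrfs_drop_snapshot(reloc_root, NULL, 0, 1);
			/* Record the first error, but keep cleaning up */
			if (ret2 < 0 && !ret)
				ret = ret2;
		}
		btrfs_put_fs_root(root);
	}
	return ret;
}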

Thanks,
Qu

> 
>> +}



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 5/7] btrfs: qgroup: Introduce per-root swapped blocks infrastructure
  2019-01-22 23:07     ` Qu Wenruo
@ 2019-01-24 18:44       ` David Sterba
  0 siblings, 0 replies; 24+ messages in thread
From: David Sterba @ 2019-01-24 18:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Wed, Jan 23, 2019 at 07:07:50AM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/1/23 12:55 AM, David Sterba wrote:
> > On Tue, Jan 15, 2019 at 04:16:02PM +0800, Qu Wenruo wrote:
> <snip>
> >> +
> >> +	block = kmalloc(sizeof(*block), GFP_NOFS);
> >> +	if (!block) {
> >> +		ret = -ENOMEM;
> > 
> > This goes to 'out', which marks the quota as inconsistent. Is this
> > intentional?
> 
> Yes
> 
> > I.e. if this function does not succeed for whatever reason,
> > the quotas are indeed inconsistent.
> 
> If we hit this ENOMEM case, a later CoW may not touch this subtree, but
> it may also touch it and make the qgroup numbers inconsistent.
>
> So we should mark the qgroup status inconsistent to cover that
> possibility.

I see, thanks.

> [snip]
> > 
> >> +	while (*p) {
> >> +		struct btrfs_qgroup_swapped_block *entry;
> >> +
> >> +		parent = *p;
> >> +		entry = rb_entry(parent, struct btrfs_qgroup_swapped_block,
> >> +				 node);
> >> +
> >> +		if (entry->subv_bytenr < block->subv_bytenr)
> >> +			p = &(*p)->rb_left;
> >> +		else if (entry->subv_bytenr > block->subv_bytenr)
> >> +			p = &(*p)->rb_right;
> >> +		else {
> > 
> > I've pointed out missing { } around if/else if/else statements in patch
> > 1, it would be good if you fix all patches.
> 
> Sorry, it's sometimes hard to apply the same comment to older branches.
> Normally I pay more attention to newly written code, but it's not as
> easy to spot the same issue in older code.
> 
> > 
> >> +			if (entry->subv_generation != block->subv_generation ||
> >> +			    entry->reloc_bytenr != block->reloc_bytenr ||
> >> +			    entry->reloc_generation !=
> >> +			    block->reloc_generation) {
> >> +				WARN_ON_ONCE(1);
> > 
> > This would need some explanation why you want to see the warning.
> 
> The case here is, I don't want to flood the whole dmesg with tons of
> possible backtraces from this call site, but I still want users to report
> such an error.
>
> So I chose _ONCE() to prompt the user to report the error, as I don't
> really expect it to happen.
> 
> > _ONCE
> > is for the whole lifetime of the module, but I think you'd be interested
> > in this on a per-filesystem basis. If this is meant for developers it
> > should go under DEBUG, but otherwise I don't see what this warning would
> > bring to the users.
> 
> OK, I'll put it under DEBUG and change it from _ONCE() to a regular WARN_ON().

Works for me.

> [snip]
> >>  
> >> +/*
> >> + * Special performance hack for balance.
> > 
> > Is it really a hack? Hack sounds cool, but isn't it just an optimization?
> 
> It's indeed an optimization.
> (But 'hack' sounds really cool.)
> 
> 
> >>  
> >> +struct btrfs_qgroup_swapped_block {
> >> +	struct rb_node node;
> >> +
> >> +	bool trace_leaf;
> >> +	int level;
> > 
> > Please reorder that so the int goes before the bool, otherwise this adds
> > unnecessary padding.
> 
> Can we make the compiler do this work?

No, the compiler must keep the order of definition and may add padding to
satisfy the alignment constraints. It can warn though (-Wpadded), but it's
too noisy, so you'd have to diff the warnings to see whether your code adds
new padding. For any new structure or new member I use the tool
'pahole', which prints the detailed layout.
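
For example, simply swapping the two members closes the hole (a sketch;
the remaining members stay as they are):

struct btrfs_qgroup_swapped_block {
	struct rb_node node;

	int level;
	bool trace_leaf;

	/* ... rest of the members as before ... */
};

With debug info enabled, something like
'pahole -C btrfs_qgroup_swapped_block fs/btrfs/qgroup.o' then prints the
resulting member offsets, the structure size and any remaining holes.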

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-01-24 18:45 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-15  8:15 [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Qu Wenruo
2019-01-15  8:15 ` [PATCH v4 1/7] btrfs: qgroup: Move reserved data account from btrfs_delayed_ref_head to btrfs_qgroup_extent_record Qu Wenruo
2019-01-15  8:15 ` [PATCH v4 2/7] btrfs: qgroup: Don't trigger backref walk at delayed ref insert time Qu Wenruo
2019-01-15  8:16 ` [PATCH v4 3/7] btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots() Qu Wenruo
2019-01-22 16:32   ` David Sterba
2019-01-23  6:01     ` Qu Wenruo
2019-01-15  8:16 ` [PATCH v4 4/7] btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swap() Qu Wenruo
2019-01-22 16:38   ` David Sterba
2019-01-15  8:16 ` [PATCH v4 5/7] btrfs: qgroup: Introduce per-root swapped blocks infrastructure Qu Wenruo
2019-01-22 16:55   ` David Sterba
2019-01-22 23:07     ` Qu Wenruo
2019-01-24 18:44       ` David Sterba
2019-01-15  8:16 ` [PATCH v4 6/7] btrfs: qgroup: Use delayed subtree rescan for balance Qu Wenruo
2019-01-22 17:05   ` David Sterba
2019-01-15  8:16 ` [PATCH v4 7/7] btrfs: qgroup: Cleanup old subtree swap code Qu Wenruo
2019-01-15 17:26 ` [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead Josef Bacik
2019-01-16  0:31   ` Qu Wenruo
2019-01-16  1:07     ` Qu Wenruo
2019-01-16  1:15       ` Josef Bacik
2019-01-16  1:29         ` Qu Wenruo
2019-01-16  1:34           ` Josef Bacik
2019-01-16  1:36             ` Qu Wenruo
2019-01-22 16:28 ` David Sterba
2019-01-23  5:43   ` Qu Wenruo
