All of lore.kernel.org
* [PATCH v5 0/8] A variety of lock contention fixes
@ 2020-12-18 19:24 Josef Bacik
  2020-12-18 19:24 ` [PATCH v5 1/8] btrfs: do not block on deleted bgs mutex in the cleaner Josef Bacik
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Josef Bacik @ 2020-12-18 19:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

v4->v5:
- Added "btrfs: remove bogus BUG_ON in alloc_reserved_tree_block", as Nikolay
  pointed out I needed to explain why we no longer needed one of the delayed ref
  flushes, which led me down the rabbit hole of trying to understand why it
  wasn't a problem anymore.  Turned out the BUG_ON() is bogus.
- Added "btrfs: move delayed ref flushing for qgroup into qgroup helper".
  Instead of removing the flushing for qgroups completely, we still sort of
  need it, even though it's actually still broken, so I've moved it into the
  qgroup helper.
- Added Nikolay's rb for the last patch.

v3->v4:
- I accidentally sent out the v1 version of these patches, because I had fixed
  them on another machine.  This is the proper set with the changes from v2 that
  are properly rebased onto misc-next.

v2->v3:
- Added Nikolay's reviewed by for the second patch.
- Rebased onto the latest misc-next.

v1->v2:
- Fixed the log messages that Nikolay pointed out.
- Added Nikolay's reviewed by for the first patch.
- Removed the unneeded mb for flushing.

--- Original email ---
Hello,

I've been running some stress tests recently to try to reproduce some
problems I've tripped over in relocation.  Most of this series is a reposting of
patches I wrote when debugging related issues for Zygo that got lost.  I've
updated one of them to reduce lock contention even further, to the point that I
have to ramp up my stress test loops because it now finishes far too fast.  Thanks,

Josef

Josef Bacik (8):
  btrfs: do not block on deleted bgs mutex in the cleaner
  btrfs: only let one thread pre-flush delayed refs in commit
  btrfs: delayed refs pre-flushing should only run the heads we have
  btrfs: only run delayed refs once before committing
  btrfs: move delayed ref flushing for qgroup into qgroup helper
  btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
  btrfs: stop running all delayed refs during snapshot
  btrfs: run delayed refs less often in commit_cowonly_roots

 fs/btrfs/block-group.c | 11 +++--
 fs/btrfs/delayed-ref.h | 12 +++---
 fs/btrfs/extent-tree.c |  3 +-
 fs/btrfs/transaction.c | 91 +++++++++++++++++++++---------------------
 4 files changed, 60 insertions(+), 57 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v5 1/8] btrfs: do not block on deleted bgs mutex in the cleaner
  2020-12-18 19:24 [PATCH v5 0/8] A variety of lock contention fixes Josef Bacik
@ 2020-12-18 19:24 ` Josef Bacik
  2020-12-18 19:24 ` [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit Josef Bacik
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2020-12-18 19:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Nikolay Borisov

While running some stress tests I started getting hung task messages.
This is because the unused block group deletion code has to take the
delete_unused_bgs_mutex to do its work, and balance takes the same mutex
to make sure we don't delete block groups while we're balancing.

The problem is that a balance can take a while, so we were getting hung
task warnings.  We don't need to block to run this work, and the cleaner
is needed for other work, so trylock this mutex and just bail if we
can't acquire it right away.
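The trylock-and-bail shape of the fix can be sketched in userspace; the mutex name mirrors the kernel one, but everything below is an illustrative stand-in, not the btrfs code itself:

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative stand-in for fs_info->delete_unused_bgs_mutex. */
static pthread_mutex_t delete_unused_bgs_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 if the deletion work ran, 0 if we bailed out early. */
int delete_unused_bgs(void)
{
	/*
	 * pthread_mutex_trylock() returns 0 on success (the inverse of
	 * the kernel's mutex_trylock()).  If balance currently holds
	 * the mutex, skip the scan instead of blocking the cleaner.
	 */
	if (pthread_mutex_trylock(&delete_unused_bgs_mutex) != 0)
		return 0;

	/* ... walk the unused block group list and delete entries ... */

	pthread_mutex_unlock(&delete_unused_bgs_mutex);
	return 1;
}
```

With the mutex free the work runs; with balance holding it, the cleaner simply tries again on its next pass instead of triggering hung task warnings.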

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 52f2198d44c9..f61e275bd792 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1254,6 +1254,13 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
 		return;
 
+	/*
+	 * Long running balances can keep us blocked here for eternity, so
+	 * simply skip deletion if we're unable to get the mutex.
+	 */
+	if (!mutex_trylock(&fs_info->delete_unused_bgs_mutex))
+		return;
+
 	spin_lock(&fs_info->unused_bgs_lock);
 	while (!list_empty(&fs_info->unused_bgs)) {
 		int trimming;
@@ -1273,8 +1280,6 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 
 		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
 
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
-
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
 
@@ -1420,11 +1425,11 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 end_trans:
 		btrfs_end_transaction(trans);
 next:
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
 		btrfs_put_block_group(block_group);
 		spin_lock(&fs_info->unused_bgs_lock);
 	}
 	spin_unlock(&fs_info->unused_bgs_lock);
+	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
 	return;
 
 flip_async:
-- 
2.26.2



* [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit
  2020-12-18 19:24 [PATCH v5 0/8] A variety of lock contention fixes Josef Bacik
  2020-12-18 19:24 ` [PATCH v5 1/8] btrfs: do not block on deleted bgs mutex in the cleaner Josef Bacik
@ 2020-12-18 19:24 ` Josef Bacik
  2021-01-08 16:01   ` David Sterba
  2020-12-18 19:24 ` [PATCH v5 3/8] btrfs: delayed refs pre-flushing should only run the heads we have Josef Bacik
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2020-12-18 19:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Nikolay Borisov

I've been running a stress test that runs 20 workers in their own
subvolume, which are running an fsstress instance with 4 threads per
worker, which is 80 total fsstress threads.  In addition to this I'm
running balance in the background as well as creating and deleting
snapshots.  This test takes around 12 hours to run normally, going
slower and slower as the test goes on.

The reason for this is that fsstress runs fsync sometimes, and because
we're messing with block groups we often fall through to
btrfs_commit_transaction, so we will often have 20-30 threads all
calling btrfs_commit_transaction at the same time.

These all get stuck contending on the extent tree while they try to run
delayed refs during the initial part of the commit.

This is suboptimal: because the extent tree is a single point of
contention, we really only want one thread acting on that tree at once
to reduce lock contention.  Fix this by making the flushing mechanism a
bit operation, making it easy to use test_and_set_bit() to ensure only
one task does this initial flush.

Once we're into the transaction commit we only have one thread doing
delayed ref running; it's just this initial pre-flush that is
problematic.  With this patch my stress test takes around 90 minutes to
run, instead of 12 hours.
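The single-flusher idea can be sketched with C11 atomics; atomic_fetch_or() plays the role of the kernel's test_and_set_bit(), and all names here are illustrative stand-ins rather than the real btrfs structures:

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stand-in for the BTRFS_DELAYED_REFS_FLUSHING bit. */
#define DELAYED_REFS_FLUSHING (1UL << 0)

static atomic_ulong delayed_ref_flags;
static int preflush_runs;	/* counts how often the flush actually ran */

static void run_delayed_refs(void)
{
	preflush_runs++;	/* stand-in for btrfs_run_delayed_refs() */
}

/*
 * Every committer calls this, but only the one that actually flips the
 * bit does the pre-flush; the rest skip straight to the remainder of
 * the commit instead of contending on the extent tree.
 */
void commit_preflush(void)
{
	unsigned long old = atomic_fetch_or(&delayed_ref_flags,
					    DELAYED_REFS_FLUSHING);
	if (!(old & DELAYED_REFS_FLUSHING))
		run_delayed_refs();
}
```

However many threads race into the commit, the flush body executes once per transaction, which is exactly what removes the contention.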

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/delayed-ref.h | 12 ++++++------
 fs/btrfs/transaction.c | 33 ++++++++++++++++-----------------
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 1c977e6d45dc..6e414785b56f 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -135,6 +135,11 @@ struct btrfs_delayed_data_ref {
 	u64 offset;
 };
 
+enum btrfs_delayed_ref_flags {
+	/* Used to indicate that we are flushing delayed refs for the commit. */
+	BTRFS_DELAYED_REFS_FLUSHING,
+};
+
 struct btrfs_delayed_ref_root {
 	/* head ref rbtree */
 	struct rb_root_cached href_root;
@@ -158,12 +163,7 @@ struct btrfs_delayed_ref_root {
 
 	u64 pending_csums;
 
-	/*
-	 * set when the tree is flushing before a transaction commit,
-	 * used by the throttling code to decide if new updates need
-	 * to be run right away
-	 */
-	int flushing;
+	unsigned long flags;
 
 	u64 run_delayed_start;
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f51f9e39bcee..ccd37fbe5db1 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -909,9 +909,9 @@ bool btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
 
-	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
-	    cur_trans->delayed_refs.flushing)
+	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
+		     &cur_trans->delayed_refs.flags))
 		return true;
 
 	return should_end_transaction(trans);
@@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	btrfs_trans_release_metadata(trans);
 	trans->block_rsv = NULL;
 
-	/* make a pass through all the delayed refs we have so far
-	 * any runnings procs may add more while we are here
-	 */
-	ret = btrfs_run_delayed_refs(trans, 0);
-	if (ret) {
-		btrfs_end_transaction(trans);
-		return ret;
-	}
-
-	cur_trans = trans->transaction;
-
 	/*
-	 * set the flushing flag so procs in this transaction have to
-	 * start sending their work down.
+	 * We only want one transaction commit doing the flushing so we do not
+	 * waste a bunch of time on lock contention on the extent root node.
 	 */
-	cur_trans->delayed_refs.flushing = 1;
-	smp_wmb();
+	if (!test_and_set_bit(BTRFS_DELAYED_REFS_FLUSHING,
+			      &cur_trans->delayed_refs.flags)) {
+		/*
+		 * make a pass through all the delayed refs we have so far
+		 * any runnings procs may add more while we are here
+		 */
+		ret = btrfs_run_delayed_refs(trans, 0);
+		if (ret) {
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+	}
 
 	btrfs_create_pending_block_groups(trans);
 
-- 
2.26.2



* [PATCH v5 3/8] btrfs: delayed refs pre-flushing should only run the heads we have
  2020-12-18 19:24 [PATCH v5 0/8] A variety of lock contention fixes Josef Bacik
  2020-12-18 19:24 ` [PATCH v5 1/8] btrfs: do not block on deleted bgs mutex in the cleaner Josef Bacik
  2020-12-18 19:24 ` [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit Josef Bacik
@ 2020-12-18 19:24 ` Josef Bacik
  2020-12-18 19:24 ` [PATCH v5 4/8] btrfs: only run delayed refs once before committing Josef Bacik
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2020-12-18 19:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Nikolay Borisov

Previously our delayed ref running used the total number of items as the
count of refs to run.  However, with the delayed_refs_rsv work we
changed that to the number of heads to run, as generally we want to run
all of the operations for one bytenr together.

But with btrfs_run_delayed_refs(trans, 0) we set our count to 2x the
number of items that we have.  This is generally fine, but if some
operation is generating loads of delayed refs while we're doing this
pre-flushing in the transaction commit, we'll just spin forever running
delayed refs.

Fix this by simply using the number of delayed ref heads we currently
have, so we do not end up doing a lot of extra work that is being
generated in other threads.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d79b8369e6aa..b6d774803a2c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2160,7 +2160,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 
 	delayed_refs = &trans->transaction->delayed_refs;
 	if (count == 0)
-		count = atomic_read(&delayed_refs->num_entries) * 2;
+		count = delayed_refs->num_heads_ready;
 
 again:
 #ifdef SCRAMBLE_DELAYED_REFS
-- 
2.26.2



* [PATCH v5 4/8] btrfs: only run delayed refs once before committing
  2020-12-18 19:24 [PATCH v5 0/8] A variety of lock contention fixes Josef Bacik
                   ` (2 preceding siblings ...)
  2020-12-18 19:24 ` [PATCH v5 3/8] btrfs: delayed refs pre-flushing should only run the heads we have Josef Bacik
@ 2020-12-18 19:24 ` Josef Bacik
  2020-12-18 19:24 ` [PATCH v5 5/8] btrfs: move delayed ref flushing for qgroup into qgroup helper Josef Bacik
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2020-12-18 19:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Nikolay Borisov

We try to pre-flush the delayed refs when committing, because we want to
do as little work as possible in the critical section of the transaction
commit.

However doing this twice can lead to very long transaction commit delays
as other threads are allowed to continue to generate more delayed refs,
which potentially delays the commit by multiple minutes in very extreme
cases.

So simply stick to one pre-flush, and then continue the rest of the
transaction commit.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/transaction.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index ccd37fbe5db1..4776e055f7f9 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2062,12 +2062,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	btrfs_create_pending_block_groups(trans);
 
-	ret = btrfs_run_delayed_refs(trans, 0);
-	if (ret) {
-		btrfs_end_transaction(trans);
-		return ret;
-	}
-
 	if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
 		int run_it = 0;
 
-- 
2.26.2



* [PATCH v5 5/8] btrfs: move delayed ref flushing for qgroup into qgroup helper
  2020-12-18 19:24 [PATCH v5 0/8] A variety of lock contention fixes Josef Bacik
                   ` (3 preceding siblings ...)
  2020-12-18 19:24 ` [PATCH v5 4/8] btrfs: only run delayed refs once before committing Josef Bacik
@ 2020-12-18 19:24 ` Josef Bacik
  2020-12-18 19:24 ` [PATCH v5 6/8] btrfs: remove bogus BUG_ON in alloc_reserved_tree_block Josef Bacik
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2020-12-18 19:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The commit d67263354541 ("btrfs: qgroup: Make snapshot accounting work
with new extent-oriented qgroup.") added a flush of the delayed refs
during snapshot creation in order to get the qgroup accounting right.
However this code has since changed and been moved to its own helper
that is skipped if qgroups are turned off.  Move the flushing into that
helper, as we do not need it when qgroups are turned off.

Also add a comment explaining why it exists, and why it doesn't actually
save us.  This will be helpful later when we try to fix qgroup
accounting properly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/transaction.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 4776e055f7f9..5738763c514b 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1433,6 +1433,23 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
 	 */
 	record_root_in_trans(trans, src, 1);
 
+	/*
+	 * btrfs_qgroup_inherit relies on a consistent view of the usage for the
+	 * src root, so we must run the delayed refs here.
+	 *
+	 * However this isn't particularly fool proof, because there's no
+	 * synchronization keeping us from changing the tree after this point
+	 * before we do the qgroup_inherit, or even from making changes while
+	 * we're doing the qgroup_inherit.  But that's a problem for the future,
+	 * for now flush the delayed refs to narrow the race window where the
+	 * qgroup counters could end up wrong.
+	 */
+	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto out;
+	}
+
 	/*
 	 * We are going to commit transaction, see btrfs_commit_transaction()
 	 * comment for reason locking tree_log_mutex
@@ -1686,12 +1703,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
-	if (ret) {
-		btrfs_abort_transaction(trans, ret);
-		goto fail;
-	}
-
 	/*
 	 * Do special qgroup accounting for snapshot, as we do some qgroup
 	 * snapshot hack to do fast snapshot.
-- 
2.26.2



* [PATCH v5 6/8] btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
  2020-12-18 19:24 [PATCH v5 0/8] A variety of lock contention fixes Josef Bacik
                   ` (4 preceding siblings ...)
  2020-12-18 19:24 ` [PATCH v5 5/8] btrfs: move delayed ref flushing for qgroup into qgroup helper Josef Bacik
@ 2020-12-18 19:24 ` Josef Bacik
  2020-12-18 19:24 ` [PATCH v5 7/8] btrfs: stop running all delayed refs during snapshot Josef Bacik
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2020-12-18 19:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The fix 361048f586f5 ("Btrfs: fix full backref problem when inserting
shared block reference") added a delayed ref flush at subvolume
creation time in order to avoid hitting this particular BUG_ON().

Before this fix, we were tripping the BUG_ON() as follows:

1. Modify snapshot A, which creates blocks with a normal reference for
   snapshot A, as A is the owner of these blocks.  We now have delayed
   refs for these blocks.
2. Create a snapshot of A named B, which pushes references for the
   children blocks of the root node for the new root B, thus creating
   more delayed refs for newly allocated blocks.
3. A is modified, and because the metadata blocks can now be shared, it
   must push FULL_BACKREF references to the children of any block that A
   COWs down its path to its target key.
4. Delayed refs are run.  Because these are newly allocated blocks, we
   have ->must_insert_reserved set on the delayed ref head, so we call
   into alloc_reserved_tree_block() to add the extent item, and then add
   our ref.  At the time of this fix, we were ordering FULL_BACKREF
   delayed ref operations first, so we'd go to add this reference and
   then BUG_ON() because we didn't have the FULL_BACKREF flag set.

The patch fixed this problem by making sure we ran the delayed refs
before we had the chance to modify A.  This meant that any *new* blocks
would have had their extent items created _before_ we would ever
actually cow down and generate FULL_BACKREF entries.  Thus the problem
went away.

However this BUG_ON() is actually completely bogus.  The existence of a
full backref doesn't necessarily mean that FULL_BACKREF must be set on
that block; it must only be set on the actual parent itself.  Consider
the example provided above.  If we cow down one path from A, the nodes
on that path are going to have a FULL_BACKREF ref pushed down to _all_
of their children, but not all of those children are going to have
FULL_BACKREF set.  It is completely valid to have an extent item with
normal and full back refs without FULL_BACKREF actually set on the block
itself.

As a final note, I have been testing with the patch

  btrfs: stop running all delayed refs during snapshot

which removed this flushing.  My test was a torture test which did a lot
of operations while snapshotting and deleting snapshots as well as
relocation, and I never tripped this BUG_ON().  This is actually because
at the time of 361048f586f5, we ordered SHARED keys _before_ normal
references, and thus they would get run first.  However currently they
are ordered _after_ normal references, so we'd do the initial creation
without having a shared reference, and thus not hit this BUG_ON(), which
explains why I didn't start hitting this problem during my testing with
my other patch applied.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b6d774803a2c..c4846694ae9c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4516,7 +4516,6 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 	}
 
 	if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
-		BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));
 		btrfs_set_extent_inline_ref_type(leaf, iref,
 						 BTRFS_SHARED_BLOCK_REF_KEY);
 		btrfs_set_extent_inline_ref_offset(leaf, iref, ref->parent);
-- 
2.26.2



* [PATCH v5 7/8] btrfs: stop running all delayed refs during snapshot
  2020-12-18 19:24 [PATCH v5 0/8] A variety of lock contention fixes Josef Bacik
                   ` (5 preceding siblings ...)
  2020-12-18 19:24 ` [PATCH v5 6/8] btrfs: remove bogus BUG_ON in alloc_reserved_tree_block Josef Bacik
@ 2020-12-18 19:24 ` Josef Bacik
  2020-12-18 19:24 ` [PATCH v5 8/8] btrfs: run delayed refs less often in commit_cowonly_roots Josef Bacik
  2021-01-26 17:51 ` [PATCH v5 0/8] A variety of lock contention fixes David Sterba
  8 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2020-12-18 19:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

This was added in commit 361048f586f5 ("Btrfs: fix full backref problem
when inserting shared block reference") to address a problem where we
hit the following BUG_ON() in alloc_reserved_tree_block

        if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
                BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));

However this BUG_ON() is bogus, and was removed by

  btrfs: remove bogus BUG_ON in alloc_reserved_tree_block

We no longer need to run delayed refs because of this, and can remove
this flushing here.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/transaction.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 5738763c514b..b3c9da5833db 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1750,12 +1750,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
-	if (ret) {
-		btrfs_abort_transaction(trans, ret);
-		goto fail;
-	}
-
 fail:
 	pending->error = ret;
 dir_item_existed:
-- 
2.26.2



* [PATCH v5 8/8] btrfs: run delayed refs less often in commit_cowonly_roots
  2020-12-18 19:24 [PATCH v5 0/8] A variety of lock contention fixes Josef Bacik
                   ` (6 preceding siblings ...)
  2020-12-18 19:24 ` [PATCH v5 7/8] btrfs: stop running all delayed refs during snapshot Josef Bacik
@ 2020-12-18 19:24 ` Josef Bacik
  2021-01-26 17:51 ` [PATCH v5 0/8] A variety of lock contention fixes David Sterba
  8 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2020-12-18 19:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Nikolay Borisov

We love running delayed refs in commit_cowonly_roots, but it is a bit
excessive.  I was seeing cases of running 3 or 4 refs a few times in a
row during this time.  Instead simply update all of the roots first,
then run delayed refs, then handle the empty block groups case, and then
if we have any more dirty roots do the whole thing again.  This allows
us to be much more efficient with our delayed ref running, as we can
batch a few more operations at once.
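The reworked ordering can be condensed into a sketch; every helper below is a trivial stand-in for the btrfs function of the same name, and the counters just model how root updates, block group writes, and delayed ref runs feed each other:

```c
#include <assert.h>
#include <stdbool.h>

static int dirty_roots;   /* stand-in for fs_info->dirty_cowonly_roots */
static int dirty_bgs;     /* stand-in for the dirty block group list */
static int ref_flushes;   /* how many times we ran delayed refs */

static void update_cowonly_root(void) { dirty_roots--; }
static void write_dirty_block_groups(void) { dirty_bgs = 0; }

static void run_delayed_refs(void)
{
	static bool dirtied_once;

	ref_flushes++;
	/* Running refs can dirty more roots; model a single such round. */
	if (!dirtied_once) {
		dirtied_once = true;
		dirty_roots = 1;
	}
}

/*
 * New order: update every dirty root first, flush the refs they
 * generated in one batch, handle dirty block groups (which can generate
 * more refs), then repeat if anything got re-dirtied.
 */
void commit_cowonly_roots_sketch(void)
{
again:
	while (dirty_roots > 0)
		update_cowonly_root();

	/* One batched flush instead of one flush per root update. */
	run_delayed_refs();

	while (dirty_bgs > 0) {
		write_dirty_block_groups();
		run_delayed_refs();
	}

	if (dirty_roots > 0)
		goto again;
}
```

With five dirty roots and one dirty block group this model runs three flushes in total, instead of one flush per root update.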

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/transaction.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index b3c9da5833db..8a6b38c01277 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1227,10 +1227,6 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans)
 	btrfs_tree_unlock(eb);
 	free_extent_buffer(eb);
 
-	if (ret)
-		return ret;
-
-	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
 	if (ret)
 		return ret;
 
@@ -1248,10 +1244,6 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans)
 	if (ret)
 		return ret;
 
-	/* run_qgroups might have added some more refs */
-	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
-	if (ret)
-		return ret;
 again:
 	while (!list_empty(&fs_info->dirty_cowonly_roots)) {
 		struct btrfs_root *root;
@@ -1266,15 +1258,24 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans)
 		ret = update_cowonly_root(trans, root);
 		if (ret)
 			return ret;
-		ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
-		if (ret)
-			return ret;
 	}
 
+	/* Now flush any delayed refs generated by updating all of the roots. */
+	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
+	if (ret)
+		return ret;
+
 	while (!list_empty(dirty_bgs) || !list_empty(io_bgs)) {
 		ret = btrfs_write_dirty_block_groups(trans);
 		if (ret)
 			return ret;
+
+		/*
+		 * We're writing the dirty block groups, which could generate
+		 * delayed refs, which could generate more dirty block groups,
+		 * so we want to keep this flushing in this loop to make sure
+		 * everything gets run.
+		 */
 		ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
 		if (ret)
 			return ret;
-- 
2.26.2



* Re: [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit
  2020-12-18 19:24 ` [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit Josef Bacik
@ 2021-01-08 16:01   ` David Sterba
  2021-01-11  8:33     ` Nikolay Borisov
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2021-01-08 16:01 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, Nikolay Borisov

On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
> I've been running a stress test that runs 20 workers in their own
> subvolume, which are running an fsstress instance with 4 threads per
> worker, which is 80 total fsstress threads.  In addition to this I'm
> running balance in the background as well as creating and deleting
> snapshots.  This test takes around 12 hours to run normally, going
> slower and slower as the test goes on.
> 
> The reason for this is that fsstress runs fsync sometimes, and because
> we're messing with block groups we often fall through to
> btrfs_commit_transaction, so we will often have 20-30 threads all
> calling btrfs_commit_transaction at the same time.
> 
> These all get stuck contending on the extent tree while they try to run
> delayed refs during the initial part of the commit.
> 
> This is suboptimal: because the extent tree is a single point of
> contention, we really only want one thread acting on that tree at once
> to reduce lock contention.  Fix this by making the flushing mechanism a
> bit operation, making it easy to use test_and_set_bit() to ensure only
> one task does this initial flush.
> 
> Once we're into the transaction commit we only have one thread doing
> delayed ref running; it's just this initial pre-flush that is
> problematic.  With this patch my stress test takes around 90 minutes to
> run, instead of 12 hours.
> 
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/delayed-ref.h | 12 ++++++------
>  fs/btrfs/transaction.c | 33 ++++++++++++++++-----------------
>  2 files changed, 22 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 1c977e6d45dc..6e414785b56f 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -135,6 +135,11 @@ struct btrfs_delayed_data_ref {
>  	u64 offset;
>  };
>  
> +enum btrfs_delayed_ref_flags {
> +	/* Used to indicate that we are flushing delayed refs for the commit. */
> +	BTRFS_DELAYED_REFS_FLUSHING,
> +};
> +
>  struct btrfs_delayed_ref_root {
>  	/* head ref rbtree */
>  	struct rb_root_cached href_root;
> @@ -158,12 +163,7 @@ struct btrfs_delayed_ref_root {
>  
>  	u64 pending_csums;
>  
> -	/*
> -	 * set when the tree is flushing before a transaction commit,
> -	 * used by the throttling code to decide if new updates need
> -	 * to be run right away
> -	 */
> -	int flushing;
> +	unsigned long flags;
>  
>  	u64 run_delayed_start;
>  
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index f51f9e39bcee..ccd37fbe5db1 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -909,9 +909,9 @@ bool btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
>  {
>  	struct btrfs_transaction *cur_trans = trans->transaction;
>  
> -	smp_mb();
>  	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
> -	    cur_trans->delayed_refs.flushing)
> +	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
> +		     &cur_trans->delayed_refs.flags))
>  		return true;
>  
>  	return should_end_transaction(trans);
> @@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  	btrfs_trans_release_metadata(trans);
>  	trans->block_rsv = NULL;
>  
> -	/* make a pass through all the delayed refs we have so far
> -	 * any runnings procs may add more while we are here
> -	 */
> -	ret = btrfs_run_delayed_refs(trans, 0);
> -	if (ret) {
> -		btrfs_end_transaction(trans);
> -		return ret;
> -	}
> -
> -	cur_trans = trans->transaction;
> -
>  	/*
> -	 * set the flushing flag so procs in this transaction have to
> -	 * start sending their work down.
> +	 * We only want one transaction commit doing the flushing so we do not
> +	 * waste a bunch of time on lock contention on the extent root node.
>  	 */
> -	cur_trans->delayed_refs.flushing = 1;
> -	smp_wmb();

This barrier obviously separates the flushing = 1 from the rest of the
code, now implemented as test_and_set_bit, which implies a full barrier.

However, hunk in btrfs_should_end_transaction removes the barrier and
I'm not sure whether this is correct:

-	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
-	    cur_trans->delayed_refs.flushing)
+	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
+		     &cur_trans->delayed_refs.flags))
 		return true;

This is never called under locks, so we don't have complete
synchronization of either the transaction state or the flushing bit.
btrfs_should_end_transaction is merely a hint and is not called in
critical places, so we could probably afford to keep it without a
barrier, or keep it with comment(s).
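The two sides of this discussion can be sketched with C11 atomics; all names are illustrative stand-ins, and the point is only the ordering contract each side gets:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DELAYED_REFS_FLUSHING (1UL << 0)

static atomic_ulong delayed_ref_flags;

/*
 * Writer side: atomic_fetch_or() with the default seq_cst ordering acts
 * as a full barrier, mirroring the kernel guarantee that a successful
 * test_and_set_bit() implies a full memory barrier.
 */
bool start_flushing(void)
{
	unsigned long old = atomic_fetch_or(&delayed_ref_flags,
					    DELAYED_REFS_FLUSHING);
	return !(old & DELAYED_REFS_FLUSHING);
}

/*
 * Reader side: a relaxed load mirrors a plain test_bit() with the old
 * smp_mb() gone.  It is not ordered against surrounding reads (e.g. of
 * the transaction state), so the result can only serve as a hint, which
 * is the concession arrived at for btrfs_should_end_transaction().
 */
bool refs_flushing_hint(void)
{
	return atomic_load_explicit(&delayed_ref_flags,
				    memory_order_relaxed) &
	       DELAYED_REFS_FLUSHING;
}
```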


* Re: [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit
  2021-01-08 16:01   ` David Sterba
@ 2021-01-11  8:33     ` Nikolay Borisov
  2021-01-11 21:50       ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Nikolay Borisov @ 2021-01-11  8:33 UTC (permalink / raw)
  To: dsterba, Josef Bacik, linux-btrfs, kernel-team



On 8.01.21 г. 18:01 ч., David Sterba wrote:
> On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
>> I've been running a stress test that runs 20 workers in their own
>> subvolume, which are running an fsstress instance with 4 threads per
>> worker, which is 80 total fsstress threads.  In addition to this I'm
>> running balance in the background as well as creating and deleting
>> snapshots.  This test takes around 12 hours to run normally, going
>> slower and slower as the test goes on.
>>
>> The reason for this is that fsstress runs fsync sometimes, and because
>> we're messing with block groups we often fall through to
>> btrfs_commit_transaction, so we will often have 20-30 threads all
>> calling btrfs_commit_transaction at the same time.
>>
>> These all get stuck contending on the extent tree while they try to run
>> delayed refs during the initial part of the commit.
>>
>> This is suboptimal; because the extent tree is a single point of
>> failure, we only want one thread acting on that tree at once to reduce
>> lock contention.  Fix this by making the flushing mechanism a bit
>> operation, so that test_and_set_bit() can easily be used to make
>> sure only one task does this initial flush.
>>
>> Once we're into the transaction commit we only have one thread doing
>> delayed ref running, it's just this initial pre-flush that is
>> problematic.  With this patch my stress test takes around 90 minutes to
>> run, instead of 12 hours.
>>
>> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>  fs/btrfs/delayed-ref.h | 12 ++++++------
>>  fs/btrfs/transaction.c | 33 ++++++++++++++++-----------------
>>  2 files changed, 22 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
>> index 1c977e6d45dc..6e414785b56f 100644
>> --- a/fs/btrfs/delayed-ref.h
>> +++ b/fs/btrfs/delayed-ref.h
>> @@ -135,6 +135,11 @@ struct btrfs_delayed_data_ref {
>>  	u64 offset;
>>  };
>>  
>> +enum btrfs_delayed_ref_flags {
>> +	/* Used to indicate that we are flushing delayed refs for the commit. */
>> +	BTRFS_DELAYED_REFS_FLUSHING,
>> +};
>> +
>>  struct btrfs_delayed_ref_root {
>>  	/* head ref rbtree */
>>  	struct rb_root_cached href_root;
>> @@ -158,12 +163,7 @@ struct btrfs_delayed_ref_root {
>>  
>>  	u64 pending_csums;
>>  
>> -	/*
>> -	 * set when the tree is flushing before a transaction commit,
>> -	 * used by the throttling code to decide if new updates need
>> -	 * to be run right away
>> -	 */
>> -	int flushing;
>> +	unsigned long flags;
>>  
>>  	u64 run_delayed_start;
>>  
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index f51f9e39bcee..ccd37fbe5db1 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -909,9 +909,9 @@ bool btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
>>  {
>>  	struct btrfs_transaction *cur_trans = trans->transaction;
>>  
>> -	smp_mb();
>>  	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
>> -	    cur_trans->delayed_refs.flushing)
>> +	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
>> +		     &cur_trans->delayed_refs.flags))
>>  		return true;
>>  
>>  	return should_end_transaction(trans);
>> @@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>  	btrfs_trans_release_metadata(trans);
>>  	trans->block_rsv = NULL;
>>  
>> -	/* make a pass through all the delayed refs we have so far
>> -	 * any runnings procs may add more while we are here
>> -	 */
>> -	ret = btrfs_run_delayed_refs(trans, 0);
>> -	if (ret) {
>> -		btrfs_end_transaction(trans);
>> -		return ret;
>> -	}
>> -
>> -	cur_trans = trans->transaction;
>> -
>>  	/*
>> -	 * set the flushing flag so procs in this transaction have to
>> -	 * start sending their work down.
>> +	 * We only want one transaction commit doing the flushing so we do not
>> +	 * waste a bunch of time on lock contention on the extent root node.
>>  	 */
>> -	cur_trans->delayed_refs.flushing = 1;
>> -	smp_wmb();
> 
> This barrier obviously separates the flushing = 1 and the rest of the
> code, now implemented as test_and_set_bit, which implies a full barrier.
> 
> However, hunk in btrfs_should_end_transaction removes the barrier and
> I'm not sure whether this is correct:
> 
> -	smp_mb();
>  	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
> -	    cur_trans->delayed_refs.flushing)
> +	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
> +		     &cur_trans->delayed_refs.flags))
>  		return true;
> 
> This is never called under locks, so we don't have complete
> synchronization of either the transaction state or the flushing bit.
> btrfs_should_end_transaction is merely a hint and not called in critical
> places, so we could probably afford to keep it without a barrier, or
> keep it with comment(s).

I think the point is moot in this case, because the test_bit() call
either sees the flag or it doesn't. It's not possible for the flag to
be set AND should_end_transaction to return false; that would be a
gross violation of program correctness.

> 


* Re: [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit
  2021-01-11  8:33     ` Nikolay Borisov
@ 2021-01-11 21:50       ` David Sterba
  2021-01-12  9:17         ` Nikolay Borisov
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2021-01-11 21:50 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: dsterba, Josef Bacik, linux-btrfs, kernel-team

On Mon, Jan 11, 2021 at 10:33:42AM +0200, Nikolay Borisov wrote:
> On 8.01.21 г. 18:01 ч., David Sterba wrote:
> > On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
> >> @@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> >>  	btrfs_trans_release_metadata(trans);
> >>  	trans->block_rsv = NULL;
> >>  
> >> -	/* make a pass through all the delayed refs we have so far
> >> -	 * any runnings procs may add more while we are here
> >> -	 */
> >> -	ret = btrfs_run_delayed_refs(trans, 0);
> >> -	if (ret) {
> >> -		btrfs_end_transaction(trans);
> >> -		return ret;
> >> -	}
> >> -
> >> -	cur_trans = trans->transaction;
> >> -
> >>  	/*
> >> -	 * set the flushing flag so procs in this transaction have to
> >> -	 * start sending their work down.
> >> +	 * We only want one transaction commit doing the flushing so we do not
> >> +	 * waste a bunch of time on lock contention on the extent root node.
> >>  	 */
> >> -	cur_trans->delayed_refs.flushing = 1;
> >> -	smp_wmb();
> > 
> > This barrier obviously separates the flushing = 1 and the rest of the
> > code, now implemented as test_and_set_bit, which implies a full barrier.
> > 
> > However, hunk in btrfs_should_end_transaction removes the barrier and
> > I'm not sure whether this is correct:
> > 
> > -	smp_mb();
> >  	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
> > -	    cur_trans->delayed_refs.flushing)
> > +	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
> > +		     &cur_trans->delayed_refs.flags))
> >  		return true;
> > 
> > This is never called under locks, so we don't have complete
> > synchronization of either the transaction state or the flushing bit.
> > btrfs_should_end_transaction is merely a hint and not called in critical
> > places, so we could probably afford to keep it without a barrier, or
> > keep it with comment(s).
> 
> I think the point is moot in this case, because the test_bit() call
> either sees the flag or it doesn't. It's not possible for the flag to
> be set AND should_end_transaction to return false; that would be a
> gross violation of program correctness.

So that's for the flushing part, but what about cur_trans->state?


* Re: [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit
  2021-01-11 21:50       ` David Sterba
@ 2021-01-12  9:17         ` Nikolay Borisov
  2021-01-26 17:36           ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Nikolay Borisov @ 2021-01-12  9:17 UTC (permalink / raw)
  To: dsterba, Nikolay Borisov, Josef Bacik, linux-btrfs, kernel-team



On 11.01.21 г. 23:50 ч., David Sterba wrote:
> On Mon, Jan 11, 2021 at 10:33:42AM +0200, Nikolay Borisov wrote:
>> On 8.01.21 г. 18:01 ч., David Sterba wrote:
>>> On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
>>>> @@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>>  	btrfs_trans_release_metadata(trans);
>>>>  	trans->block_rsv = NULL;
>>>>  
>>>> -	/* make a pass through all the delayed refs we have so far
>>>> -	 * any runnings procs may add more while we are here
>>>> -	 */
>>>> -	ret = btrfs_run_delayed_refs(trans, 0);
>>>> -	if (ret) {
>>>> -		btrfs_end_transaction(trans);
>>>> -		return ret;
>>>> -	}
>>>> -
>>>> -	cur_trans = trans->transaction;
>>>> -
>>>>  	/*
>>>> -	 * set the flushing flag so procs in this transaction have to
>>>> -	 * start sending their work down.
>>>> +	 * We only want one transaction commit doing the flushing so we do not
>>>> +	 * waste a bunch of time on lock contention on the extent root node.
>>>>  	 */
>>>> -	cur_trans->delayed_refs.flushing = 1;
>>>> -	smp_wmb();
>>>
>>> This barrier obviously separates the flushing = 1 and the rest of the
>>> code, now implemented as test_and_set_bit, which implies a full barrier.
>>>
>>> However, hunk in btrfs_should_end_transaction removes the barrier and
>>> I'm not sure whether this is correct:
>>>
>>> -	smp_mb();
>>>  	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
>>> -	    cur_trans->delayed_refs.flushing)
>>> +	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
>>> +		     &cur_trans->delayed_refs.flags))
>>>  		return true;
>>>
>>> This is never called under locks, so we don't have complete
>>> synchronization of either the transaction state or the flushing bit.
>>> btrfs_should_end_transaction is merely a hint and not called in critical
>>> places, so we could probably afford to keep it without a barrier, or
>>> keep it with comment(s).
>>
>> I think the point is moot in this case, because the test_bit() call
>> either sees the flag or it doesn't. It's not possible for the flag to
>> be set AND should_end_transaction to return false; that would be a
>> gross violation of program correctness.
> 
> So that's for the flushing part, but what about cur_trans->state?

Looking at the code, the barrier was there to order the publishing of
the delayed_ref.flushing (now replaced by the bit flag) against
surrounding code.

So independently of this patch, let's reason about trans state. In
should_end_transaction it's read without holding any locks. (U)

It's modified in btrfs_cleanup_transaction without holding the
fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.

set in cleanup_transaction under fs_info->trans_lock (L)
set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
set in btrfs_commit_trans to COMMIT_UNBLOCK under fs_info->trans_lock.(L)

set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
point the transaction is finished and fs_info->running_trans is NULL (U
but irrelevant).

So by the looks of it we can have a read racing with a concurrent
write, due to reads not taking a lock. In this case what we want to
ensure is that we see either the new or the old state. I consulted with
Will Deacon and he said that in such a case we'd want to annotate the
accesses to ->state with (READ|WRITE)_ONCE so as to avoid a theoretical
tear. In this case I don't think this could happen, but I imagine at
some point KCSAN would flag such an access as racy (which it is).

> 


* Re: [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit
  2021-01-12  9:17         ` Nikolay Borisov
@ 2021-01-26 17:36           ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2021-01-26 17:36 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: dsterba, Josef Bacik, linux-btrfs, kernel-team

On Tue, Jan 12, 2021 at 11:17:45AM +0200, Nikolay Borisov wrote:
> 
> 
> On 11.01.21 г. 23:50 ч., David Sterba wrote:
> > On Mon, Jan 11, 2021 at 10:33:42AM +0200, Nikolay Borisov wrote:
> >> On 8.01.21 г. 18:01 ч., David Sterba wrote:
> >>> On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
> >>>> @@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> >>>>  	btrfs_trans_release_metadata(trans);
> >>>>  	trans->block_rsv = NULL;
> >>>>  
> >>>> -	/* make a pass through all the delayed refs we have so far
> >>>> -	 * any runnings procs may add more while we are here
> >>>> -	 */
> >>>> -	ret = btrfs_run_delayed_refs(trans, 0);
> >>>> -	if (ret) {
> >>>> -		btrfs_end_transaction(trans);
> >>>> -		return ret;
> >>>> -	}
> >>>> -
> >>>> -	cur_trans = trans->transaction;
> >>>> -
> >>>>  	/*
> >>>> -	 * set the flushing flag so procs in this transaction have to
> >>>> -	 * start sending their work down.
> >>>> +	 * We only want one transaction commit doing the flushing so we do not
> >>>> +	 * waste a bunch of time on lock contention on the extent root node.
> >>>>  	 */
> >>>> -	cur_trans->delayed_refs.flushing = 1;
> >>>> -	smp_wmb();
> >>>
> >>> This barrier obviously separates the flushing = 1 and the rest of the
> >>> code, now implemented as test_and_set_bit, which implies a full barrier.
> >>>
> >>> However, hunk in btrfs_should_end_transaction removes the barrier and
> >>> I'm not sure whether this is correct:
> >>>
> >>> -	smp_mb();
> >>>  	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
> >>> -	    cur_trans->delayed_refs.flushing)
> >>> +	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
> >>> +		     &cur_trans->delayed_refs.flags))
> >>>  		return true;
> >>>
> >>> This is never called under locks, so we don't have complete
> >>> synchronization of either the transaction state or the flushing bit.
> >>> btrfs_should_end_transaction is merely a hint and not called in critical
> >>> places, so we could probably afford to keep it without a barrier, or
> >>> keep it with comment(s).
> >>
> >> I think the point is moot in this case, because the test_bit() call
> >> either sees the flag or it doesn't. It's not possible for the flag to
> >> be set AND should_end_transaction to return false; that would be a
> >> gross violation of program correctness.
> > 
> > So that's for the flushing part, but what about cur_trans->state?
> 
> Looking at the code, the barrier was there to order the publishing of
> the delayed_ref.flushing (now replaced by the bit flag) against
> surrounding code.
> 
> So independently of this patch, let's reason about trans state. In
> should_end_transaction it's read without holding any locks. (U)
> 
> It's modified in btrfs_cleanup_transaction without holding the
> fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.
> 
> set in cleanup_transaction under fs_info->trans_lock (L)
> set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
> set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
> set in btrfs_commit_trans to COMMIT_UNBLOCK under fs_info->trans_lock.(L)
> 
> set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
> point the transaction is finished and fs_info->running_trans is NULL (U
> but irrelevant).
> 
> So by the looks of it we can have a read racing with a concurrent
> write, due to reads not taking a lock. In this case what we want to
> ensure is that we see either the new or the old state. I consulted with
> Will Deacon and he said that in such a case we'd want to annotate the
> accesses to ->state with (READ|WRITE)_ONCE so as to avoid a theoretical
> tear. In this case I don't think this could happen, but I imagine at
> some point KCSAN would flag such an access as racy (which it is).

Thanks for the analysis, I've copied it to the changelog as there's
probably no shorter way to explain it.


* Re: [PATCH v5 0/8] A variety of lock contention fixes
  2020-12-18 19:24 [PATCH v5 0/8] A variety of lock contention fixes Josef Bacik
                   ` (7 preceding siblings ...)
  2020-12-18 19:24 ` [PATCH v5 8/8] btrfs: run delayed refs less often in commit_cowonly_roots Josef Bacik
@ 2021-01-26 17:51 ` David Sterba
  8 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2021-01-26 17:51 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Dec 18, 2020 at 02:24:18PM -0500, Josef Bacik wrote:
> v4->v5:
> - Added "btrfs: remove bogus BUG_ON in alloc_reserved_tree_block", as Nikolay
>   pointed out I needed to explain why we no longer needed one of the delayed ref
>   flushes, which led me down the rabbit hole of trying to understand why it
>   wasn't a problem anymore.  Turned out the BUG_ON() is bogus.
> - Added "btrfs: move delayed ref flushing for qgroup into qgroup helper",
>   instead of removing the flushing for qgroups completely, we still sort of need
>   it, even though it's actually still broken, so I've moved it into the qgroup
>   helper.
> - Added Nikolay's rb for the last patch.

Moved from topic branch to misc-next, thanks.


end of thread, other threads:[~2021-01-27 10:14 UTC | newest]

Thread overview: 15+ messages
2020-12-18 19:24 [PATCH v5 0/8] A variety of lock contention fixes Josef Bacik
2020-12-18 19:24 ` [PATCH v5 1/8] btrfs: do not block on deleted bgs mutex in the cleaner Josef Bacik
2020-12-18 19:24 ` [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit Josef Bacik
2021-01-08 16:01   ` David Sterba
2021-01-11  8:33     ` Nikolay Borisov
2021-01-11 21:50       ` David Sterba
2021-01-12  9:17         ` Nikolay Borisov
2021-01-26 17:36           ` David Sterba
2020-12-18 19:24 ` [PATCH v5 3/8] btrfs: delayed refs pre-flushing should only run the heads we have Josef Bacik
2020-12-18 19:24 ` [PATCH v5 4/8] btrfs: only run delayed refs once before committing Josef Bacik
2020-12-18 19:24 ` [PATCH v5 5/8] btrfs: move delayed ref flushing for qgroup into qgroup helper Josef Bacik
2020-12-18 19:24 ` [PATCH v5 6/8] btrfs: remove bogus BUG_ON in alloc_reserved_tree_block Josef Bacik
2020-12-18 19:24 ` [PATCH v5 7/8] btrfs: stop running all delayed refs during snapshot Josef Bacik
2020-12-18 19:24 ` [PATCH v5 8/8] btrfs: run delayed refs less often in commit_cowonly_roots Josef Bacik
2021-01-26 17:51 ` [PATCH v5 0/8] A variety of lock contention fixes David Sterba
