v4->v5:
- Added "btrfs: remove bogus BUG_ON in alloc_reserved_tree_block", as
  Nikolay pointed out I needed to explain why we no longer needed one of
  the delayed ref flushes, which led me down the rabbit hole of trying
  to understand why it wasn't a problem anymore.  Turned out the
  BUG_ON() is bogus.
- Added "btrfs: move delayed ref flushing for qgroup into qgroup
  helper".  Instead of removing the flushing for qgroups completely, we
  still sort of need it, even though it's actually still broken, so I've
  moved it into the qgroup helper.
- Added Nikolay's rb for the last patch.

v3->v4:
- I accidentally sent out the v1 version of these patches, because I
  had fixed them on another machine.  This is the proper set with the
  changes from v2, properly rebased onto misc-next.

v2->v3:
- Added Nikolay's reviewed-by for the second patch.
- Rebased onto the latest misc-next.

v1->v2:
- Fixed the log messages that Nikolay pointed out.
- Added Nikolay's reviewed-by for the first patch.
- Removed the unneeded mb for flushing.

--- Original email ---

Hello,

I've been running some stress tests recently in order to try and
reproduce some problems I've tripped over in relocation.  Most of this
series is a reposting of patches I wrote when debugging related issues
for Zygo that got lost.  I've updated one of them to reduce the lock
contention even further, so much so that I have to ramp up my stress
test loops because it now finishes way too fast.
Thanks,

Josef

Josef Bacik (8):
  btrfs: do not block on deleted bgs mutex in the cleaner
  btrfs: only let one thread pre-flush delayed refs in commit
  btrfs: delayed refs pre-flushing should only run the heads we have
  btrfs: only run delayed refs once before committing
  btrfs: move delayed ref flushing for qgroup into qgroup helper
  btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
  btrfs: stop running all delayed refs during snapshot
  btrfs: run delayed refs less often in commit_cowonly_roots

 fs/btrfs/block-group.c | 11 +++--
 fs/btrfs/delayed-ref.h | 12 +++---
 fs/btrfs/extent-tree.c |  3 +-
 fs/btrfs/transaction.c | 91 +++++++++++++++++++++---------------------
 4 files changed, 60 insertions(+), 57 deletions(-)

-- 
2.26.2
While running some stress tests I started getting hung task messages.
This is because the unused block group deletion code has to take the
delete_unused_bgs_mutex to do its work, which is taken by balance to
make sure we don't delete block groups while we're balancing.  The
problem is that a balance can take a while, and so we were getting hung
task warnings.

We don't need to block here to run this work, and the cleaner is needed
to do other work, so trylock on this mutex and just bail if we can't
acquire it right away.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 52f2198d44c9..f61e275bd792 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1254,6 +1254,13 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
 		return;
 
+	/*
+	 * Long running balances can keep us blocked here for eternity, so
+	 * simply skip deletion if we're unable to get the mutex.
+	 */
+	if (!mutex_trylock(&fs_info->delete_unused_bgs_mutex))
+		return;
+
 	spin_lock(&fs_info->unused_bgs_lock);
 	while (!list_empty(&fs_info->unused_bgs)) {
 		int trimming;
@@ -1273,8 +1280,6 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		btrfs_discard_cancel_work(&fs_info->discard_ctl,
 					  block_group);
 
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
-
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
@@ -1420,11 +1425,11 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 end_trans:
 		btrfs_end_transaction(trans);
 next:
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
 		btrfs_put_block_group(block_group);
 		spin_lock(&fs_info->unused_bgs_lock);
 	}
 	spin_unlock(&fs_info->unused_bgs_lock);
+	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
 	return;
 
 flip_async:
-- 
2.26.2
I've been running a stress test that runs 20 workers in their own
subvolume, which are running an fsstress instance with 4 threads per
worker, which is 80 total fsstress threads.  In addition to this I'm
running balance in the background as well as creating and deleting
snapshots.  This test takes around 12 hours to run normally, going
slower and slower as the test goes on.

The reason for this is that fsstress is running fsync sometimes, and
because we're messing with block groups we often fall through to
btrfs_commit_transaction, so we will often have 20-30 threads all
calling btrfs_commit_transaction at the same time.

These all get stuck contending on the extent tree while they try to run
delayed refs during the initial part of the commit.

This is suboptimal: because the extent tree is a single point of
failure, we only want one thread acting on that tree at once, to reduce
lock contention.  Fix this by making the flushing mechanism a bit
operation, so it's easy to use test_and_set_bit() to make sure only one
task does this initial flush.

Once we're into the transaction commit, only one thread does the
delayed ref running; it's just this initial pre-flush that is
problematic.  With this patch my stress test takes around 90 minutes to
run, instead of 12 hours.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/delayed-ref.h | 12 ++++++------
 fs/btrfs/transaction.c | 33 ++++++++++++++++-----------------
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 1c977e6d45dc..6e414785b56f 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -135,6 +135,11 @@ struct btrfs_delayed_data_ref {
 	u64 offset;
 };
 
+enum btrfs_delayed_ref_flags {
+	/* Used to indicate that we are flushing delayed refs for the commit. */
+	BTRFS_DELAYED_REFS_FLUSHING,
+};
+
 struct btrfs_delayed_ref_root {
 	/* head ref rbtree */
 	struct rb_root_cached href_root;
@@ -158,12 +163,7 @@ struct btrfs_delayed_ref_root {
 
 	u64 pending_csums;
 
-	/*
-	 * set when the tree is flushing before a transaction commit,
-	 * used by the throttling code to decide if new updates need
-	 * to be run right away
-	 */
-	int flushing;
+	unsigned long flags;
 
 	u64 run_delayed_start;
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f51f9e39bcee..ccd37fbe5db1 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -909,9 +909,9 @@ bool btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
 
-	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
-	    cur_trans->delayed_refs.flushing)
+	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
+		     &cur_trans->delayed_refs.flags))
 		return true;
 
 	return should_end_transaction(trans);
@@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	btrfs_trans_release_metadata(trans);
 	trans->block_rsv = NULL;
 
-	/* make a pass through all the delayed refs we have so far
-	 * any runnings procs may add more while we are here
-	 */
-	ret = btrfs_run_delayed_refs(trans, 0);
-	if (ret) {
-		btrfs_end_transaction(trans);
-		return ret;
-	}
-
-	cur_trans = trans->transaction;
-
 	/*
-	 * set the flushing flag so procs in this transaction have to
-	 * start sending their work down.
+	 * We only want one transaction commit doing the flushing so we do not
+	 * waste a bunch of time on lock contention on the extent root node.
 	 */
-	cur_trans->delayed_refs.flushing = 1;
-	smp_wmb();
+	if (!test_and_set_bit(BTRFS_DELAYED_REFS_FLUSHING,
+			      &cur_trans->delayed_refs.flags)) {
+		/*
+		 * Make a pass through all the delayed refs we have so far.
+		 * Any running procs may add more while we are here.
+		 */
+		ret = btrfs_run_delayed_refs(trans, 0);
+		if (ret) {
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+	}
 
 	btrfs_create_pending_block_groups(trans);
-- 
2.26.2
Previously our delayed ref running used the total number of items as the
count of items to run.  However we changed that to the number of heads
to run with the delayed_refs_rsv, as generally we want to run all of the
operations for one bytenr.

But with btrfs_run_delayed_refs(trans, 0) we set our count to 2x the
number of items that we have.  This is generally fine, but if some
operation generates loads of delayed refs while we're doing this
pre-flushing in the transaction commit, we'll just spin forever doing
delayed refs.  Fix this to simply pick the number of delayed ref heads
we currently have; that way we do not end up doing a lot of extra work
that's being generated in other threads.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d79b8369e6aa..b6d774803a2c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2160,7 +2160,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 
 	delayed_refs = &trans->transaction->delayed_refs;
 	if (count == 0)
-		count = atomic_read(&delayed_refs->num_entries) * 2;
+		count = delayed_refs->num_heads_ready;
 
 again:
 #ifdef SCRAMBLE_DELAYED_REFS
-- 
2.26.2
We try to pre-flush the delayed refs when committing, because we want to
do as little work as possible in the critical section of the transaction
commit.  However doing this twice can lead to very long transaction
commit delays, as other threads are allowed to continue to generate more
delayed refs, which potentially delays the commit by multiple minutes in
very extreme cases.  So simply stick to one pre-flush, and then continue
the rest of the transaction commit.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/transaction.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index ccd37fbe5db1..4776e055f7f9 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2062,12 +2062,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	btrfs_create_pending_block_groups(trans);
 
-	ret = btrfs_run_delayed_refs(trans, 0);
-	if (ret) {
-		btrfs_end_transaction(trans);
-		return ret;
-	}
-
 	if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
 		int run_it = 0;
-- 
2.26.2
Commit d67263354541 ("btrfs: qgroup: Make snapshot accounting work with
new extent-oriented qgroup.") added a flush of the delayed refs during
snapshot creation in order to get the qgroup accounting right.  However
this code has changed and been moved to its own helper that is skipped
if qgroups are turned off.  Move the flushing into that helper, as we do
not need it when qgroups are turned off.

Also add a comment explaining why it exists, and why it doesn't actually
save us.  This will be helpful later when we try to fix qgroup
accounting properly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/transaction.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 4776e055f7f9..5738763c514b 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1433,6 +1433,23 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
 	 */
 	record_root_in_trans(trans, src, 1);
 
+	/*
+	 * btrfs_qgroup_inherit relies on a consistent view of the usage for
+	 * the src root, so we must run the delayed refs here.
+	 *
+	 * However this isn't particularly fool proof, because there's no
+	 * synchronization keeping us from changing the tree after this point
+	 * before we do the qgroup_inherit, or even from making changes while
+	 * we're doing the qgroup_inherit.  But that's a problem for the
+	 * future, for now flush the delayed refs to narrow the race window
+	 * where the qgroup counters could end up wrong.
+	 */
+	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto out;
+	}
+
 	/*
 	 * We are going to commit transaction, see btrfs_commit_transaction()
 	 * comment for reason locking tree_log_mutex
@@ -1686,12 +1703,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
-	if (ret) {
-		btrfs_abort_transaction(trans, ret);
-		goto fail;
-	}
-
 	/*
 	 * Do special qgroup accounting for snapshot, as we do some qgroup
 	 * snapshot hack to do fast snapshot.
-- 
2.26.2
Commit 361048f586f5 ("Btrfs: fix full backref problem when inserting
shared block reference") added a delayed ref flush at subvolume creation
time in order to avoid hitting this particular BUG_ON().

Before this fix, we were tripping the BUG_ON() by

1. Modify snapshot A, which creates blocks with a normal reference for
   snapshot A, as A is the owner of these blocks.  We now have delayed
   refs for these blocks.
2. Create a snapshot of A named B, which pushes references for the
   children blocks of the root node for the new root B, thus creating
   more delayed refs for newly allocated blocks.
3. A is modified, and because the metadata blocks can now be shared, it
   must push FULL_BACKREF references to the children of any block that A
   cows down its path to its target key.
4. Delayed refs are run.  Because these are newly allocated blocks, we
   have ->must_insert_reserved set on the delayed ref head, so we call
   into alloc_reserved_tree_block() to add the extent item, and then add
   our ref.  At the time of this fix, we ordered FULL_BACKREF delayed
   ref operations first, so we'd go to add this reference and then
   BUG_ON() because we didn't have the FULL_BACKREF flag set.

The patch fixed this problem by making sure we ran the delayed refs
before we had the chance to modify A.  This meant that any *new* blocks
would have had their extent items created _before_ we would ever
actually cow down and generate FULL_BACKREF entries.  Thus the problem
went away.

However this BUG_ON() is actually completely bogus.  The existence of a
full backref doesn't necessarily mean that FULL_BACKREF must be set on
that block; it must only be set on the actual parent itself.  Consider
the example provided above.  If we cow down one path from A, any nodes
on that path are going to have a FULL_BACKREF ref pushed down to _all_
of their children, but not all of the children are going to have
FULL_BACKREF set.  It is completely valid to have an extent item with
normal and full backrefs without FULL_BACKREF actually set on the block
itself.

As a final note, I have been testing with the patch

  btrfs: stop running all delayed refs during snapshot

which removed this flushing.  My test was a torture test which did a lot
of operations while snapshotting and deleting snapshots as well as
relocation, and I never tripped this BUG_ON().  This is actually because
at the time of 361048f586f5, we ordered SHARED keys _before_ normal
references, and thus they would get run first.  However currently they
are ordered _after_ normal references, so we'd do the initial creation
without having a shared reference, and thus not hit this BUG_ON(), which
explains why I didn't start hitting this problem during my testing with
my other patch applied.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b6d774803a2c..c4846694ae9c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4516,7 +4516,6 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 	}
 
 	if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
-		BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));
 		btrfs_set_extent_inline_ref_type(leaf, iref,
 						 BTRFS_SHARED_BLOCK_REF_KEY);
 		btrfs_set_extent_inline_ref_offset(leaf, iref, ref->parent);
-- 
2.26.2
This was added in commit 361048f586f5 ("Btrfs: fix full backref problem
when inserting shared block reference") to address a problem where we
hit the following BUG_ON() in alloc_reserved_tree_block:

	if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
		BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));

However this BUG_ON() is bogus, and was removed by

  btrfs: remove bogus BUG_ON in alloc_reserved_tree_block

We no longer need to run delayed refs because of this, and can remove
this flushing here.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/transaction.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 5738763c514b..b3c9da5833db 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1750,12 +1750,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
-	if (ret) {
-		btrfs_abort_transaction(trans, ret);
-		goto fail;
-	}
-
 fail:
 	pending->error = ret;
 dir_item_existed:
-- 
2.26.2
We love running delayed refs in commit_cowonly_roots, but it is a bit
excessive.  I was seeing cases of running 3 or 4 refs a few times in a
row during this time.  Instead, simply update all of the roots first,
then run delayed refs, then handle the empty block groups case, and then
if we have any more dirty roots do the whole thing again.  This allows
us to be much more efficient with our delayed ref running, as we can
batch a few more operations at once.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/transaction.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index b3c9da5833db..8a6b38c01277 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1227,10 +1227,6 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans)
 	btrfs_tree_unlock(eb);
 	free_extent_buffer(eb);
 
-	if (ret)
-		return ret;
-
-	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
 	if (ret)
 		return ret;
 
@@ -1248,10 +1244,6 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans)
 	if (ret)
 		return ret;
 
-	/* run_qgroups might have added some more refs */
-	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
-	if (ret)
-		return ret;
 again:
 	while (!list_empty(&fs_info->dirty_cowonly_roots)) {
 		struct btrfs_root *root;
@@ -1266,15 +1258,24 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans)
 		ret = update_cowonly_root(trans, root);
 		if (ret)
 			return ret;
-		ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
-		if (ret)
-			return ret;
 	}
 
+	/* Now flush any delayed refs generated by updating all of the roots. */
+	ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
+	if (ret)
+		return ret;
+
 	while (!list_empty(dirty_bgs) || !list_empty(io_bgs)) {
 		ret = btrfs_write_dirty_block_groups(trans);
 		if (ret)
 			return ret;
+
+		/*
+		 * We're writing the dirty block groups, which could generate
+		 * delayed refs, which could generate more dirty block groups,
+		 * so we want to keep this flushing in this loop to make sure
+		 * everything gets run.
+		 */
 		ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);
 		if (ret)
 			return ret;
-- 
2.26.2
On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
> I've been running a stress test that runs 20 workers in their own
> subvolume, which are running an fsstress instance with 4 threads per
> worker, which is 80 total fsstress threads. In addition to this I'm
> running balance in the background as well as creating and deleting
> snapshots. This test takes around 12 hours to run normally, going
> slower and slower as the test goes on.
>
> The reason for this is because fsstress is running fsync sometimes, and
> because we're messing with block groups we often fall through to
> btrfs_commit_transaction, so will often have 20-30 threads all calling
> btrfs_commit_transaction at the same time.
>
> These all get stuck contending on the extent tree while they try to run
> delayed refs during the initial part of the commit.
>
> This is suboptimal, really because the extent tree is a single point of
> failure we only want one thread acting on that tree at once to reduce
> lock contention. Fix this by making the flushing mechanism a bit
> operation, to make it easy to use test_and_set_bit() in order to make
> sure only one task does this initial flush.
>
> Once we're into the transaction commit we only have one thread doing
> delayed ref running, it's just this initial pre-flush that is
> problematic. With this patch my stress test takes around 90 minutes to
> run, instead of 12 hours.
>
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/btrfs/delayed-ref.h | 12 ++++++------
> fs/btrfs/transaction.c | 33 ++++++++++++++++-----------------
> 2 files changed, 22 insertions(+), 23 deletions(-)
>
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 1c977e6d45dc..6e414785b56f 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -135,6 +135,11 @@ struct btrfs_delayed_data_ref {
> u64 offset;
> };
>
> +enum btrfs_delayed_ref_flags {
> + /* Used to indicate that we are flushing delayed refs for the commit. */
> + BTRFS_DELAYED_REFS_FLUSHING,
> +};
> +
> struct btrfs_delayed_ref_root {
> /* head ref rbtree */
> struct rb_root_cached href_root;
> @@ -158,12 +163,7 @@ struct btrfs_delayed_ref_root {
>
> u64 pending_csums;
>
> - /*
> - * set when the tree is flushing before a transaction commit,
> - * used by the throttling code to decide if new updates need
> - * to be run right away
> - */
> - int flushing;
> + unsigned long flags;
>
> u64 run_delayed_start;
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index f51f9e39bcee..ccd37fbe5db1 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -909,9 +909,9 @@ bool btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
> {
> struct btrfs_transaction *cur_trans = trans->transaction;
>
> - smp_mb();
> if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
> - cur_trans->delayed_refs.flushing)
> + test_bit(BTRFS_DELAYED_REFS_FLUSHING,
> + &cur_trans->delayed_refs.flags))
> return true;
>
> return should_end_transaction(trans);
> @@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> btrfs_trans_release_metadata(trans);
> trans->block_rsv = NULL;
>
> - /* make a pass through all the delayed refs we have so far
> - * any runnings procs may add more while we are here
> - */
> - ret = btrfs_run_delayed_refs(trans, 0);
> - if (ret) {
> - btrfs_end_transaction(trans);
> - return ret;
> - }
> -
> - cur_trans = trans->transaction;
> -
> /*
> - * set the flushing flag so procs in this transaction have to
> - * start sending their work down.
> + * We only want one transaction commit doing the flushing so we do not
> + * waste a bunch of time on lock contention on the extent root node.
> */
> - cur_trans->delayed_refs.flushing = 1;
> - smp_wmb();
This barrier obviously separates the flushing = 1 and the rest of the
code, now implemented as test_and_set_bit, which implies full barrier.
However, the hunk in btrfs_should_end_transaction removes the barrier and
I'm not sure whether this is correct:
- smp_mb();
if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
- cur_trans->delayed_refs.flushing)
+ test_bit(BTRFS_DELAYED_REFS_FLUSHING,
+ &cur_trans->delayed_refs.flags))
return true;
This is never called under locks so we don't have complete
synchronization of either the transaction state or the flushing bit.
btrfs_should_end_transaction is merely a hint and not called in critical
places so we could probably afford to keep it without a barrier, or keep
it with comment(s).
On 8.01.21 г. 18:01 ч., David Sterba wrote:
> On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
[ quoted commit message and patch trimmed ]
>
> This barrier obviously separates the flushing = 1 and the rest of the
> code, now implemented as test_and_set_bit, which implies full barrier.
>
> However, the hunk in btrfs_should_end_transaction removes the barrier
> and I'm not sure whether this is correct:
>
> -	smp_mb();
> 	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
> -	    cur_trans->delayed_refs.flushing)
> +	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
> +		     &cur_trans->delayed_refs.flags))
> 		return true;
>
> This is never called under locks so we don't have complete
> synchronization of either the transaction state or the flushing bit.
> btrfs_should_end_transaction is merely a hint and not called in critical
> places so we could probably afford to keep it without a barrier, or keep
> it with comment(s).

I think the point is moot in this case, because the test_bit either sees
the flag or it doesn't.  It's not possible for the flag to be set AND
should_end_transaction to return false; that would be a gross violation
of program correctness.
On Mon, Jan 11, 2021 at 10:33:42AM +0200, Nikolay Borisov wrote:
> On 8.01.21 г. 18:01 ч., David Sterba wrote:
> > On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
> >> @@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> >> btrfs_trans_release_metadata(trans);
> >> trans->block_rsv = NULL;
> >>
> >> - /* make a pass through all the delayed refs we have so far
> >> - * any runnings procs may add more while we are here
> >> - */
> >> - ret = btrfs_run_delayed_refs(trans, 0);
> >> - if (ret) {
> >> - btrfs_end_transaction(trans);
> >> - return ret;
> >> - }
> >> -
> >> - cur_trans = trans->transaction;
> >> -
> >> /*
> >> - * set the flushing flag so procs in this transaction have to
> >> - * start sending their work down.
> >> + * We only want one transaction commit doing the flushing so we do not
> >> + * waste a bunch of time on lock contention on the extent root node.
> >> */
> >> - cur_trans->delayed_refs.flushing = 1;
> >> - smp_wmb();
> >
> > This barrier obviously separates the flushing = 1 and the rest of the
> > code, now implemented as test_and_set_bit, which implies full barrier.
> >
> > However, hunk in btrfs_should_end_transaction removes the barrier and
> > I'm not sure whether this is correct:
> >
> > - smp_mb();
> > if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
> > - cur_trans->delayed_refs.flushing)
> > + test_bit(BTRFS_DELAYED_REFS_FLUSHING,
> > + &cur_trans->delayed_refs.flags))
> > return true;
> >
> > This is never called under locks so we don't have complete
> > synchronization of neither the transaction state nor the flushing bit.
> > btrfs_should_end_transaction is merely a hint and not called in critical
> > places so we could probably afford to keep it without a barrier, or keep
> > it with comment(s).
>
> I think the point is moot in this case, because the test_bit either sees
> the flag or it doesn't. It's not possible for the flag to be set AND
> should_end_transaction return false that would be gross violation of
> program correctness.
So that's for the flushing part, but what about cur_trans->state?
On 11.01.21 г. 23:50 ч., David Sterba wrote:
> On Mon, Jan 11, 2021 at 10:33:42AM +0200, Nikolay Borisov wrote:
>> I think the point is moot in this case, because the test_bit either
>> sees the flag or it doesn't.  It's not possible for the flag to be set
>> AND should_end_transaction to return false; that would be a gross
>> violation of program correctness.
>
> So that's for the flushing part, but what about cur_trans->state?

Looking at the code, the barrier was there to order the publishing of
the delayed_ref.flushing (now replaced by the bit flag) against
surrounding code.  So independently of this patch, let's reason about
trans state:

- In should_end_transaction it's read without holding any locks. (U)
- It's modified in btrfs_cleanup_transaction without holding
  fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.
- Set in cleanup_transaction under fs_info->trans_lock. (L)
- Set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock. (L)
- Set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock. (L)
- Set in btrfs_commit_trans to COMMIT_UNBLOCK under fs_info->trans_lock. (L)
- Set in btrfs_commit_trans to COMMIT_COMPLETED without locks, but at
  this point the transaction is finished and fs_info->running_trans is
  NULL (U but irrelevant).

So by the looks of it we can have a concurrent read racing with a write,
due to reads not taking a lock.  In this case what we want to ensure is
that we see either the new or the old state.  I consulted with Will
Deacon and he said that in such a case we'd want to annotate the
accesses to ->state with (READ|WRITE)_ONCE so as to avoid a theoretical
tear.  In this case I don't think that could happen, but I imagine at
some point KCSAN would flag such an access as racy (which it is).
On Tue, Jan 12, 2021 at 11:17:45AM +0200, Nikolay Borisov wrote:
>
>
> On 11.01.21 г. 23:50 ч., David Sterba wrote:
> > On Mon, Jan 11, 2021 at 10:33:42AM +0200, Nikolay Borisov wrote:
> >> On 8.01.21 г. 18:01 ч., David Sterba wrote:
> >>> On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
> >>>> @@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> >>>> btrfs_trans_release_metadata(trans);
> >>>> trans->block_rsv = NULL;
> >>>>
> >>>> - /* make a pass through all the delayed refs we have so far
> >>>> - * any runnings procs may add more while we are here
> >>>> - */
> >>>> - ret = btrfs_run_delayed_refs(trans, 0);
> >>>> - if (ret) {
> >>>> - btrfs_end_transaction(trans);
> >>>> - return ret;
> >>>> - }
> >>>> -
> >>>> - cur_trans = trans->transaction;
> >>>> -
> >>>> /*
> >>>> - * set the flushing flag so procs in this transaction have to
> >>>> - * start sending their work down.
> >>>> + * We only want one transaction commit doing the flushing so we do not
> >>>> + * waste a bunch of time on lock contention on the extent root node.
> >>>> */
> >>>> - cur_trans->delayed_refs.flushing = 1;
> >>>> - smp_wmb();
> >>>
> >>> This barrier obviously separates the flushing = 1 and the rest of the
> >>> code, now implemented as test_and_set_bit, which implies full barrier.
> >>>
> >>> However, hunk in btrfs_should_end_transaction removes the barrier and
> >>> I'm not sure whether this is correct:
> >>>
> >>> - smp_mb();
> >>> if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
> >>> - cur_trans->delayed_refs.flushing)
> >>> + test_bit(BTRFS_DELAYED_REFS_FLUSHING,
> >>> + &cur_trans->delayed_refs.flags))
> >>> return true;
> >>>
> >>> This is never called under locks, so we don't have complete
> >>> synchronization of either the transaction state or the flushing bit.
> >>> btrfs_should_end_transaction is merely a hint and is not called in
> >>> critical places, so we could probably afford to keep it without a
> >>> barrier, or keep it with comment(s).
> >>
> >> I think the point is moot in this case, because the test_bit either sees
> >> the flag or it doesn't. It's not possible for the flag to be set AND
> >> should_end_transaction to return false; that would be a gross violation
> >> of program correctness.
> >
> > So that's for the flushing part, but what about cur_trans->state?
>
> Looking at the code, the barrier was there to order the publishing of
> the delayed_ref.flushing (now replaced by the bit flag) against
> surrounding code.
>
> So independently of this patch, let's reason about trans state. In
> should_end_transaction it's read without holding any locks. (U)
>
> It's modified in btrfs_cleanup_transaction without holding the
> fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.
>
> set in cleanup_transaction under fs_info->trans_lock (L)
> set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
> set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
> set in btrfs_commit_trans to COMMIT_UNBLOCK under fs_info->trans_lock.(L)
>
> set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
> point the transaction is finished and fs_info->running_trans is NULL (U
> but irrelevant).
>
> So by the looks of it we can have a concurrent read racing with a write,
> due to reads not taking a lock. In this case what we want to ensure is
> that we see either the new or the old state. I consulted with Will Deacon
> and he said that in such a case we'd want to annotate the accesses to
> ->state with (READ|WRITE)_ONCE so as to avoid a theoretical tear. In this
> case I don't think that could happen, but I imagine at some point KCSAN
> would flag such an access as racy (which it is).
Thanks for the analysis, I've copied it to the changelog as there's
probably no shorter way to explain it.
On Fri, Dec 18, 2020 at 02:24:18PM -0500, Josef Bacik wrote:
> v4->v5:
> - Added "btrfs: remove bogus BUG_ON in alloc_reserved_tree_block", as Nikolay
> pointed out I needed to explain why we no longer needed one of the delayed ref
> flushes, which led me down the rabbit hole of trying to understand why it
> wasn't a problem anymore. Turned out the BUG_ON() is bogus.
> - Added "btrfs: move delayed ref flushing for qgroup into qgroup helper",
> instead of removing the flushing for qgroups completely, we still sort of need
> it, even though it's actually still broken, so I've moved it into the qgroup
> helper.
> - Added Nikolay's rb for the last patch.
Moved from topic branch to misc-next, thanks.