linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH AUTOSEL 5.11 53/67] btrfs: fix error handling in commit_fs_roots
       [not found] <20210224125026.481804-1-sashal@kernel.org>
@ 2021-02-24 12:50 ` Sasha Levin
  2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 54/67] btrfs: make btrfs_start_delalloc_root's nr argument a long Sasha Levin
  2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 55/67] btrfs: only let one thread pre-flush delayed refs in commit Sasha Levin
  2 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2021-02-24 12:50 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Josef Bacik, David Sterba, Sasha Levin, linux-btrfs

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit 4f4317c13a40194940acf4a71670179c4faca2b5 ]

While doing error injection I would sometimes get a corrupt file system.
This is because I was injecting errors at btrfs_search_slot, but would
only do it one time per stack.  This uncovered a problem in
commit_fs_roots, where if we get an error we would just break.  However
we're in a nested loop, the first loop being a loop to find all the
dirty fs roots, and then subsequent root updates would succeed clearing
the error value.

This isn't likely to happen in real scenarios, however we could
potentially get a random ENOMEM once and then not again, and we'd end up
with a corrupted file system.  Fix this by moving the error checking
around a bit to the main loop, as this is the only place where something
will fail, and return the error as soon as it occurs.

With this patch my reproducer no longer corrupts the file system.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/transaction.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 6af7f2bf92de7..fbf93067642ac 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1319,7 +1319,6 @@ static noinline int commit_fs_roots(struct btrfs_trans_handle *trans)
 	struct btrfs_root *gang[8];
 	int i;
 	int ret;
-	int err = 0;
 
 	spin_lock(&fs_info->fs_roots_radix_lock);
 	while (1) {
@@ -1331,6 +1330,8 @@ static noinline int commit_fs_roots(struct btrfs_trans_handle *trans)
 			break;
 		for (i = 0; i < ret; i++) {
 			struct btrfs_root *root = gang[i];
+			int ret2;
+
 			radix_tree_tag_clear(&fs_info->fs_roots_radix,
 					(unsigned long)root->root_key.objectid,
 					BTRFS_ROOT_TRANS_TAG);
@@ -1350,17 +1351,17 @@ static noinline int commit_fs_roots(struct btrfs_trans_handle *trans)
 						    root->node);
 			}
 
-			err = btrfs_update_root(trans, fs_info->tree_root,
+			ret2 = btrfs_update_root(trans, fs_info->tree_root,
 						&root->root_key,
 						&root->root_item);
+			if (ret2)
+				return ret2;
 			spin_lock(&fs_info->fs_roots_radix_lock);
-			if (err)
-				break;
 			btrfs_qgroup_free_meta_all_pertrans(root);
 		}
 	}
 	spin_unlock(&fs_info->fs_roots_radix_lock);
-	return err;
+	return 0;
 }
 
 /*
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH AUTOSEL 5.11 54/67] btrfs: make btrfs_start_delalloc_root's nr argument a long
       [not found] <20210224125026.481804-1-sashal@kernel.org>
  2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 53/67] btrfs: fix error handling in commit_fs_roots Sasha Levin
@ 2021-02-24 12:50 ` Sasha Levin
  2021-02-24 18:09   ` David Sterba
  2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 55/67] btrfs: only let one thread pre-flush delayed refs in commit Sasha Levin
  2 siblings, 1 reply; 7+ messages in thread
From: Sasha Levin @ 2021-02-24 12:50 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Nikolay Borisov, Josef Bacik, David Sterba, Sasha Levin, linux-btrfs

From: Nikolay Borisov <nborisov@suse.com>

[ Upstream commit 9db4dc241e87fccd8301357d5ef908f40b50f2e3 ]

It's currently u64 which gets instantly translated either to LONG_MAX
(if U64_MAX is passed) or cast to an unsigned long (which is in fact,
wrong because writeback_control::nr_to_write is a signed, long type).

Just convert the function's argument to be long time which obviates the
need to manually convert u64 value to a long. Adjust all call sites
which pass U64_MAX to pass LONG_MAX. Finally ensure that in
shrink_delalloc the u64 is converted to a long without overflowing,
resulting in a negative number.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/ctree.h       | 2 +-
 fs/btrfs/dev-replace.c | 2 +-
 fs/btrfs/inode.c       | 6 +++---
 fs/btrfs/ioctl.c       | 2 +-
 fs/btrfs/space-info.c  | 3 ++-
 5 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4debdbdde2abb..81fde2d0327df 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3100,7 +3100,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 			       u32 min_type);
 
 int btrfs_start_delalloc_snapshot(struct btrfs_root *root);
-int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, u64 nr,
+int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
 			       bool in_reclaim_context);
 int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 			      unsigned int extra_bits,
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 324f646d6e5e2..bc73f798ce3a8 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -715,7 +715,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	 * flush all outstanding I/O and inode extent mappings before the
 	 * copy operation is declared as being finished
 	 */
-	ret = btrfs_start_delalloc_roots(fs_info, U64_MAX, false);
+	ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false);
 	if (ret) {
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 		return ret;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a8e0a6b038d3e..3a02b17f76eb7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9486,11 +9486,11 @@ int btrfs_start_delalloc_snapshot(struct btrfs_root *root)
 	return start_delalloc_inodes(root, &wbc, true, false);
 }
 
-int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, u64 nr,
+int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
 			       bool in_reclaim_context)
 {
 	struct writeback_control wbc = {
-		.nr_to_write = (nr == U64_MAX) ? LONG_MAX : (unsigned long)nr,
+		.nr_to_write = nr,
 		.sync_mode = WB_SYNC_NONE,
 		.range_start = 0,
 		.range_end = LLONG_MAX,
@@ -9512,7 +9512,7 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, u64 nr,
 		 * Reset nr_to_write here so we know that we're doing a full
 		 * flush.
 		 */
-		if (nr == U64_MAX)
+		if (nr == LONG_MAX)
 			wbc.nr_to_write = LONG_MAX;
 
 		root = list_first_entry(&splice, struct btrfs_root,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index dde49a791f3e2..e26df790988ba 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4951,7 +4951,7 @@ long btrfs_ioctl(struct file *file, unsigned int
 	case BTRFS_IOC_SYNC: {
 		int ret;
 
-		ret = btrfs_start_delalloc_roots(fs_info, U64_MAX, false);
+		ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false);
 		if (ret)
 			return ret;
 		ret = btrfs_sync_fs(inode->i_sb, 1);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index e8347461c8ddd..84fb94e78a8ff 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -532,7 +532,8 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
 
 	loops = 0;
 	while ((delalloc_bytes || dio_bytes) && loops < 3) {
-		u64 nr_pages = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
+		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
+		long nr_pages = min_t(u64, temp, LONG_MAX);
 
 		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH AUTOSEL 5.11 55/67] btrfs: only let one thread pre-flush delayed refs in commit
       [not found] <20210224125026.481804-1-sashal@kernel.org>
  2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 53/67] btrfs: fix error handling in commit_fs_roots Sasha Levin
  2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 54/67] btrfs: make btrfs_start_delalloc_root's nr argument a long Sasha Levin
@ 2021-02-24 12:50 ` Sasha Levin
  2021-02-24 18:08   ` David Sterba
  2 siblings, 1 reply; 7+ messages in thread
From: Sasha Levin @ 2021-02-24 12:50 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Josef Bacik, Nikolay Borisov, David Sterba, Sasha Levin, linux-btrfs

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit e19eb11f4f3d3b0463cd897016064a79cb6d8c6d ]

I've been running a stress test that runs 20 workers in their own
subvolume, which are running an fsstress instance with 4 threads per
worker, which is 80 total fsstress threads.  In addition to this I'm
running balance in the background as well as creating and deleting
snapshots.  This test takes around 12 hours to run normally, going
slower and slower as the test goes on.

The reason for this is because fsstress is running fsync sometimes, and
because we're messing with block groups we often fall through to
btrfs_commit_transaction, so will often have 20-30 threads all calling
btrfs_commit_transaction at the same time.

These all get stuck contending on the extent tree while they try to run
delayed refs during the initial part of the commit.

This is suboptimal, really because the extent tree is a single point of
failure we only want one thread acting on that tree at once to reduce
lock contention.

Fix this by making the flushing mechanism a bit operation, to make it
easy to use test_and_set_bit() in order to make sure only one task does
this initial flush.

Once we're into the transaction commit we only have one thread doing
delayed ref running, it's just this initial pre-flush that is
problematic.  With this patch my stress test takes around 90 minutes to
run, instead of 12 hours.

The memory barrier is not necessary for the flushing bit as it's
ordered, unlike plain int. The transaction state accessed in
btrfs_should_end_transaction could be affected by that too as it's not
always used under transaction lock. Upon Nikolay's analysis in [1]
it's not necessary:

  In should_end_transaction it's read without holding any locks. (U)

  It's modified in btrfs_cleanup_transaction without holding the
  fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.

  set in cleanup_transaction under fs_info->trans_lock (L)
  set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
  set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
  set in btrfs_commit_trans to COMMIT_UNBLOCK under
  fs_info->trans_lock.(L)

  set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
  point the transaction is finished and fs_info->running_trans is NULL (U
  but irrelevant).

  So by the looks of it we can have a concurrent READ race with a WRITE,
  due to reads not taking a lock. In this case what we want to ensure is
  we either see new or old state. I consulted with Will Deacon and he said
  that in such a case we'd want to annotate the accesses to ->state with
  (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I
  don't think this could happen but I imagine at some point KCSAN would
  flag such an access as racy (which it is).

[1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.com

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add comments regarding memory barrier ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/delayed-ref.h | 12 ++++++------
 fs/btrfs/transaction.c | 32 +++++++++++++++-----------------
 2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 1c977e6d45dc3..52364ea322d67 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -135,6 +135,11 @@ struct btrfs_delayed_data_ref {
 	u64 offset;
 };
 
+enum btrfs_delayed_ref_flags {
+	/* Indicate that we are flushing delayed refs for the commit */
+	BTRFS_DELAYED_REFS_FLUSHING,
+};
+
 struct btrfs_delayed_ref_root {
 	/* head ref rbtree */
 	struct rb_root_cached href_root;
@@ -158,12 +163,7 @@ struct btrfs_delayed_ref_root {
 
 	u64 pending_csums;
 
-	/*
-	 * set when the tree is flushing before a transaction commit,
-	 * used by the throttling code to decide if new updates need
-	 * to be run right away
-	 */
-	int flushing;
+	unsigned long flags;
 
 	u64 run_delayed_start;
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index fbf93067642ac..3cced84752178 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -909,9 +909,8 @@ bool btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
 
-	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
-	    cur_trans->delayed_refs.flushing)
+	    test_bit(BTRFS_DELAYED_REFS_FLUSHING, &cur_trans->delayed_refs.flags))
 		return true;
 
 	return should_end_transaction(trans);
@@ -2043,23 +2042,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	btrfs_trans_release_metadata(trans);
 	trans->block_rsv = NULL;
 
-	/* make a pass through all the delayed refs we have so far
-	 * any runnings procs may add more while we are here
-	 */
-	ret = btrfs_run_delayed_refs(trans, 0);
-	if (ret) {
-		btrfs_end_transaction(trans);
-		return ret;
-	}
-
-	cur_trans = trans->transaction;
-
 	/*
-	 * set the flushing flag so procs in this transaction have to
-	 * start sending their work down.
+	 * We only want one transaction commit doing the flushing so we do not
+	 * waste a bunch of time on lock contention on the extent root node.
 	 */
-	cur_trans->delayed_refs.flushing = 1;
-	smp_wmb();
+	if (!test_and_set_bit(BTRFS_DELAYED_REFS_FLUSHING,
+			      &cur_trans->delayed_refs.flags)) {
+		/*
+		 * Make a pass through all the delayed refs we have so far.
+		 * Any running threads may add more while we are here.
+		 */
+		ret = btrfs_run_delayed_refs(trans, 0);
+		if (ret) {
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+	}
 
 	btrfs_create_pending_block_groups(trans);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH AUTOSEL 5.11 55/67] btrfs: only let one thread pre-flush delayed refs in commit
  2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 55/67] btrfs: only let one thread pre-flush delayed refs in commit Sasha Levin
@ 2021-02-24 18:08   ` David Sterba
  2021-03-04 22:24     ` Sasha Levin
  0 siblings, 1 reply; 7+ messages in thread
From: David Sterba @ 2021-02-24 18:08 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Josef Bacik, Nikolay Borisov, David Sterba,
	linux-btrfs

On Wed, Feb 24, 2021 at 07:50:13AM -0500, Sasha Levin wrote:
> From: Josef Bacik <josef@toxicpanda.com>
> 
> [ Upstream commit e19eb11f4f3d3b0463cd897016064a79cb6d8c6d ]
> 
> I've been running a stress test that runs 20 workers in their own
> subvolume, which are running an fsstress instance with 4 threads per
> worker, which is 80 total fsstress threads.  In addition to this I'm
> running balance in the background as well as creating and deleting
> snapshots.  This test takes around 12 hours to run normally, going
> slower and slower as the test goes on.
> 
> The reason for this is because fsstress is running fsync sometimes, and
> because we're messing with block groups we often fall through to
> btrfs_commit_transaction, so will often have 20-30 threads all calling
> btrfs_commit_transaction at the same time.
> 
> These all get stuck contending on the extent tree while they try to run
> delayed refs during the initial part of the commit.
> 
> This is suboptimal, really because the extent tree is a single point of
> failure we only want one thread acting on that tree at once to reduce
> lock contention.
> 
> Fix this by making the flushing mechanism a bit operation, to make it
> easy to use test_and_set_bit() in order to make sure only one task does
> this initial flush.
> 
> Once we're into the transaction commit we only have one thread doing
> delayed ref running, it's just this initial pre-flush that is
> problematic.  With this patch my stress test takes around 90 minutes to
> run, instead of 12 hours.
> 
> The memory barrier is not necessary for the flushing bit as it's
> ordered, unlike plain int. The transaction state accessed in
> btrfs_should_end_transaction could be affected by that too as it's not
> always used under transaction lock. Upon Nikolay's analysis in [1]
> it's not necessary:
> 
>   In should_end_transaction it's read without holding any locks. (U)
> 
>   It's modified in btrfs_cleanup_transaction without holding the
>   fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.
> 
>   set in cleanup_transaction under fs_info->trans_lock (L)
>   set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
>   set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
>   set in btrfs_commit_trans to COMMIT_UNBLOCK under
>   fs_info->trans_lock.(L)
> 
>   set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
>   point the transaction is finished and fs_info->running_trans is NULL (U
>   but irrelevant).
> 
>   So by the looks of it we can have a concurrent READ race with a WRITE,
>   due to reads not taking a lock. In this case what we want to ensure is
>   we either see new or old state. I consulted with Will Deacon and he said
>   that in such a case we'd want to annotate the accesses to ->state with
>   (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I
>   don't think this could happen but I imagine at some point KCSAN would
>   flag such an access as racy (which it is).
> 
> [1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.com
> 
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> [ add comments regarding memory barrier ]
> Signed-off-by: David Sterba <dsterba@suse.com>
> Signed-off-by: Sasha Levin <sashal@kernel.org>

Please drop this patch from autosel queue, it's part of a larger series
that reworks flushing and is not a standalone fix.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH AUTOSEL 5.11 54/67] btrfs: make btrfs_start_delalloc_root's nr argument a long
  2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 54/67] btrfs: make btrfs_start_delalloc_root's nr argument a long Sasha Levin
@ 2021-02-24 18:09   ` David Sterba
  2021-03-04 22:24     ` Sasha Levin
  0 siblings, 1 reply; 7+ messages in thread
From: David Sterba @ 2021-02-24 18:09 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Nikolay Borisov, Josef Bacik, David Sterba,
	linux-btrfs

On Wed, Feb 24, 2021 at 07:50:12AM -0500, Sasha Levin wrote:
> From: Nikolay Borisov <nborisov@suse.com>
> 
> [ Upstream commit 9db4dc241e87fccd8301357d5ef908f40b50f2e3 ]
> 
> It's currently u64 which gets instantly translated either to LONG_MAX
> (if U64_MAX is passed) or cast to an unsigned long (which is in fact,
> wrong because writeback_control::nr_to_write is a signed, long type).
> 
> Just convert the function's argument to be long time which obviates the
> need to manually convert u64 value to a long. Adjust all call sites
> which pass U64_MAX to pass LONG_MAX. Finally ensure that in
> shrink_delalloc the u64 is converted to a long without overflowing,
> resulting in a negative number.

This patch is a cleanup and I don't see any other patch depend on it, so
please drop it from autosel.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH AUTOSEL 5.11 55/67] btrfs: only let one thread pre-flush delayed refs in commit
  2021-02-24 18:08   ` David Sterba
@ 2021-03-04 22:24     ` Sasha Levin
  0 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2021-03-04 22:24 UTC (permalink / raw)
  To: dsterba, linux-kernel, stable, Josef Bacik, Nikolay Borisov,
	David Sterba, linux-btrfs

On Wed, Feb 24, 2021 at 07:08:20PM +0100, David Sterba wrote:
>On Wed, Feb 24, 2021 at 07:50:13AM -0500, Sasha Levin wrote:
>> From: Josef Bacik <josef@toxicpanda.com>
>>
>> [ Upstream commit e19eb11f4f3d3b0463cd897016064a79cb6d8c6d ]
>>
>> I've been running a stress test that runs 20 workers in their own
>> subvolume, which are running an fsstress instance with 4 threads per
>> worker, which is 80 total fsstress threads.  In addition to this I'm
>> running balance in the background as well as creating and deleting
>> snapshots.  This test takes around 12 hours to run normally, going
>> slower and slower as the test goes on.
>>
>> The reason for this is because fsstress is running fsync sometimes, and
>> because we're messing with block groups we often fall through to
>> btrfs_commit_transaction, so will often have 20-30 threads all calling
>> btrfs_commit_transaction at the same time.
>>
>> These all get stuck contending on the extent tree while they try to run
>> delayed refs during the initial part of the commit.
>>
>> This is suboptimal, really because the extent tree is a single point of
>> failure we only want one thread acting on that tree at once to reduce
>> lock contention.
>>
>> Fix this by making the flushing mechanism a bit operation, to make it
>> easy to use test_and_set_bit() in order to make sure only one task does
>> this initial flush.
>>
>> Once we're into the transaction commit we only have one thread doing
>> delayed ref running, it's just this initial pre-flush that is
>> problematic.  With this patch my stress test takes around 90 minutes to
>> run, instead of 12 hours.
>>
>> The memory barrier is not necessary for the flushing bit as it's
>> ordered, unlike plain int. The transaction state accessed in
>> btrfs_should_end_transaction could be affected by that too as it's not
>> always used under transaction lock. Upon Nikolay's analysis in [1]
>> it's not necessary:
>>
>>   In should_end_transaction it's read without holding any locks. (U)
>>
>>   It's modified in btrfs_cleanup_transaction without holding the
>>   fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.
>>
>>   set in cleanup_transaction under fs_info->trans_lock (L)
>>   set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
>>   set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
>>   set in btrfs_commit_trans to COMMIT_UNBLOCK under
>>   fs_info->trans_lock.(L)
>>
>>   set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
>>   point the transaction is finished and fs_info->running_trans is NULL (U
>>   but irrelevant).
>>
>>   So by the looks of it we can have a concurrent READ race with a WRITE,
>>   due to reads not taking a lock. In this case what we want to ensure is
>>   we either see new or old state. I consulted with Will Deacon and he said
>>   that in such a case we'd want to annotate the accesses to ->state with
>>   (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I
>>   don't think this could happen but I imagine at some point KCSAN would
>>   flag such an access as racy (which it is).
>>
>> [1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.com
>>
>> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> [ add comments regarding memory barrier ]
>> Signed-off-by: David Sterba <dsterba@suse.com>
>> Signed-off-by: Sasha Levin <sashal@kernel.org>
>
>Please drop this patch from autosel queue, it's part of a larger series
>that reworks flushing and is not a standalone fix.

Will do, thanks!

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH AUTOSEL 5.11 54/67] btrfs: make btrfs_start_delalloc_root's nr argument a long
  2021-02-24 18:09   ` David Sterba
@ 2021-03-04 22:24     ` Sasha Levin
  0 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2021-03-04 22:24 UTC (permalink / raw)
  To: dsterba, linux-kernel, stable, Nikolay Borisov, Josef Bacik,
	David Sterba, linux-btrfs

On Wed, Feb 24, 2021 at 07:09:42PM +0100, David Sterba wrote:
>On Wed, Feb 24, 2021 at 07:50:12AM -0500, Sasha Levin wrote:
>> From: Nikolay Borisov <nborisov@suse.com>
>>
>> [ Upstream commit 9db4dc241e87fccd8301357d5ef908f40b50f2e3 ]
>>
>> It's currently u64 which gets instantly translated either to LONG_MAX
>> (if U64_MAX is passed) or cast to an unsigned long (which is in fact,
>> wrong because writeback_control::nr_to_write is a signed, long type).
>>
>> Just convert the function's argument to be long time which obviates the
>> need to manually convert u64 value to a long. Adjust all call sites
>> which pass U64_MAX to pass LONG_MAX. Finally ensure that in
>> shrink_delalloc the u64 is converted to a long without overflowing,
>> resulting in a negative number.
>
>This patch is a cleanup and I don't see any other patch depend on it, so
>please drop it from autosel.

I'll drop it, thanks!

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-03-04 22:24 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20210224125026.481804-1-sashal@kernel.org>
2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 53/67] btrfs: fix error handling in commit_fs_roots Sasha Levin
2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 54/67] btrfs: make btrfs_start_delalloc_root's nr argument a long Sasha Levin
2021-02-24 18:09   ` David Sterba
2021-03-04 22:24     ` Sasha Levin
2021-02-24 12:50 ` [PATCH AUTOSEL 5.11 55/67] btrfs: only let one thread pre-flush delayed refs in commit Sasha Levin
2021-02-24 18:08   ` David Sterba
2021-03-04 22:24     ` Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).