stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] btrfs: fix possible free space tree corruption with online conversion
@ 2021-01-15 21:26 Josef Bacik
  2021-01-24 13:14 ` David Sterba
  0 siblings, 1 reply; 3+ messages in thread
From: Josef Bacik @ 2021-01-15 21:26 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: stable

While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an -EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.

This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same.  During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups.  While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.

Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method.  We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.

CC: stable@vger.kernel.org # 4.5+
Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
v2->v3:
- Updated the changelog to be more specific about the scenario that the problem
  happens under.
- Fixed the stable tag as well.

v1->v2:
- Dropped the UNTRUSTED bit on abort.

 fs/btrfs/block-group.c     | 11 ++++++++++-
 fs/btrfs/ctree.h           |  3 +++
 fs/btrfs/free-space-tree.c | 10 +++++++++-
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 0886e81e5540..290f05e87a6b 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -673,7 +673,16 @@ static noinline void caching_thread(struct btrfs_work *work)
 		wake_up(&caching_ctl->wait);
 	}
 
-	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+	/*
+	 * If we are in the transaction that populated the free space tree we
+	 * can't actually cache from the free space tree as our commit root and
+	 * real root are the same, so we could change the contents of the blocks
+	 * while caching.  Instead do the slow caching in this case, and after
+	 * the transaction has committed we will be safe.
+	 */
+	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
+	    !(test_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
+		       &fs_info->flags)))
 		ret = load_free_space_tree(caching_ctl);
 	else
 		ret = load_extent_tree_free(caching_ctl);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1bd416e5a731..1fdb5dbe5287 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -563,6 +563,9 @@ enum {
 
 	/* Indicate that we need to cleanup space cache v1 */
 	BTRFS_FS_CLEANUP_SPACE_CACHE_V1,
+
+	/* Indicate that we can't trust the free space tree for caching yet. */
+	BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
 };
 
 /*
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index e33a65bd9a0c..a33bca94d133 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -1150,6 +1150,7 @@ int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
 		return PTR_ERR(trans);
 
 	set_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);
+	set_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);
 	free_space_root = btrfs_create_tree(trans,
 					    BTRFS_FREE_SPACE_TREE_OBJECTID);
 	if (IS_ERR(free_space_root)) {
@@ -1171,11 +1172,18 @@ int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
 	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
 	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID);
 	clear_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);
+	ret = btrfs_commit_transaction(trans);
 
-	return btrfs_commit_transaction(trans);
+	/*
+	 * Now that we've committed the transaction any reading of our commit
+	 * root will be safe, so we can cache from the free space tree now.
+	 */
+	clear_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);
+	return ret;
 
 abort:
 	clear_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);
+	clear_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);
 	btrfs_abort_transaction(trans, ret);
 	btrfs_end_transaction(trans);
 	return ret;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v3] btrfs: fix possible free space tree corruption with online conversion
  2021-01-15 21:26 [PATCH v3] btrfs: fix possible free space tree corruption with online conversion Josef Bacik
@ 2021-01-24 13:14 ` David Sterba
  2021-01-25 14:56   ` Josef Bacik
  0 siblings, 1 reply; 3+ messages in thread
From: David Sterba @ 2021-01-24 13:14 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, stable

On Fri, Jan 15, 2021 at 04:26:17PM -0500, Josef Bacik wrote:
> While running btrfs/011 in a loop I would often ASSERT() while trying to
> add a new free space entry that already existed, or get an -EEXIST while
> adding a new block to the extent tree, which is another indication of
> double allocation.
> 
> This occurs because when we do the free space tree population, we create
> the new root and then populate the tree and commit the transaction.
> The problem is when you create a new root, the root node and commit root
> node are the same.  During this initial transaction commit we will run
> all of the delayed refs that were paused during the free space tree
> generation, and thus begin to cache block groups.  While caching block
> groups the caching thread will be reading from the main root for the
> free space tree, so as we make allocations we'll be changing the free
> space tree, which can cause us to add the same range twice which results
> in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a

Still no stacktraces but at least this gives some pointer to the code
which assert is hit.

> variety of different errors when running delayed refs because of a
> double allocation.
> 
> Fix this by marking the fs_info as unsafe to load the free space tree,
> and fall back on the old slow method.  We could be smarter than this,
> for example caching the block group while we're populating the free
> space tree, but since this is a serious problem I've opted for the
> simplest solution.
> 
> CC: stable@vger.kernel.org # 4.5+
> Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Added to misc-next with Filipe's review from v2, thanks.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v3] btrfs: fix possible free space tree corruption with online conversion
  2021-01-24 13:14 ` David Sterba
@ 2021-01-25 14:56   ` Josef Bacik
  0 siblings, 0 replies; 3+ messages in thread
From: Josef Bacik @ 2021-01-25 14:56 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team, stable

On 1/24/21 8:14 AM, David Sterba wrote:
> On Fri, Jan 15, 2021 at 04:26:17PM -0500, Josef Bacik wrote:
>> While running btrfs/011 in a loop I would often ASSERT() while trying to
>> add a new free space entry that already existed, or get an -EEXIST while
>> adding a new block to the extent tree, which is another indication of
>> double allocation.
>>
>> This occurs because when we do the free space tree population, we create
>> the new root and then populate the tree and commit the transaction.
>> The problem is when you create a new root, the root node and commit root
>> node are the same.  During this initial transaction commit we will run
>> all of the delayed refs that were paused during the free space tree
>> generation, and thus begin to cache block groups.  While caching block
>> groups the caching thread will be reading from the main root for the
>> free space tree, so as we make allocations we'll be changing the free
>> space tree, which can cause us to add the same range twice which results
>> in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
> 
> Still no stacktraces but at least this gives some pointer to the code
> which assert is hit.

Sorry I meant to put this in the changelog for the patch but forgot.  I tried to 
reproduce the problem for a day on a couple of boxes but couldn't, and I had 
lost the original logs.  Next time I'll capture the stacktraces themselves, but 
I remembered which ASSERT() it was so just noted that in the log.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-01-25 15:34 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-15 21:26 [PATCH v3] btrfs: fix possible free space tree corruption with online conversion Josef Bacik
2021-01-24 13:14 ` David Sterba
2021-01-25 14:56   ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).