All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] btrfs: fix log replay failure due to race with space cache rebuild
@ 2021-01-22 17:56 fdmanana
  2021-01-22 18:39 ` Josef Bacik
  2021-01-22 19:07 ` [PATCH v3] " fdmanana
  0 siblings, 2 replies; 6+ messages in thread
From: fdmanana @ 2021-01-22 17:56 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

After a sudden power failure we may end up with a space cache on disk that
is not valid and needs to be rebuilt from scratch.

If that happens, during log replay when we attempt to pin an extent buffer
from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
the space cache to be rebuilt through the call to:

    btrfs_cache_block_group(cache, 1);

That is because that only waits for the task (work queue job) that loads
the space cache to change the cache state from BTRFS_CACHE_FAST to any
other value. That is ok when the space cache on disk exists and is valid,
but when the cache is not valid and needs to be rebuilt, it ends up
returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
at caching_thread()).

So this means that we can end up trying to unpin a range which is not yet
marked as free in the block group. This results in the call to
btrfs_remove_free_space() to return -EINVAL to
btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
as well as mounting the filesystem. More specifically the -EINVAL comes
from free_space_cache.c:remove_from_bitmap(), because the requested range
is not marked as free space (ones in the bitmap), we have the following
condition triggered:

static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
(...)
       if (ret < 0 || search_start != *offset)
            return -EINVAL;
(...)

It's the "search_start != *offset" that results in the condition being
evaluated to true.

When this happens we got the following in dmesg/syslog:

[72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
[72383.417837] BTRFS info (device sdb): disk space caching is enabled
[72383.418536] BTRFS info (device sdb): has skinny extents
[72383.423846] BTRFS info (device sdb): start tree-log replay
[72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
[72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
[72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
[72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
[72383.460241] BTRFS error (device sdb): open_ctree failed

We also mark the range for the extent buffer in the excluded extents io
tree. That is fine when the space cache is valid on disk and we can load
it, in which case it causes no problems.

However, for the case where we need to rebuild the space cache, because it
is either invalid or it is missing, having the extent buffer range marked
in the excluded extents io tree leads to a -EINVAL failure from the call
to btrfs_remove_free_space(), resulting in the log replay and mount to
fail. This is because by having the range marked in the excluded extents
io tree, the caching thread ends up never adding the range of the extent
buffer as free space in the block group since the calls to
add_new_free_space(), called from load_extent_tree_free(), filter out any
ranges that are marked as excluded extents.

So fix this by making sure that during log replay we wait for the caching
task to finish completely when we need to rebuild a space cache, and also
drop the need to mark the extent buffer range in the excluded extents io
tree, as well as clearing ranges from that tree at
btrfs_finish_extent_commit().

This started to happen with some frequency on large filesystems having
block groups with a lot of fragmentation since the recent commit
e747853cae3ae3 ("btrfs: load free space cache asynchronously"), but in
fact the issue has been there for years, it was just much less likely
to happen.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 30b1a630dc2f..89d1b0551cf8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2602,8 +2602,6 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
 	struct btrfs_block_group *cache;
 	int ret;
 
-	btrfs_add_excluded_extent(trans->fs_info, bytenr, num_bytes);
-
 	cache = btrfs_lookup_block_group(trans->fs_info, bytenr);
 	if (!cache)
 		return -EINVAL;
@@ -2615,6 +2613,15 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
 	 * the pinned extents.
 	 */
 	btrfs_cache_block_group(cache, 1);
+	/*
+	 * Make sure we wait until the cache is completely built in case it is
+	 * missing or is invalid and therefore needs to be rebuilt.
+	 */
+	if (btrfs_test_opt(trans->fs_info, SPACE_CACHE)) {
+		ret = btrfs_wait_block_group_cache_done(cache);
+		if (ret)
+			return ret;
+	}
 
 	pin_down_extent(trans, cache, bytenr, num_bytes, 0);
 
@@ -2635,7 +2642,7 @@ static int __exclude_logged_extent(struct btrfs_fs_info *fs_info,
 	if (!block_group)
 		return -EINVAL;
 
-	btrfs_cache_block_group(block_group, 0);
+	btrfs_cache_block_group(block_group, 1);
 	caching_ctl = btrfs_get_caching_control(block_group);
 
 	if (!caching_ctl) {
@@ -2644,10 +2651,15 @@ static int __exclude_logged_extent(struct btrfs_fs_info *fs_info,
 		ret = btrfs_remove_free_space(block_group, start, num_bytes);
 	} else {
 		/*
-		 * We must wait for v1 caching to finish, otherwise we may not
-		 * remove our space.
+		 * Make sure we wait until the cache is completely built in case it is
+		 * missing or is invalid and therefore needs to be rebuilt.
 		 */
-		btrfs_wait_space_cache_v1_finished(block_group, caching_ctl);
+		if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
+			ret = btrfs_wait_block_group_cache_done(block_group);
+			if (ret)
+				return ret;
+		}
+
 		mutex_lock(&caching_ctl->mutex);
 
 		if (start >= caching_ctl->progress) {
@@ -2863,9 +2875,6 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 			mutex_unlock(&fs_info->unused_bg_unpin_mutex);
 			break;
 		}
-		if (test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags))
-			clear_extent_bits(&fs_info->excluded_extents, start,
-					  end, EXTENT_UPTODATE);
 
 		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
 			ret = btrfs_discard_extent(fs_info, start,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] btrfs: fix log replay failure due to race with space cache rebuild
  2021-01-22 17:56 [PATCH] btrfs: fix log replay failure due to race with space cache rebuild fdmanana
@ 2021-01-22 18:39 ` Josef Bacik
  2021-01-22 18:42   ` Filipe Manana
  2021-01-22 19:07 ` [PATCH v3] " fdmanana
  1 sibling, 1 reply; 6+ messages in thread
From: Josef Bacik @ 2021-01-22 18:39 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 1/22/21 12:56 PM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> After a sudden power failure we may end up with a space cache on disk that
> is not valid and needs to be rebuilt from scratch.
> 
> If that happens, during log replay when we attempt to pin an extent buffer
> from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
> the space cache to be rebuilt through the call to:
> 
>      btrfs_cache_block_group(cache, 1);
> 
> That is because that only waits for the task (work queue job) that loads
> the space cache to change the cache state from BTRFS_CACHE_FAST to any
> other value. That is ok when the space cache on disk exists and is valid,
> but when the cache is not valid and needs to be rebuilt, it ends up
> returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
> at caching_thread()).
> 
> So this means that we can end up trying to unpin a range which is not yet
> marked as free in the block group. This results in the call to
> btrfs_remove_free_space() to return -EINVAL to
> btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
> as well as mounting the filesystem. More specifically the -EINVAL comes
> from free_space_cache.c:remove_from_bitmap(), because the requested range
> is not marked as free space (ones in the bitmap), we have the following
> condition triggered:
> 
> static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
> (...)
>         if (ret < 0 || search_start != *offset)
>              return -EINVAL;
> (...)
> 
> It's the "search_start != *offset" that results in the condition being
> evaluated to true.
> 
> When this happens we got the following in dmesg/syslog:
> 
> [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
> [72383.417837] BTRFS info (device sdb): disk space caching is enabled
> [72383.418536] BTRFS info (device sdb): has skinny extents
> [72383.423846] BTRFS info (device sdb): start tree-log replay
> [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
> [72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
> [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
> [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
> [72383.460241] BTRFS error (device sdb): open_ctree failed
> 
> We also mark the range for the extent buffer in the excluded extents io
> tree. That is fine when the space cache is valid on disk and we can load
> it, in which case it causes no problems.
> 
> However, for the case where we need to rebuild the space cache, because it
> is either invalid or it is missing, having the extent buffer range marked
> in the excluded extents io tree leads to a -EINVAL failure from the call
> to btrfs_remove_free_space(), resulting in the log replay and mount to
> fail. This is because by having the range marked in the excluded extents
> io tree, the caching thread ends up never adding the range of the extent
> buffer as free space in the block group since the calls to
> add_new_free_space(), called from load_extent_tree_free(), filter out any
> ranges that are marked as excluded extents.
> 
> So fix this by making sure that during log replay we wait for the caching
> task to finish completely when we need to rebuild a space cache, and also
> drop the need to mark the extent buffer range in the excluded extents io
> tree, as well as clearing ranges from that tree at
> btrfs_finish_extent_commit().
> 
> This started to happen with some frequency on large filesystems having
> block groups with a lot of fragmentation since the recent commit
> e747853cae3ae3 ("btrfs: load free space cache asynchronously"), but in
> fact the issue has been there for years, it was just much less likely
> to happen.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/extent-tree.c | 27 ++++++++++++++++++---------
>   1 file changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 30b1a630dc2f..89d1b0551cf8 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2602,8 +2602,6 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
>   	struct btrfs_block_group *cache;
>   	int ret;
>   
> -	btrfs_add_excluded_extent(trans->fs_info, bytenr, num_bytes);
> -
>   	cache = btrfs_lookup_block_group(trans->fs_info, bytenr);
>   	if (!cache)
>   		return -EINVAL;
> @@ -2615,6 +2613,15 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
>   	 * the pinned extents.
>   	 */
>   	btrfs_cache_block_group(cache, 1);
> +	/*
> +	 * Make sure we wait until the cache is completely built in case it is
> +	 * missing or is invalid and therefore needs to be rebuilt.
> +	 */
> +	if (btrfs_test_opt(trans->fs_info, SPACE_CACHE)) {
> +		ret = btrfs_wait_block_group_cache_done(cache);
> +		if (ret)
> +			return ret;
> +	}

Sorry I didn't explain this well in my previous reply and on slack.

If we are going to keep the btrfs_remove_free_space() below, which we have to, 
we cannot remove the btrfs_add_excluded_extent() without also unconditionally 
loading the space cache.  The 'load_cache_only' only means we'll wait for the 
space cache to be loaded in the SPACE_CACHE case, not that we won't start caching.

Consider the free space tree case, we'll be doing the normal caching, and we'll 
either hit the case that you're trying to fix, because we attempt to remove a 
free space range that has been partially cached in the bitmap.  Or we'll hit the 
case where we think we've removed the free space even though it's not been 
loaded yet.

Your fix needs to remove the excluded extent parts as well as do

btrfs_cache_block_group(cache, 0);
btrfs_wait_block_group_cache_done(cache);

in order to be properly safe in all cases.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] btrfs: fix log replay failure due to race with space cache rebuild
  2021-01-22 18:39 ` Josef Bacik
@ 2021-01-22 18:42   ` Filipe Manana
  0 siblings, 0 replies; 6+ messages in thread
From: Filipe Manana @ 2021-01-22 18:42 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Fri, Jan 22, 2021 at 6:39 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On 1/22/21 12:56 PM, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > After a sudden power failure we may end up with a space cache on disk that
> > is not valid and needs to be rebuilt from scratch.
> >
> > If that happens, during log replay when we attempt to pin an extent buffer
> > from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
> > the space cache to be rebuilt through the call to:
> >
> >      btrfs_cache_block_group(cache, 1);
> >
> > That is because that only waits for the task (work queue job) that loads
> > the space cache to change the cache state from BTRFS_CACHE_FAST to any
> > other value. That is ok when the space cache on disk exists and is valid,
> > but when the cache is not valid and needs to be rebuilt, it ends up
> > returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
> > at caching_thread()).
> >
> > So this means that we can end up trying to unpin a range which is not yet
> > marked as free in the block group. This results in the call to
> > btrfs_remove_free_space() to return -EINVAL to
> > btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
> > as well as mounting the filesystem. More specifically the -EINVAL comes
> > from free_space_cache.c:remove_from_bitmap(), because the requested range
> > is not marked as free space (ones in the bitmap), we have the following
> > condition triggered:
> >
> > static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
> > (...)
> >         if (ret < 0 || search_start != *offset)
> >              return -EINVAL;
> > (...)
> >
> > It's the "search_start != *offset" that results in the condition being
> > evaluated to true.
> >
> > When this happens we got the following in dmesg/syslog:
> >
> > [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
> > [72383.417837] BTRFS info (device sdb): disk space caching is enabled
> > [72383.418536] BTRFS info (device sdb): has skinny extents
> > [72383.423846] BTRFS info (device sdb): start tree-log replay
> > [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
> > [72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
> > [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
> > [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
> > [72383.460241] BTRFS error (device sdb): open_ctree failed
> >
> > We also mark the range for the extent buffer in the excluded extents io
> > tree. That is fine when the space cache is valid on disk and we can load
> > it, in which case it causes no problems.
> >
> > However, for the case where we need to rebuild the space cache, because it
> > is either invalid or it is missing, having the extent buffer range marked
> > in the excluded extents io tree leads to a -EINVAL failure from the call
> > to btrfs_remove_free_space(), resulting in the log replay and mount to
> > fail. This is because by having the range marked in the excluded extents
> > io tree, the caching thread ends up never adding the range of the extent
> > buffer as free space in the block group since the calls to
> > add_new_free_space(), called from load_extent_tree_free(), filter out any
> > ranges that are marked as excluded extents.
> >
> > So fix this by making sure that during log replay we wait for the caching
> > task to finish completely when we need to rebuild a space cache, and also
> > drop the need to mark the extent buffer range in the excluded extents io
> > tree, as well as clearing ranges from that tree at
> > btrfs_finish_extent_commit().
> >
> > This started to happen with some frequency on large filesystems having
> > block groups with a lot of fragmentation since the recent commit
> > e747853cae3ae3 ("btrfs: load free space cache asynchronously"), but in
> > fact the issue has been there for years, it was just much less likely
> > to happen.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/extent-tree.c | 27 ++++++++++++++++++---------
> >   1 file changed, 18 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 30b1a630dc2f..89d1b0551cf8 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2602,8 +2602,6 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
> >       struct btrfs_block_group *cache;
> >       int ret;
> >
> > -     btrfs_add_excluded_extent(trans->fs_info, bytenr, num_bytes);
> > -
> >       cache = btrfs_lookup_block_group(trans->fs_info, bytenr);
> >       if (!cache)
> >               return -EINVAL;
> > @@ -2615,6 +2613,15 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
> >        * the pinned extents.
> >        */
> >       btrfs_cache_block_group(cache, 1);
> > +     /*
> > +      * Make sure we wait until the cache is completely built in case it is
> > +      * missing or is invalid and therefore needs to be rebuilt.
> > +      */
> > +     if (btrfs_test_opt(trans->fs_info, SPACE_CACHE)) {
> > +             ret = btrfs_wait_block_group_cache_done(cache);
> > +             if (ret)
> > +                     return ret;
> > +     }
>
> Sorry I didn't explain this well in my previous reply and on slack.
>
> If we are going to keep the btrfs_remove_free_space() below, which we have to,
> we cannot remove the btrfs_add_excluded_extent() without also unconditionally
> loading the space cache.  The 'load_cache_only' only means we'll wait for the
> space cache to be loaded in the SPACE_CACHE case, not that we won't start caching.
>
> Consider the free space tree case, we'll be doing the normal caching, and we'll
> either hit the case that you're trying to fix, because we attempt to remove a
> free space range that has been partially cached in the bitmap.  Or we'll hit the
> case where we think we've removed the free space even though it's not been
> loaded yet.
>
> Your fix needs to remove the excluded extent parts as well as do
>
> btrfs_cache_block_group(cache, 0);
> btrfs_wait_block_group_cache_done(cache);
>
> in order to be properly safe in all cases.  Thanks,

Yes, I realized that later, I have a new version being tested right now:

https://pastebin.com/pgQ2y11a

I messed and didn't realize the wait is necessary even if there is no
space cache or tree enabled.

>
> Josef

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v3] btrfs: fix log replay failure due to race with space cache rebuild
  2021-01-22 17:56 [PATCH] btrfs: fix log replay failure due to race with space cache rebuild fdmanana
  2021-01-22 18:39 ` Josef Bacik
@ 2021-01-22 19:07 ` fdmanana
  2021-01-22 19:19   ` Josef Bacik
  2021-01-25 17:40   ` David Sterba
  1 sibling, 2 replies; 6+ messages in thread
From: fdmanana @ 2021-01-22 19:07 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

After a sudden power failure we may end up with a space cache on disk that
is not valid and needs to be rebuilt from scratch.

If that happens, during log replay when we attempt to pin an extent buffer
from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
the space cache to be rebuilt through the call to:

    btrfs_cache_block_group(cache, 1);

That is because that only waits for the task (work queue job) that loads
the space cache to change the cache state from BTRFS_CACHE_FAST to any
other value. That is ok when the space cache on disk exists and is valid,
but when the cache is not valid and needs to be rebuilt, it ends up
returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
at caching_thread()).

So this means that we can end up trying to unpin a range which is not yet
marked as free in the block group. This results in the call to
btrfs_remove_free_space() to return -EINVAL to
btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
as well as mounting the filesystem. More specifically the -EINVAL comes
from free_space_cache.c:remove_from_bitmap(), because the requested range
is not marked as free space (ones in the bitmap), we have the following
condition triggered:

static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
(...)
       if (ret < 0 || search_start != *offset)
            return -EINVAL;
(...)

It's the "search_start != *offset" that results in the condition being
evaluated to true.

When this happens we got the following in dmesg/syslog:

[72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
[72383.417837] BTRFS info (device sdb): disk space caching is enabled
[72383.418536] BTRFS info (device sdb): has skinny extents
[72383.423846] BTRFS info (device sdb): start tree-log replay
[72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
[72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
[72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
[72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
[72383.460241] BTRFS error (device sdb): open_ctree failed

We also mark the range for the extent buffer in the excluded extents io
tree. That is fine when the space cache is valid on disk and we can load
it, in which case it causes no problems.

However, for the case where we need to rebuild the space cache, because it
is either invalid or it is missing, having the extent buffer range marked
in the excluded extents io tree leads to a -EINVAL failure from the call
to btrfs_remove_free_space(), resulting in the log replay and mount to
fail. This is because by having the range marked in the excluded extents
io tree, the caching thread ends up never adding the range of the extent
buffer as free space in the block group since the calls to
add_new_free_space(), called from load_extent_tree_free(), filter out any
ranges that are marked as excluded extents.

So fix this by making sure that during log replay we wait for the caching
task to finish completely when we need to rebuild a space cache, and also
drop the need to mark the extent buffer range in the excluded extents io
tree, as well as clearing ranges from that tree at
btrfs_finish_extent_commit().

This started to happen with some frequency on large filesystems having
block groups with a lot of fragmentation since the recent commit
e747853cae3ae3 ("btrfs: load free space cache asynchronously"), but in
fact the issue has been there for years, it was just much less likely
to happen.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Updated patch with Josef's comments and combined both patches into a
    single one with an updated changelog.
V3: Eliminated no longer needed code, we do wait for the caching to be
    complete before removing free space, so a lot of code from
    __exclude_logged_extent() is no longer needed. Also fixed a leak of
    the block group in case an error happens.

 fs/btrfs/extent-tree.c | 61 +++++++++++++-----------------------------
 1 file changed, 18 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 30b1a630dc2f..0c335dae5af7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2602,8 +2602,6 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
 	struct btrfs_block_group *cache;
 	int ret;
 
-	btrfs_add_excluded_extent(trans->fs_info, bytenr, num_bytes);
-
 	cache = btrfs_lookup_block_group(trans->fs_info, bytenr);
 	if (!cache)
 		return -EINVAL;
@@ -2615,11 +2613,19 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
 	 * the pinned extents.
 	 */
 	btrfs_cache_block_group(cache, 1);
+	/*
+	 * Make sure we wait until the cache is completely built in case it is
+	 * missing or is invalid and therefore needs to be rebuilt.
+	 */
+	ret = btrfs_wait_block_group_cache_done(cache);
+	if (ret)
+		goto out;
 
 	pin_down_extent(trans, cache, bytenr, num_bytes, 0);
 
 	/* remove us from the free space cache (if we're there at all) */
 	ret = btrfs_remove_free_space(cache, bytenr, num_bytes);
+out:
 	btrfs_put_block_group(cache);
 	return ret;
 }
@@ -2629,50 +2635,22 @@ static int __exclude_logged_extent(struct btrfs_fs_info *fs_info,
 {
 	int ret;
 	struct btrfs_block_group *block_group;
-	struct btrfs_caching_control *caching_ctl;
 
 	block_group = btrfs_lookup_block_group(fs_info, start);
 	if (!block_group)
 		return -EINVAL;
 
-	btrfs_cache_block_group(block_group, 0);
-	caching_ctl = btrfs_get_caching_control(block_group);
-
-	if (!caching_ctl) {
-		/* Logic error */
-		BUG_ON(!btrfs_block_group_done(block_group));
-		ret = btrfs_remove_free_space(block_group, start, num_bytes);
-	} else {
-		/*
-		 * We must wait for v1 caching to finish, otherwise we may not
-		 * remove our space.
-		 */
-		btrfs_wait_space_cache_v1_finished(block_group, caching_ctl);
-		mutex_lock(&caching_ctl->mutex);
-
-		if (start >= caching_ctl->progress) {
-			ret = btrfs_add_excluded_extent(fs_info, start,
-							num_bytes);
-		} else if (start + num_bytes <= caching_ctl->progress) {
-			ret = btrfs_remove_free_space(block_group,
-						      start, num_bytes);
-		} else {
-			num_bytes = caching_ctl->progress - start;
-			ret = btrfs_remove_free_space(block_group,
-						      start, num_bytes);
-			if (ret)
-				goto out_lock;
+	btrfs_cache_block_group(block_group, 1);
+	/*
+	 * Make sure we wait until the cache is completely built in case it is
+	 * missing or is invalid and therefore needs to be rebuilt.
+	 */
+	ret = btrfs_wait_block_group_cache_done(block_group);
+	if (ret)
+		goto out;
 
-			num_bytes = (start + num_bytes) -
-				caching_ctl->progress;
-			start = caching_ctl->progress;
-			ret = btrfs_add_excluded_extent(fs_info, start,
-							num_bytes);
-		}
-out_lock:
-		mutex_unlock(&caching_ctl->mutex);
-		btrfs_put_caching_control(caching_ctl);
-	}
+	ret = btrfs_remove_free_space(block_group, start, num_bytes);
+out:
 	btrfs_put_block_group(block_group);
 	return ret;
 }
@@ -2863,9 +2841,6 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 			mutex_unlock(&fs_info->unused_bg_unpin_mutex);
 			break;
 		}
-		if (test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags))
-			clear_extent_bits(&fs_info->excluded_extents, start,
-					  end, EXTENT_UPTODATE);
 
 		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
 			ret = btrfs_discard_extent(fs_info, start,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] btrfs: fix log replay failure due to race with space cache rebuild
  2021-01-22 19:07 ` [PATCH v3] " fdmanana
@ 2021-01-22 19:19   ` Josef Bacik
  2021-01-25 17:40   ` David Sterba
  1 sibling, 0 replies; 6+ messages in thread
From: Josef Bacik @ 2021-01-22 19:19 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 1/22/21 2:07 PM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> After a sudden power failure we may end up with a space cache on disk that
> is not valid and needs to be rebuilt from scratch.
> 
> If that happens, during log replay when we attempt to pin an extent buffer
> from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
> the space cache to be rebuilt through the call to:
> 
>      btrfs_cache_block_group(cache, 1);
> 
> That is because that only waits for the task (work queue job) that loads
> the space cache to change the cache state from BTRFS_CACHE_FAST to any
> other value. That is ok when the space cache on disk exists and is valid,
> but when the cache is not valid and needs to be rebuilt, it ends up
> returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
> at caching_thread()).
> 
> So this means that we can end up trying to unpin a range which is not yet
> marked as free in the block group. This results in the call to
> btrfs_remove_free_space() to return -EINVAL to
> btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
> as well as mounting the filesystem. More specifically the -EINVAL comes
> from free_space_cache.c:remove_from_bitmap(), because the requested range
> is not marked as free space (ones in the bitmap), we have the following
> condition triggered:
> 
> static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
> (...)
>         if (ret < 0 || search_start != *offset)
>              return -EINVAL;
> (...)
> 
> It's the "search_start != *offset" that results in the condition being
> evaluated to true.
> 
> When this happens we got the following in dmesg/syslog:
> 
> [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
> [72383.417837] BTRFS info (device sdb): disk space caching is enabled
> [72383.418536] BTRFS info (device sdb): has skinny extents
> [72383.423846] BTRFS info (device sdb): start tree-log replay
> [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
> [72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
> [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
> [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
> [72383.460241] BTRFS error (device sdb): open_ctree failed
> 
> We also mark the range for the extent buffer in the excluded extents io
> tree. That is fine when the space cache is valid on disk and we can load
> it, in which case it causes no problems.
> 
> However, for the case where we need to rebuild the space cache, because it
> is either invalid or it is missing, having the extent buffer range marked
> in the excluded extents io tree leads to a -EINVAL failure from the call
> to btrfs_remove_free_space(), resulting in the log replay and mount to
> fail. This is because by having the range marked in the excluded extents
> io tree, the caching thread ends up never adding the range of the extent
> buffer as free space in the block group since the calls to
> add_new_free_space(), called from load_extent_tree_free(), filter out any
> ranges that are marked as excluded extents.
> 
> So fix this by making sure that during log replay we wait for the caching
> task to finish completely when we need to rebuild a space cache, and also
> drop the need to mark the extent buffer range in the excluded extents io
> tree, as well as clearing ranges from that tree at
> btrfs_finish_extent_commit().
> 
> This started to happen with some frequency on large filesystems having
> block groups with a lot of fragmentation since the recent commit
> e747853cae3ae3 ("btrfs: load free space cache asynchronously"), but in
> fact the issue has been there for years, it was just much less likely
> to happen.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] btrfs: fix log replay failure due to race with space cache rebuild
  2021-01-22 19:07 ` [PATCH v3] " fdmanana
  2021-01-22 19:19   ` Josef Bacik
@ 2021-01-25 17:40   ` David Sterba
  1 sibling, 0 replies; 6+ messages in thread
From: David Sterba @ 2021-01-25 17:40 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Fri, Jan 22, 2021 at 07:07:45PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> After a sudden power failure we may end up with a space cache on disk that
> is not valid and needs to be rebuilt from scratch.
> 
> If that happens, during log replay when we attempt to pin an extent buffer
> from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
> the space cache to be rebuilt through the call to:
> 
>     btrfs_cache_block_group(cache, 1);
> 
> That is because that only waits for the task (work queue job) that loads
> the space cache to change the cache state from BTRFS_CACHE_FAST to any
> other value. That is ok when the space cache on disk exists and is valid,
> but when the cache is not valid and needs to be rebuilt, it ends up
> returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
> at caching_thread()).
> 
> So this means that we can end up trying to unpin a range which is not yet
> marked as free in the block group. This results in the call to
> btrfs_remove_free_space() to return -EINVAL to
> btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
> as well as mounting the filesystem. More specifically the -EINVAL comes
> from free_space_cache.c:remove_from_bitmap(), because the requested range
> is not marked as free space (ones in the bitmap), we have the following
> condition triggered:
> 
> static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
> (...)
>        if (ret < 0 || search_start != *offset)
>             return -EINVAL;
> (...)
> 
> It's the "search_start != *offset" that results in the condition being
> evaluated to true.
> 
> When this happens we got the following in dmesg/syslog:
> 
> [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
> [72383.417837] BTRFS info (device sdb): disk space caching is enabled
> [72383.418536] BTRFS info (device sdb): has skinny extents
> [72383.423846] BTRFS info (device sdb): start tree-log replay
> [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
> [72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
> [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
> [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
> [72383.460241] BTRFS error (device sdb): open_ctree failed
> 
> We also mark the range for the extent buffer in the excluded extents io
> tree. That is fine when the space cache is valid on disk and we can load
> it, in which case it causes no problems.
> 
> However, for the case where we need to rebuild the space cache, because it
> is either invalid or it is missing, having the extent buffer range marked
> in the excluded extents io tree leads to a -EINVAL failure from the call
> to btrfs_remove_free_space(), resulting in the log replay and mount to
> fail. This is because by having the range marked in the excluded extents
> io tree, the caching thread ends up never adding the range of the extent
> buffer as free space in the block group since the calls to
> add_new_free_space(), called from load_extent_tree_free(), filter out any
> ranges that are marked as excluded extents.
> 
> So fix this by making sure that during log replay we wait for the caching
> task to finish completely when we need to rebuild a space cache, and also
> drop the need to mark the extent buffer range in the excluded extents io
> tree, as well as clearing ranges from that tree at
> btrfs_finish_extent_commit().
> 
> This started to happen with some frequency on large filesystems having
> block groups with a lot of fragmentation since the recent commit
> e747853cae3ae3 ("btrfs: load free space cache asynchronously"), but in
> fact the issue has been there for years, it was just much less likely
> to happen.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Added to misc-next, thanks. As this is more likely to happen on 5.11 due
to async loading I'll probably forward it to the next rc.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-01-25 18:11 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-22 17:56 [PATCH] btrfs: fix log replay failure due to race with space cache rebuild fdmanana
2021-01-22 18:39 ` Josef Bacik
2021-01-22 18:42   ` Filipe Manana
2021-01-22 19:07 ` [PATCH v3] " fdmanana
2021-01-22 19:19   ` Josef Bacik
2021-01-25 17:40   ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.