[PATCH v2] btrfs: remove an ASSERT() which can cause false alert

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
@ 2022-02-17  9:49 Qu Wenruo
  2022-02-17 11:14 ` Filipe Manana
  2022-04-19 20:34 ` David Sterba
  0 siblings, 2 replies; 9+ messages in thread
From: Qu Wenruo @ 2022-02-17  9:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Su Yue

[BUG]
Su reported that with his VM running on an M1 mac, it has a high chance
to trigger the following ASSERT() with scrub and balance test cases:

		ASSERT(cache->start == chunk_offset);
		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
				  dev_extent_len);

[CAUSE]
The ASSERT() is making sure that the block group cache (@cache) and the
chunk offset from the device extent match.

But the truth is, block group caches and dev extent items are deleted at
different timing.

For block group caches and items, after btrfs_remove_block_group() they
will be marked deleted, but the device extent and the chunk item is
still in dev tree and chunk tree respectively.

The device extents and chunk items are only deleted in
btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
or btrfs_relocate_chunk().

This timing difference leaves a window where we can have a mismatch
between block group cache and device extent tree, and triggering the
ASSERT().

[FIX]

- Remove the ASSERT()

- Add a quick exit if our found bg doesn't match @chunk_offset

- Add a comment on the existing checks in scrub_chunk()
  This is the ultimate safenet, as it will iterate through all the stripes
  of the found chunk.
  And only scrub the stripe if it matches both device and physical
  offset.

  So even by some how we got a dev extent which is no longer owned
  by the target chunk, we will just skip this chunk completely, without
  any consequence.

Reported-by: Su Yue <l@damenly.su>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Add a new quick exit with extra comments

- Add a new comment in the existing safenet in scrub_chunk()
---
 fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 11089568b287..1c3ed4720ddd 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 		goto out;
 
 	map = em->map_lookup;
+
+	/*
+	 * Due to the delayed modification of device tree, this chunk
+	 * may not own this dev extent.
+	 *
+	 * So we need to iterate through all stripes, and only scrub
+	 * the stripe which matches both device and physical offset.
+	 */
 	for (i = 0; i < map->num_stripes; ++i) {
 		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
 		    map->stripes[i].physical == dev_offset) {
@@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
 
+		/*
+		 * Since dev extents modification is delayed, it's possible
+		 * we got a bg which doesn't really own this dev extent.
+		 *
+		 * Here just do a quick skip, scrub_chunk() has a more
+		 * comprehensive check in it.
+		 */
+		if (cache->start != chunk_offset) {
+			btrfs_put_block_group(cache);
+			goto skip;
+		}
+
 		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
 			spin_lock(&cache->lock);
 			if (!cache->to_copy) {
@@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		dev_replace->item_needs_writeback = 1;
 		up_write(&dev_replace->rwsem);
 
-		ASSERT(cache->start == chunk_offset);
 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
 				  dev_extent_len);
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17  9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
@ 2022-02-17 11:14 ` Filipe Manana
  2022-02-17 11:24   ` Qu Wenruo
  2022-04-19 20:34 ` David Sterba
  1 sibling, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2022-02-17 11:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Su Yue

On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
> [BUG]
> Su reported that with his VM running on an M1 mac, it has a high chance
> to trigger the following ASSERT() with scrub and balance test cases:
> 
> 		ASSERT(cache->start == chunk_offset);
> 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> 				  dev_extent_len);
> 
> [CAUSE]
> The ASSERT() is making sure that the block group cache (@cache) and the
> chunk offset from the device extent match.
> 
> But the truth is, block group caches and dev extent items are deleted at
> different timing.
> 
> For block group caches and items, after btrfs_remove_block_group() they
> will be marked deleted, but the device extent and the chunk item is
> still in dev tree and chunk tree respectively.

This is not correct, what happens is:

1) btrfs_remove_chunk() will remove the device extent items and the chunk
   item;

2) It then calls btrfs_remove_block_group(). This will not remove the
   extent map if the block group was frozen (due to trim, and scrub
   itself). But it will remove the block group (struct btrfs_block_group)
   from the red black tree of block groups, and mark the block group
   as deleted (set struct btrfs_block_group::removed to 1).

   If the extent map was not removed, meaning the block group is frozen,
   then no one will be able to create a new block group with the same logical
   address before the block group is unfrozen (by someone who is holding
   a reference on it). So a lookup on the red black tree will return NULL
   for the logical address until the block group is unfrozen and its
   logical address is reused for a new block group.

So the ordering of what you are saying is reversed - the device extent
and chunk items are removed before marking the block group as deleted.

> 
> The device extents and chunk items are only deleted in
> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
> or btrfs_relocate_chunk().

This is also confusing because it gives a wrong idea of the ordering.
The device extents and chunk items are removed by btrfs_remove_chunk(),
which happens before marking a block group as deleted.

> 
> This timing difference leaves a window where we can have a mismatch
> between block group cache and device extent tree, and triggering the
> ASSERT().

I would like to see more details on this. The sequence of steps between
two tasks that result in this assertion being triggered.

> 
> [FIX]
> 
> - Remove the ASSERT()
> 
> - Add a quick exit if our found bg doesn't match @chunk_offset

This shouldn't happen. I would like to know the sequence of steps
between 2 (or more) tasks that leads to that.

We are getting the block group with btrfs_lookup_block_group() at
scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
with the "contains" argument set to 1, meaning it can return a
block group that contains the given bytenr but does not start at
that bytenr necessarily.

This gives me the idea a small block group was deleted and then a
new one was allocated which starts at a lower logical offset and
includes "chunk_offset" (ends beyond that).

We should probably be using btrfs_lookup_first_block_group() at
scrub_enumerate_chunks(), so it looks for a block group that starts
exactly at the given logical address.

But anyway, as I read this and see the patch's diff, this feels a lot
like a "bail out if something unexpected happens but we don't know
exactly why/how that is possible".

> 
> - Add a comment on the existing checks in scrub_chunk()
>   This is the ultimate safenet, as it will iterate through all the stripes
>   of the found chunk.
>   And only scrub the stripe if it matches both device and physical
>   offset.
> 
>   So even by some how we got a dev extent which is no longer owned
>   by the target chunk, we will just skip this chunk completely, without
>   any consequence.
> 
> Reported-by: Su Yue <l@damenly.su>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Add a new quick exit with extra comments
> 
> - Add a new comment in the existing safenet in scrub_chunk()
> ---
>  fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 11089568b287..1c3ed4720ddd 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>  		goto out;
>  
>  	map = em->map_lookup;
> +
> +	/*
> +	 * Due to the delayed modification of device tree, this chunk
> +	 * may not own this dev extent.

This is confusing. What delayed modification, what do you mean by it
exactly? Same below, with more details.

> +	 *
> +	 * So we need to iterate through all stripes, and only scrub
> +	 * the stripe which matches both device and physical offset.
> +	 */
>  	for (i = 0; i < map->num_stripes; ++i) {
>  		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
>  		    map->stripes[i].physical == dev_offset) {
> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>  		if (!cache)
>  			goto skip;
>  
> +		/*
> +		 * Since dev extents modification is delayed, it's possible
> +		 * we got a bg which doesn't really own this dev extent.

Same as before. Too confusing, what is delayed dev extent modification?

Once added, device extents can only be deleted, they are never modified.
I think you are referring to the deletions and the fact we use the commit
roots to find extents, but that should be a more clear comment, without
such level of ambiguity.

Thanks.

> +		 *
> +		 * Here just do a quick skip, scrub_chunk() has a more
> +		 * comprehensive check in it.
> +		 */
> +		if (cache->start != chunk_offset) {
> +			btrfs_put_block_group(cache);
> +			goto skip;
> +		}
> +
>  		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>  			spin_lock(&cache->lock);
>  			if (!cache->to_copy) {
> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>  		dev_replace->item_needs_writeback = 1;
>  		up_write(&dev_replace->rwsem);
>  
> -		ASSERT(cache->start == chunk_offset);
>  		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>  				  dev_extent_len);
>  
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17 11:14 ` Filipe Manana
@ 2022-02-17 11:24   ` Qu Wenruo
  2022-02-17 12:32     ` Filipe Manana
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2022-02-17 11:24 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs, Su Yue



On 2022/2/17 19:14, Filipe Manana wrote:
> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>> [BUG]
>> Su reported that with his VM running on an M1 mac, it has a high chance
>> to trigger the following ASSERT() with scrub and balance test cases:
>>
>> 		ASSERT(cache->start == chunk_offset);
>> 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>> 				  dev_extent_len);
>>
>> [CAUSE]
>> The ASSERT() is making sure that the block group cache (@cache) and the
>> chunk offset from the device extent match.
>>
>> But the truth is, block group caches and dev extent items are deleted at
>> different timing.
>>
>> For block group caches and items, after btrfs_remove_block_group() they
>> will be marked deleted, but the device extent and the chunk item is
>> still in dev tree and chunk tree respectively.
>
> This is not correct, what happens is:
>
> 1) btrfs_remove_chunk() will remove the device extent items and the chunk
>     item;
>
> 2) It then calls btrfs_remove_block_group(). This will not remove the
>     extent map if the block group was frozen (due to trim, and scrub
>     itself). But it will remove the block group (struct btrfs_block_group)
>     from the red black tree of block groups, and mark the block group
>     as deleted (set struct btrfs_block_group::removed to 1).
>
>     If the extent map was not removed, meaning the block group is frozen,
>     then no one will be able to create a new block group with the same logical
>     address before the block group is unfrozen (by someone who is holding
>     a reference on it). So a lookup on the red black tree will return NULL
>     for the logical address until the block group is unfrozen and its
>     logical address is reused for a new block group.
>
> So the ordering of what you are saying is reversed - the device extent
> and chunk items are removed before marking the block group as deleted.
>
>>
>> The device extents and chunk items are only deleted in
>> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
>> or btrfs_relocate_chunk().
>
> This is also confusing because it gives a wrong idea of the ordering.
> The device extents and chunk items are removed by btrfs_remove_chunk(),
> which happens before marking a block group as deleted.
>
>>
>> This timing difference leaves a window where we can have a mismatch
>> between block group cache and device extent tree, and triggering the
>> ASSERT().
>
> I would like to see more details on this. The sequence of steps between
> two tasks that result in this assertion being triggered.
>
>>
>> [FIX]
>>
>> - Remove the ASSERT()
>>
>> - Add a quick exit if our found bg doesn't match @chunk_offset
>
> This shouldn't happen. I would like to know the sequence of steps
> between 2 (or more) tasks that leads to that.
>
> We are getting the block group with btrfs_lookup_block_group() at
> scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
> with the "contains" argument set to 1, meaning it can return a
> block group that contains the given bytenr but does not start at
> that bytenr necessarily.
>
> This gives me the idea a small block group was deleted and then a
> new one was allocated which starts at a lower logical offset and
> includes "chunk_offset" (ends beyond that).
>
> We should probably be using btrfs_lookup_first_block_group() at
> scrub_enumerate_chunks(), so it looks for a block group that starts
> exactly at the given logical address.
>
> But anyway, as I read this and see the patch's diff, this feels a lot
> like a "bail out if something unexpected happens but we don't know
> exactly why/how that is possible".

You're completely correct.

Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all
my ARM VMs are too slow to reproduce.

Furthermore, due to some unrelated bugs, v5.17-rc based kernels all
failed to boot inside the VMs on Apple M1.

So unfortunately I don't have extra details for now.

Will find a way to reproduce first, then update the patch with better
comments.

Thanks,
Qu

>
>>
>> - Add a comment on the existing checks in scrub_chunk()
>>    This is the ultimate safenet, as it will iterate through all the stripes
>>    of the found chunk.
>>    And only scrub the stripe if it matches both device and physical
>>    offset.
>>
>>    So even by some how we got a dev extent which is no longer owned
>>    by the target chunk, we will just skip this chunk completely, without
>>    any consequence.
>>
>> Reported-by: Su Yue <l@damenly.su>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> v2:
>> - Add a new quick exit with extra comments
>>
>> - Add a new comment in the existing safenet in scrub_chunk()
>> ---
>>   fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>>   1 file changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index 11089568b287..1c3ed4720ddd 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>>   		goto out;
>>
>>   	map = em->map_lookup;
>> +
>> +	/*
>> +	 * Due to the delayed modification of device tree, this chunk
>> +	 * may not own this dev extent.
>
> This is confusing. What delayed modification, what do you mean by it
> exactly? Same below, with more details.
>
>> +	 *
>> +	 * So we need to iterate through all stripes, and only scrub
>> +	 * the stripe which matches both device and physical offset.
>> +	 */
>>   	for (i = 0; i < map->num_stripes; ++i) {
>>   		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
>>   		    map->stripes[i].physical == dev_offset) {
>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>   		if (!cache)
>>   			goto skip;
>>
>> +		/*
>> +		 * Since dev extents modification is delayed, it's possible
>> +		 * we got a bg which doesn't really own this dev extent.
>
> Same as before. Too confusing, what is delayed dev extent modification?
>
> Once added, device extents can only be deleted, they are never modified.
> I think you are referring to the deletions and the fact we use the commit
> roots to find extents, but that should be a more clear comment, without
> such level of ambiguity.
>
> Thanks.
>
>> +		 *
>> +		 * Here just do a quick skip, scrub_chunk() has a more
>> +		 * comprehensive check in it.
>> +		 */
>> +		if (cache->start != chunk_offset) {
>> +			btrfs_put_block_group(cache);
>> +			goto skip;
>> +		}
>> +
>>   		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>>   			spin_lock(&cache->lock);
>>   			if (!cache->to_copy) {
>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>   		dev_replace->item_needs_writeback = 1;
>>   		up_write(&dev_replace->rwsem);
>>
>> -		ASSERT(cache->start == chunk_offset);
>>   		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>   				  dev_extent_len);
>>
>> --
>> 2.35.1
>>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17 11:24   ` Qu Wenruo
@ 2022-02-17 12:32     ` Filipe Manana
  2022-02-17 13:00       ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2022-02-17 12:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Su Yue

On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/2/17 19:14, Filipe Manana wrote:
> > On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
> >> [BUG]
> >> Su reported that with his VM running on an M1 mac, it has a high chance
> >> to trigger the following ASSERT() with scrub and balance test cases:
> >>
> >>              ASSERT(cache->start == chunk_offset);
> >>              ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> >>                                dev_extent_len);
> >>
> >> [CAUSE]
> >> The ASSERT() is making sure that the block group cache (@cache) and the
> >> chunk offset from the device extent match.
> >>
> >> But the truth is, block group caches and dev extent items are deleted at
> >> different timing.
> >>
> >> For block group caches and items, after btrfs_remove_block_group() they
> >> will be marked deleted, but the device extent and the chunk item is
> >> still in dev tree and chunk tree respectively.
> >
> > This is not correct, what happens is:
> >
> > 1) btrfs_remove_chunk() will remove the device extent items and the chunk
> >     item;
> >
> > 2) It then calls btrfs_remove_block_group(). This will not remove the
> >     extent map if the block group was frozen (due to trim, and scrub
> >     itself). But it will remove the block group (struct btrfs_block_group)
> >     from the red black tree of block groups, and mark the block group
> >     as deleted (set struct btrfs_block_group::removed to 1).
> >
> >     If the extent map was not removed, meaning the block group is frozen,
> >     then no one will be able to create a new block group with the same logical
> >     address before the block group is unfrozen (by someone who is holding
> >     a reference on it). So a lookup on the red black tree will return NULL
> >     for the logical address until the block group is unfrozen and its
> >     logical address is reused for a new block group.
> >
> > So the ordering of what you are saying is reversed - the device extent
> > and chunk items are removed before marking the block group as deleted.
> >
> >>
> >> The device extents and chunk items are only deleted in
> >> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
> >> or btrfs_relocate_chunk().
> >
> > This is also confusing because it gives a wrong idea of the ordering.
> > The device extents and chunk items are removed by btrfs_remove_chunk(),
> > which happens before marking a block group as deleted.
> >
> >>
> >> This timing difference leaves a window where we can have a mismatch
> >> between block group cache and device extent tree, and triggering the
> >> ASSERT().
> >
> > I would like to see more details on this. The sequence of steps between
> > two tasks that result in this assertion being triggered.
> >
> >>
> >> [FIX]
> >>
> >> - Remove the ASSERT()
> >>
> >> - Add a quick exit if our found bg doesn't match @chunk_offset
> >
> > This shouldn't happen. I would like to know the sequence of steps
> > between 2 (or more) tasks that leads to that.
> >
> > We are getting the block group with btrfs_lookup_block_group() at
> > scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
> > with the "contains" argument set to 1, meaning it can return a
> > block group that contains the given bytenr but does not start at
> > that bytenr necessarily.
> >
> > This gives me the idea a small block group was deleted and then a
> > new one was allocated which starts at a lower logical offset and
> > includes "chunk_offset" (ends beyond that).
> >
> > We should probably be using btrfs_lookup_first_block_group() at
> > scrub_enumerate_chunks(), so it looks for a block group that starts
> > exactly at the given logical address.
> >
> > But anyway, as I read this and see the patch's diff, this feels a lot
> > like a "bail out if something unexpected happens but we don't know
> > exactly why/how that is possible".
>
> You're completely correct.
>
> Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all
> my ARM VMs are too slow to reproduce.
>
> Furthermore, due to some unrelated bugs, v5.17-rc based kernels all
> failed to boot inside the VMs on Apple M1.
>
> So unfortunately I don't have extra details for now.
>
> Will find a way to reproduce first, then update the patch with better
> comments.

Was it triggered with some specific test case from fstests?

>
> Thanks,
> Qu
>
> >
> >>
> >> - Add a comment on the existing checks in scrub_chunk()
> >>    This is the ultimate safenet, as it will iterate through all the stripes
> >>    of the found chunk.
> >>    And only scrub the stripe if it matches both device and physical
> >>    offset.
> >>
> >>    So even by some how we got a dev extent which is no longer owned
> >>    by the target chunk, we will just skip this chunk completely, without
> >>    any consequence.
> >>
> >> Reported-by: Su Yue <l@damenly.su>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >> Changelog:
> >> v2:
> >> - Add a new quick exit with extra comments
> >>
> >> - Add a new comment in the existing safenet in scrub_chunk()
> >> ---
> >>   fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
> >>   1 file changed, 20 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> >> index 11089568b287..1c3ed4720ddd 100644
> >> --- a/fs/btrfs/scrub.c
> >> +++ b/fs/btrfs/scrub.c
> >> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
> >>              goto out;
> >>
> >>      map = em->map_lookup;
> >> +
> >> +    /*
> >> +     * Due to the delayed modification of device tree, this chunk
> >> +     * may not own this dev extent.
> >
> > This is confusing. What delayed modification, what do you mean by it
> > exactly? Same below, with more details.
> >
> >> +     *
> >> +     * So we need to iterate through all stripes, and only scrub
> >> +     * the stripe which matches both device and physical offset.
> >> +     */
> >>      for (i = 0; i < map->num_stripes; ++i) {
> >>              if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
> >>                  map->stripes[i].physical == dev_offset) {
> >> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>              if (!cache)
> >>                      goto skip;
> >>
> >> +            /*
> >> +             * Since dev extents modification is delayed, it's possible
> >> +             * we got a bg which doesn't really own this dev extent.
> >
> > Same as before. Too confusing, what is delayed dev extent modification?
> >
> > Once added, device extents can only be deleted, they are never modified.
> > I think you are referring to the deletions and the fact we use the commit
> > roots to find extents, but that should be a more clear comment, without
> > such level of ambiguity.
> >
> > Thanks.
> >
> >> +             *
> >> +             * Here just do a quick skip, scrub_chunk() has a more
> >> +             * comprehensive check in it.
> >> +             */
> >> +            if (cache->start != chunk_offset) {
> >> +                    btrfs_put_block_group(cache);
> >> +                    goto skip;
> >> +            }
> >> +
> >>              if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
> >>                      spin_lock(&cache->lock);
> >>                      if (!cache->to_copy) {
> >> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>              dev_replace->item_needs_writeback = 1;
> >>              up_write(&dev_replace->rwsem);
> >>
> >> -            ASSERT(cache->start == chunk_offset);
> >>              ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> >>                                dev_extent_len);
> >>
> >> --
> >> 2.35.1
> >>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17 12:32     ` Filipe Manana
@ 2022-02-17 13:00       ` Qu Wenruo
  2022-02-17 13:02         ` Filipe Manana
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2022-02-17 13:00 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs, Su Yue



On 2022/2/17 20:32, Filipe Manana wrote:
> On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/2/17 19:14, Filipe Manana wrote:
>>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>>>> [BUG]
>>>> Su reported that with his VM running on an M1 mac, it has a high chance
>>>> to trigger the following ASSERT() with scrub and balance test cases:
>>>>
>>>>               ASSERT(cache->start == chunk_offset);
>>>>               ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>>>                                 dev_extent_len);
>>>>
>>>> [CAUSE]
>>>> The ASSERT() is making sure that the block group cache (@cache) and the
>>>> chunk offset from the device extent match.
>>>>
>>>> But the truth is, block group caches and dev extent items are deleted at
>>>> different timing.
>>>>
>>>> For block group caches and items, after btrfs_remove_block_group() they
>>>> will be marked deleted, but the device extent and the chunk item is
>>>> still in dev tree and chunk tree respectively.
>>>
>>> This is not correct, what happens is:
>>>
>>> 1) btrfs_remove_chunk() will remove the device extent items and the chunk
>>>      item;
>>>
>>> 2) It then calls btrfs_remove_block_group(). This will not remove the
>>>      extent map if the block group was frozen (due to trim, and scrub
>>>      itself). But it will remove the block group (struct btrfs_block_group)
>>>      from the red black tree of block groups, and mark the block group
>>>      as deleted (set struct btrfs_block_group::removed to 1).
>>>
>>>      If the extent map was not removed, meaning the block group is frozen,
>>>      then no one will be able to create a new block group with the same logical
>>>      address before the block group is unfrozen (by someone who is holding
>>>      a reference on it). So a lookup on the red black tree will return NULL
>>>      for the logical address until the block group is unfrozen and its
>>>      logical address is reused for a new block group.
>>>
>>> So the ordering of what you are saying is reversed - the device extent
>>> and chunk items are removed before marking the block group as deleted.
>>>
>>>>
>>>> The device extents and chunk items are only deleted in
>>>> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
>>>> or btrfs_relocate_chunk().
>>>
>>> This is also confusing because it gives a wrong idea of the ordering.
>>> The device extents and chunk items are removed by btrfs_remove_chunk(),
>>> which happens before marking a block group as deleted.
>>>
>>>>
>>>> This timing difference leaves a window where we can have a mismatch
>>>> between block group cache and device extent tree, and triggering the
>>>> ASSERT().
>>>
>>> I would like to see more details on this. The sequence of steps between
>>> two tasks that result in this assertion being triggered.
>>>
>>>>
>>>> [FIX]
>>>>
>>>> - Remove the ASSERT()
>>>>
>>>> - Add a quick exit if our found bg doesn't match @chunk_offset
>>>
>>> This shouldn't happen. I would like to know the sequence of steps
>>> between 2 (or more) tasks that leads to that.
>>>
>>> We are getting the block group with btrfs_lookup_block_group() at
>>> scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
>>> with the "contains" argument set to 1, meaning it can return a
>>> block group that contains the given bytenr but does not start at
>>> that bytenr necessarily.
>>>
>>> This gives me the idea a small block group was deleted and then a
>>> new one was allocated which starts at a lower logical offset and
>>> includes "chunk_offset" (ends beyond that).
>>>
>>> We should probably be using btrfs_lookup_first_block_group() at
>>> scrub_enumerate_chunks(), so it looks for a block group that starts
>>> exactly at the given logical address.
>>>
>>> But anyway, as I read this and see the patch's diff, this feels a lot
>>> like a "bail out if something unexpected happens but we don't know
>>> exactly why/how that is possible".
>>
>> You're completely correct.
>>
>> Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all
>> my ARM VMs are too slow to reproduce.
>>
>> Furthermore, due to some unrelated bugs, v5.17-rc based kernels all
>> failed to boot inside the VMs on Apple M1.
>>
>> So unfortunately I don't have extra details for now.
>>
>> Will find a way to reproduce first, then update the patch with better
>> comments.
>
> Was it triggered with some specific test case from fstests?

Yes, from a regular run on fstests.

But that was on an older branch (some older misc-next, which has the
patch but still based on v5.16).

The v5.17-rc based branch will not boot inside the ARM vm at all, and no
early boot messages after the EFI routine.

Will attack the boot problem first.

Thanks,
Qu
>
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> - Add a comment on the existing checks in scrub_chunk()
>>>>     This is the ultimate safenet, as it will iterate through all the stripes
>>>>     of the found chunk.
>>>>     And only scrub the stripe if it matches both device and physical
>>>>     offset.
>>>>
>>>>     So even by some how we got a dev extent which is no longer owned
>>>>     by the target chunk, we will just skip this chunk completely, without
>>>>     any consequence.
>>>>
>>>> Reported-by: Su Yue <l@damenly.su>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> Changelog:
>>>> v2:
>>>> - Add a new quick exit with extra comments
>>>>
>>>> - Add a new comment in the existing safenet in scrub_chunk()
>>>> ---
>>>>    fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>>>>    1 file changed, 20 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>>> index 11089568b287..1c3ed4720ddd 100644
>>>> --- a/fs/btrfs/scrub.c
>>>> +++ b/fs/btrfs/scrub.c
>>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>>>>               goto out;
>>>>
>>>>       map = em->map_lookup;
>>>> +
>>>> +    /*
>>>> +     * Due to the delayed modification of device tree, this chunk
>>>> +     * may not own this dev extent.
>>>
>>> This is confusing. What delayed modification, what do you mean by it
>>> exactly? Same below, with more details.
>>>
>>>> +     *
>>>> +     * So we need to iterate through all stripes, and only scrub
>>>> +     * the stripe which matches both device and physical offset.
>>>> +     */
>>>>       for (i = 0; i < map->num_stripes; ++i) {
>>>>               if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
>>>>                   map->stripes[i].physical == dev_offset) {
>>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>>>               if (!cache)
>>>>                       goto skip;
>>>>
>>>> +            /*
>>>> +             * Since dev extents modification is delayed, it's possible
>>>> +             * we got a bg which doesn't really own this dev extent.
>>>
>>> Same as before. Too confusing, what is delayed dev extent modification?
>>>
>>> Once added, device extents can only be deleted, they are never modified.
>>> I think you are referring to the deletions and the fact we use the commit
>>> roots to find extents, but that should be a more clear comment, without
>>> such level of ambiguity.
>>>
>>> Thanks.
>>>
>>>> +             *
>>>> +             * Here just do a quick skip, scrub_chunk() has a more
>>>> +             * comprehensive check in it.
>>>> +             */
>>>> +            if (cache->start != chunk_offset) {
>>>> +                    btrfs_put_block_group(cache);
>>>> +                    goto skip;
>>>> +            }
>>>> +
>>>>               if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>>>>                       spin_lock(&cache->lock);
>>>>                       if (!cache->to_copy) {
>>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>>>               dev_replace->item_needs_writeback = 1;
>>>>               up_write(&dev_replace->rwsem);
>>>>
>>>> -            ASSERT(cache->start == chunk_offset);
>>>>               ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>>>                                 dev_extent_len);
>>>>
>>>> --
>>>> 2.35.1
>>>>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17 13:00       ` Qu Wenruo
@ 2022-02-17 13:02         ` Filipe Manana
  2022-02-17 13:25           ` Qu Wenruo
  2022-02-18  2:29           ` Su Yue
  0 siblings, 2 replies; 9+ messages in thread
From: Filipe Manana @ 2022-02-17 13:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Su Yue

On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/2/17 20:32, Filipe Manana wrote:
> > On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2022/2/17 19:14, Filipe Manana wrote:
> >>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
> >>>> [BUG]
> >>>> Su reported that with his VM running on an M1 mac, it has a high chance
> >>>> to trigger the following ASSERT() with scrub and balance test cases:
> >>>>
> >>>>               ASSERT(cache->start == chunk_offset);
> >>>>               ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> >>>>                                 dev_extent_len);
> >>>>
> >>>> [CAUSE]
> >>>> The ASSERT() is making sure that the block group cache (@cache) and the
> >>>> chunk offset from the device extent match.
> >>>>
> >>>> But the truth is, block group caches and dev extent items are deleted at
> >>>> different timing.
> >>>>
> >>>> For block group caches and items, after btrfs_remove_block_group() they
> >>>> will be marked deleted, but the device extent and the chunk item is
> >>>> still in dev tree and chunk tree respectively.
> >>>
> >>> This is not correct, what happens is:
> >>>
> >>> 1) btrfs_remove_chunk() will remove the device extent items and the chunk
> >>>      item;
> >>>
> >>> 2) It then calls btrfs_remove_block_group(). This will not remove the
> >>>      extent map if the block group was frozen (due to trim, and scrub
> >>>      itself). But it will remove the block group (struct btrfs_block_group)
> >>>      from the red black tree of block groups, and mark the block group
> >>>      as deleted (set struct btrfs_block_group::removed to 1).
> >>>
> >>>      If the extent map was not removed, meaning the block group is frozen,
> >>>      then no one will be able to create a new block group with the same logical
> >>>      address before the block group is unfrozen (by someone who is holding
> >>>      a reference on it). So a lookup on the red black tree will return NULL
> >>>      for the logical address until the block group is unfrozen and its
> >>>      logical address is reused for a new block group.
> >>>
> >>> So the ordering of what you are saying is reversed - the device extent
> >>> and chunk items are removed before marking the block group as deleted.
> >>>
> >>>>
> >>>> The device extents and chunk items are only deleted in
> >>>> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
> >>>> or btrfs_relocate_chunk().
> >>>
> >>> This is also confusing because it gives a wrong idea of the ordering.
> >>> The device extents and chunk items are removed by btrfs_remove_chunk(),
> >>> which happens before marking a block group as deleted.
> >>>
> >>>>
> >>>> This timing difference leaves a window where we can have a mismatch
> >>>> between block group cache and device extent tree, and triggering the
> >>>> ASSERT().
> >>>
> >>> I would like to see more details on this. The sequence of steps between
> >>> two tasks that result in this assertion being triggered.
> >>>
> >>>>
> >>>> [FIX]
> >>>>
> >>>> - Remove the ASSERT()
> >>>>
> >>>> - Add a quick exit if our found bg doesn't match @chunk_offset
> >>>
> >>> This shouldn't happen. I would like to know the sequence of steps
> >>> between 2 (or more) tasks that leads to that.
> >>>
> >>> We are getting the block group with btrfs_lookup_block_group() at
> >>> scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
> >>> with the "contains" argument set to 1, meaning it can return a
> >>> block group that contains the given bytenr but does not start at
> >>> that bytenr necessarily.
> >>>
> >>> This gives me the idea a small block group was deleted and then a
> >>> new one was allocated which starts at a lower logical offset and
> >>> includes "chunk_offset" (ends beyond that).
> >>>
> >>> We should probably be using btrfs_lookup_first_block_group() at
> >>> scrub_enumerate_chunks(), so it looks for a block group that starts
> >>> exactly at the given logical address.
> >>>
> >>> But anyway, as I read this and see the patch's diff, this feels a lot
> >>> like a "bail out if something unexpected happens but we don't know
> >>> exactly why/how that is possible".
> >>
> >> You're completely correct.
> >>
> >> Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all
> >> my ARM VMs are too slow to reproduce.
> >>
> >> Furthermore, due to some unrelated bugs, v5.17-rc based kernels all
> >> failed to boot inside the VMs on Apple M1.
> >>
> >> So unfortunately I don't have extra details for now.
> >>
> >> Will find a way to reproduce first, then update the patch with better
> >> comments.
> >
> > Was it triggered with some specific test case from fstests?
>
> Yes, from a regular run on fstests.

Ok, but which specific test(s)?

>
> But that was on an older branch (some older misc-next, which has the
> patch but still based on v5.16).
>
> The v5.17-rc based branch will not boot inside the ARM vm at all, and no
> early boot messages after the EFI routine.
>
> Will attack the boot problem first.
>
> Thanks,
> Qu
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>>>
> >>>> - Add a comment on the existing checks in scrub_chunk()
> >>>>     This is the ultimate safenet, as it will iterate through all the stripes
> >>>>     of the found chunk.
> >>>>     And only scrub the stripe if it matches both device and physical
> >>>>     offset.
> >>>>
> >>>>     So even by some how we got a dev extent which is no longer owned
> >>>>     by the target chunk, we will just skip this chunk completely, without
> >>>>     any consequence.
> >>>>
> >>>> Reported-by: Su Yue <l@damenly.su>
> >>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >>>> ---
> >>>> Changelog:
> >>>> v2:
> >>>> - Add a new quick exit with extra comments
> >>>>
> >>>> - Add a new comment in the existing safenet in scrub_chunk()
> >>>> ---
> >>>>    fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
> >>>>    1 file changed, 20 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> >>>> index 11089568b287..1c3ed4720ddd 100644
> >>>> --- a/fs/btrfs/scrub.c
> >>>> +++ b/fs/btrfs/scrub.c
> >>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
> >>>>               goto out;
> >>>>
> >>>>       map = em->map_lookup;
> >>>> +
> >>>> +    /*
> >>>> +     * Due to the delayed modification of device tree, this chunk
> >>>> +     * may not own this dev extent.
> >>>
> >>> This is confusing. What delayed modification, what do you mean by it
> >>> exactly? Same below, with more details.
> >>>
> >>>> +     *
> >>>> +     * So we need to iterate through all stripes, and only scrub
> >>>> +     * the stripe which matches both device and physical offset.
> >>>> +     */
> >>>>       for (i = 0; i < map->num_stripes; ++i) {
> >>>>               if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
> >>>>                   map->stripes[i].physical == dev_offset) {
> >>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>>>               if (!cache)
> >>>>                       goto skip;
> >>>>
> >>>> +            /*
> >>>> +             * Since dev extents modification is delayed, it's possible
> >>>> +             * we got a bg which doesn't really own this dev extent.
> >>>
> >>> Same as before. Too confusing, what is delayed dev extent modification?
> >>>
> >>> Once added, device extents can only be deleted, they are never modified.
> >>> I think you are referring to the deletions and the fact we use the commit
> >>> roots to find extents, but that should be a more clear comment, without
> >>> such level of ambiguity.
> >>>
> >>> Thanks.
> >>>
> >>>> +             *
> >>>> +             * Here just do a quick skip, scrub_chunk() has a more
> >>>> +             * comprehensive check in it.
> >>>> +             */
> >>>> +            if (cache->start != chunk_offset) {
> >>>> +                    btrfs_put_block_group(cache);
> >>>> +                    goto skip;
> >>>> +            }
> >>>> +
> >>>>               if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
> >>>>                       spin_lock(&cache->lock);
> >>>>                       if (!cache->to_copy) {
> >>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>>>               dev_replace->item_needs_writeback = 1;
> >>>>               up_write(&dev_replace->rwsem);
> >>>>
> >>>> -            ASSERT(cache->start == chunk_offset);
> >>>>               ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> >>>>                                 dev_extent_len);
> >>>>
> >>>> --
> >>>> 2.35.1
> >>>>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17 13:02         ` Filipe Manana
@ 2022-02-17 13:25           ` Qu Wenruo
  2022-02-18  2:29           ` Su Yue
  1 sibling, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2022-02-17 13:25 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs, Su Yue



On 2022/2/17 21:02, Filipe Manana wrote:
> On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/2/17 20:32, Filipe Manana wrote:
>>> On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2022/2/17 19:14, Filipe Manana wrote:
>>>>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>>>>>> [BUG]
>>>>>> Su reported that with his VM running on an M1 mac, it has a high chance
>>>>>> to trigger the following ASSERT() with scrub and balance test cases:
>>>>>>
>>>>>>                ASSERT(cache->start == chunk_offset);
>>>>>>                ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>>>>>                                  dev_extent_len);
>>>>>>
>>>>>> [CAUSE]
>>>>>> The ASSERT() is making sure that the block group cache (@cache) and the
>>>>>> chunk offset from the device extent match.
>>>>>>
>>>>>> But the truth is, block group caches and dev extent items are deleted at
>>>>>> different timing.
>>>>>>
>>>>>> For block group caches and items, after btrfs_remove_block_group() they
>>>>>> will be marked deleted, but the device extent and the chunk item is
>>>>>> still in dev tree and chunk tree respectively.
>>>>>
>>>>> This is not correct, what happens is:
>>>>>
>>>>> 1) btrfs_remove_chunk() will remove the device extent items and the chunk
>>>>>       item;
>>>>>
>>>>> 2) It then calls btrfs_remove_block_group(). This will not remove the
>>>>>       extent map if the block group was frozen (due to trim, and scrub
>>>>>       itself). But it will remove the block group (struct btrfs_block_group)
>>>>>       from the red black tree of block groups, and mark the block group
>>>>>       as deleted (set struct btrfs_block_group::removed to 1).
>>>>>
>>>>>       If the extent map was not removed, meaning the block group is frozen,
>>>>>       then no one will be able to create a new block group with the same logical
>>>>>       address before the block group is unfrozen (by someone who is holding
>>>>>       a reference on it). So a lookup on the red black tree will return NULL
>>>>>       for the logical address until the block group is unfrozen and its
>>>>>       logical address is reused for a new block group.
>>>>>
>>>>> So the ordering of what you are saying is reversed - the device extent
>>>>> and chunk items are removed before marking the block group as deleted.
>>>>>
>>>>>>
>>>>>> The device extents and chunk items are only deleted in
>>>>>> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
>>>>>> or btrfs_relocate_chunk().
>>>>>
>>>>> This is also confusing because it gives a wrong idea of the ordering.
>>>>> The device extents and chunk items are removed by btrfs_remove_chunk(),
>>>>> which happens before marking a block group as deleted.
>>>>>
>>>>>>
>>>>>> This timing difference leaves a window where we can have a mismatch
>>>>>> between block group cache and device extent tree, and triggering the
>>>>>> ASSERT().
>>>>>
>>>>> I would like to see more details on this. The sequence of steps between
>>>>> two tasks that result in this assertion being triggered.
>>>>>
>>>>>>
>>>>>> [FIX]
>>>>>>
>>>>>> - Remove the ASSERT()
>>>>>>
>>>>>> - Add a quick exit if our found bg doesn't match @chunk_offset
>>>>>
>>>>> This shouldn't happen. I would like to know the sequence of steps
>>>>> between 2 (or more) tasks that leads to that.
>>>>>
>>>>> We are getting the block group with btrfs_lookup_block_group() at
>>>>> scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
>>>>> with the "contains" argument set to 1, meaning it can return a
>>>>> block group that contains the given bytenr but does not start at
>>>>> that bytenr necessarily.
>>>>>
>>>>> This gives me the idea a small block group was deleted and then a
>>>>> new one was allocated which starts at a lower logical offset and
>>>>> includes "chunk_offset" (ends beyond that).
>>>>>
>>>>> We should probably be using btrfs_lookup_first_block_group() at
>>>>> scrub_enumerate_chunks(), so it looks for a block group that starts
>>>>> exactly at the given logical address.
>>>>>
>>>>> But anyway, as I read this and see the patch's diff, this feels a lot
>>>>> like a "bail out if something unexpected happens but we don't know
>>>>> exactly why/how that is possible".
>>>>
>>>> You're completely correct.
>>>>
>>>> Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all
>>>> my ARM VMs are too slow to reproduce.
>>>>
>>>> Furthermore, due to some unrelated bugs, v5.17-rc based kernels all
>>>> failed to boot inside the VMs on Apple M1.
>>>>
>>>> So unfortunately I don't have extra details for now.
>>>>
>>>> Will find a way to reproduce first, then update the patch with better
>>>> comments.
>>>
>>> Was it triggered with some specific test case from fstests?
>>
>> Yes, from a regular run on fstests.
> 
> Ok, but which specific test(s)?

No mention, but I guess it's inside scrub + replace group.

> 
>>
>> But that was on an older branch (some older misc-next, which has the
>> patch but still based on v5.16).
>>
>> The v5.17-rc based branch will not boot inside the ARM vm at all, and no
>> early boot messages after the EFI routine.
>>
>> Will attack the boot problem first.
>>
>> Thanks,
>> Qu
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>>>
>>>>>> - Add a comment on the existing checks in scrub_chunk()
>>>>>>      This is the ultimate safenet, as it will iterate through all the stripes
>>>>>>      of the found chunk.
>>>>>>      And only scrub the stripe if it matches both device and physical
>>>>>>      offset.
>>>>>>
>>>>>>      So even by some how we got a dev extent which is no longer owned
>>>>>>      by the target chunk, we will just skip this chunk completely, without
>>>>>>      any consequence.
>>>>>>
>>>>>> Reported-by: Su Yue <l@damenly.su>
>>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>>> ---
>>>>>> Changelog:
>>>>>> v2:
>>>>>> - Add a new quick exit with extra comments
>>>>>>
>>>>>> - Add a new comment in the existing safenet in scrub_chunk()
>>>>>> ---
>>>>>>     fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>>>>>>     1 file changed, 20 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>>>>> index 11089568b287..1c3ed4720ddd 100644
>>>>>> --- a/fs/btrfs/scrub.c
>>>>>> +++ b/fs/btrfs/scrub.c
>>>>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>>>>>>                goto out;
>>>>>>
>>>>>>        map = em->map_lookup;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Due to the delayed modification of device tree, this chunk
>>>>>> +     * may not own this dev extent.
>>>>>
>>>>> This is confusing. What delayed modification, what do you mean by it
>>>>> exactly? Same below, with more details.
>>>>>
>>>>>> +     *
>>>>>> +     * So we need to iterate through all stripes, and only scrub
>>>>>> +     * the stripe which matches both device and physical offset.
>>>>>> +     */
>>>>>>        for (i = 0; i < map->num_stripes; ++i) {
>>>>>>                if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
>>>>>>                    map->stripes[i].physical == dev_offset) {
>>>>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>>>>>                if (!cache)
>>>>>>                        goto skip;
>>>>>>
>>>>>> +            /*
>>>>>> +             * Since dev extents modification is delayed, it's possible
>>>>>> +             * we got a bg which doesn't really own this dev extent.
>>>>>
>>>>> Same as before. Too confusing, what is delayed dev extent modification?
>>>>>
>>>>> Once added, device extents can only be deleted, they are never modified.
>>>>> I think you are referring to the deletions and the fact we use the commit
>>>>> roots to find extents, but that should be a more clear comment, without
>>>>> such level of ambiguity.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>> +             *
>>>>>> +             * Here just do a quick skip, scrub_chunk() has a more
>>>>>> +             * comprehensive check in it.
>>>>>> +             */
>>>>>> +            if (cache->start != chunk_offset) {
>>>>>> +                    btrfs_put_block_group(cache);
>>>>>> +                    goto skip;
>>>>>> +            }
>>>>>> +
>>>>>>                if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>>>>>>                        spin_lock(&cache->lock);
>>>>>>                        if (!cache->to_copy) {
>>>>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>>>>>                dev_replace->item_needs_writeback = 1;
>>>>>>                up_write(&dev_replace->rwsem);
>>>>>>
>>>>>> -            ASSERT(cache->start == chunk_offset);
>>>>>>                ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>>>>>                                  dev_extent_len);
>>>>>>
>>>>>> --
>>>>>> 2.35.1
>>>>>>
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17 13:02         ` Filipe Manana
  2022-02-17 13:25           ` Qu Wenruo
@ 2022-02-18  2:29           ` Su Yue
  1 sibling, 0 replies; 9+ messages in thread
From: Su Yue @ 2022-02-18  2:29 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, Qu Wenruo, linux-btrfs


On Thu 17 Feb 2022 at 13:02, Filipe Manana <fdmanana@kernel.org> 
wrote:

> On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo 
> <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/2/17 20:32, Filipe Manana wrote:
>> > On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo 
>> > <quwenruo.btrfs@gmx.com> wrote:
>> >>
>> >>
>> >>
>> >> On 2022/2/17 19:14, Filipe Manana wrote:
>> >>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>> >>>> [BUG]
>> >>>> Su reported that with his VM running on an M1 mac, it has 
>> >>>> a high chance
>> >>>> to trigger the following ASSERT() with scrub and balance 
>> >>>> test cases:
>> >>>>
>> >>>>               ASSERT(cache->start == chunk_offset);
>> >>>>               ret = scrub_chunk(sctx, cache, scrub_dev, 
>> >>>>               found_key.offset,
>> >>>>                                 dev_extent_len);
>> >>>>
>> >>>> [CAUSE]
>> >>>> The ASSERT() is making sure that the block group cache 
>> >>>> (@cache) and the
>> >>>> chunk offset from the device extent match.
>> >>>>
>> >>>> But the truth is, block group caches and dev extent items 
>> >>>> are deleted at
>> >>>> different timing.
>> >>>>
>> >>>> For block group caches and items, after 
>> >>>> btrfs_remove_block_group() they
>> >>>> will be marked deleted, but the device extent and the 
>> >>>> chunk item is
>> >>>> still in dev tree and chunk tree respectively.
>> >>>
>> >>> This is not correct, what happens is:
>> >>>
>> >>> 1) btrfs_remove_chunk() will remove the device extent items 
>> >>> and the chunk
>> >>>      item;
>> >>>
>> >>> 2) It then calls btrfs_remove_block_group(). This will not 
>> >>> remove the
>> >>>      extent map if the block group was frozen (due to trim, 
>> >>>      and scrub
>> >>>      itself). But it will remove the block group (struct 
>> >>>      btrfs_block_group)
>> >>>      from the red black tree of block groups, and mark the 
>> >>>      block group
>> >>>      as deleted (set struct btrfs_block_group::removed to 
>> >>>      1).
>> >>>
>> >>>      If the extent map was not removed, meaning the block 
>> >>>      group is frozen,
>> >>>      then no one will be able to create a new block group 
>> >>>      with the same logical
>> >>>      address before the block group is unfrozen (by someone 
>> >>>      who is holding
>> >>>      a reference on it). So a lookup on the red black tree 
>> >>>      will return NULL
>> >>>      for the logical address until the block group is 
>> >>>      unfrozen and its
>> >>>      logical address is reused for a new block group.
>> >>>
>> >>> So the ordering of what you are saying is reversed - the 
>> >>> device extent
>> >>> and chunk items are removed before marking the block group 
>> >>> as deleted.
>> >>>
>> >>>>
>> >>>> The device extents and chunk items are only deleted in
>> >>>> btrfs_remove_chunk(), which happens either a 
>> >>>> btrfs_delete_unused_bgs()
>> >>>> or btrfs_relocate_chunk().
>> >>>
>> >>> This is also confusing because it gives a wrong idea of the 
>> >>> ordering.
>> >>> The device extents and chunk items are removed by 
>> >>> btrfs_remove_chunk(),
>> >>> which happens before marking a block group as deleted.
>> >>>
>> >>>>
>> >>>> This timing difference leaves a window where we can have a 
>> >>>> mismatch
>> >>>> between block group cache and device extent tree, and 
>> >>>> triggering the
>> >>>> ASSERT().
>> >>>
>> >>> I would like to see more details on this. The sequence of 
>> >>> steps between
>> >>> two tasks that result in this assertion being triggered.
>> >>>
>> >>>>
>> >>>> [FIX]
>> >>>>
>> >>>> - Remove the ASSERT()
>> >>>>
>> >>>> - Add a quick exit if our found bg doesn't match 
>> >>>> @chunk_offset
>> >>>
>> >>> This shouldn't happen. I would like to know the sequence of 
>> >>> steps
>> >>> between 2 (or more) tasks that leads to that.
>> >>>
>> >>> We are getting the block group with 
>> >>> btrfs_lookup_block_group() at
>> >>> scrub_enumerate_chunks(), that calls 
>> >>> block_group_cache_tree_search()
>> >>> with the "contains" argument set to 1, meaning it can 
>> >>> return a
>> >>> block group that contains the given bytenr but does not 
>> >>> start at
>> >>> that bytenr necessarily.
>> >>>
>> >>> This gives me the idea a small block group was deleted and 
>> >>> then a
>> >>> new one was allocated which starts at a lower logical 
>> >>> offset and
>> >>> includes "chunk_offset" (ends beyond that).
>> >>>
>> >>> We should probably be using 
>> >>> btrfs_lookup_first_block_group() at
>> >>> scrub_enumerate_chunks(), so it looks for a block group 
>> >>> that starts
>> >>> exactly at the given logical address.
>> >>>
>> >>> But anyway, as I read this and see the patch's diff, this 
>> >>> feels a lot
>> >>> like a "bail out if something unexpected happens but we 
>> >>> don't know
>> >>> exactly why/how that is possible".
>> >>
>> >> You're completely correct.
>> >>
>> >> Unfortunately I have no way to reproduce the bug on x86_64 
>> >> VMs, and all
>> >> my ARM VMs are too slow to reproduce.
>> >>
>> >> Furthermore, due to some unrelated bugs, v5.17-rc based 
>> >> kernels all
>> >> failed to boot inside the VMs on Apple M1.
>> >>
>> >> So unfortunately I don't have extra details for now.
>> >>
>> >> Will find a way to reproduce first, then update the patch 
>> >> with better
>> >> comments.
>> >
>> > Was it triggered with some specific test case from fstests?
>>
>> Yes, from a regular run on fstests.
>
> Ok, but which specific test(s)?
>

It is btrfs/074. It was found during early 16k subpage support 
development
of Qu. Unfortunately dmesg is lost and it can't be reproduced from 
old
branch fetched from `git reflog`.

I'd report back if the issue occurs again.

--
Su
>>
>> But that was on an older branch (some older misc-next, which 
>> has the
>> patch but still based on v5.16).
>>
>> The v5.17-rc based branch will not boot inside the ARM vm at 
>> all, and no
>> early boot messages after the EFI routine.
>>
>> Will attack the boot problem first.
>>
>> Thanks,
>> Qu
>> >
>> >>
>> >> Thanks,
>> >> Qu
>> >>
>> >>>
>> >>>>
>> >>>> - Add a comment on the existing checks in scrub_chunk()
>> >>>>     This is the ultimate safenet, as it will iterate 
>> >>>>     through all the stripes
>> >>>>     of the found chunk.
>> >>>>     And only scrub the stripe if it matches both device 
>> >>>>     and physical
>> >>>>     offset.
>> >>>>
>> >>>>     So even by some how we got a dev extent which is no 
>> >>>>     longer owned
>> >>>>     by the target chunk, we will just skip this chunk 
>> >>>>     completely, without
>> >>>>     any consequence.
>> >>>>
>> >>>> Reported-by: Su Yue <l@damenly.su>
>> >>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> >>>> ---
>> >>>> Changelog:
>> >>>> v2:
>> >>>> - Add a new quick exit with extra comments
>> >>>>
>> >>>> - Add a new comment in the existing safenet in 
>> >>>> scrub_chunk()
>> >>>> ---
>> >>>>    fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>> >>>>    1 file changed, 20 insertions(+), 1 deletion(-)
>> >>>>
>> >>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> >>>> index 11089568b287..1c3ed4720ddd 100644
>> >>>> --- a/fs/btrfs/scrub.c
>> >>>> +++ b/fs/btrfs/scrub.c
>> >>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int 
>> >>>> scrub_chunk(struct scrub_ctx *sctx,
>> >>>>               goto out;
>> >>>>
>> >>>>       map = em->map_lookup;
>> >>>> +
>> >>>> +    /*
>> >>>> +     * Due to the delayed modification of device tree, 
>> >>>> this chunk
>> >>>> +     * may not own this dev extent.
>> >>>
>> >>> This is confusing. What delayed modification, what do you 
>> >>> mean by it
>> >>> exactly? Same below, with more details.
>> >>>
>> >>>> +     *
>> >>>> +     * So we need to iterate through all stripes, and 
>> >>>> only scrub
>> >>>> +     * the stripe which matches both device and physical 
>> >>>> offset.
>> >>>> +     */
>> >>>>       for (i = 0; i < map->num_stripes; ++i) {
>> >>>>               if (map->stripes[i].dev->bdev == 
>> >>>>               scrub_dev->bdev &&
>> >>>>                   map->stripes[i].physical == dev_offset) 
>> >>>>                   {
>> >>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct 
>> >>>> scrub_ctx *sctx,
>> >>>>               if (!cache)
>> >>>>                       goto skip;
>> >>>>
>> >>>> +            /*
>> >>>> +             * Since dev extents modification is delayed, 
>> >>>> it's possible
>> >>>> +             * we got a bg which doesn't really own this 
>> >>>> dev extent.
>> >>>
>> >>> Same as before. Too confusing, what is delayed dev extent 
>> >>> modification?
>> >>>
>> >>> Once added, device extents can only be deleted, they are 
>> >>> never modified.
>> >>> I think you are referring to the deletions and the fact we 
>> >>> use the commit
>> >>> roots to find extents, but that should be a more clear 
>> >>> comment, without
>> >>> such level of ambiguity.
>> >>>
>> >>> Thanks.
>> >>>
>> >>>> +             *
>> >>>> +             * Here just do a quick skip, scrub_chunk() 
>> >>>> has a more
>> >>>> +             * comprehensive check in it.
>> >>>> +             */
>> >>>> +            if (cache->start != chunk_offset) {
>> >>>> +                    btrfs_put_block_group(cache);
>> >>>> +                    goto skip;
>> >>>> +            }
>> >>>> +
>> >>>>               if (sctx->is_dev_replace && 
>> >>>>               btrfs_is_zoned(fs_info)) {
>> >>>>                       spin_lock(&cache->lock);
>> >>>>                       if (!cache->to_copy) {
>> >>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct 
>> >>>> scrub_ctx *sctx,
>> >>>>               dev_replace->item_needs_writeback = 1;
>> >>>>               up_write(&dev_replace->rwsem);
>> >>>>
>> >>>> -            ASSERT(cache->start == chunk_offset);
>> >>>>               ret = scrub_chunk(sctx, cache, scrub_dev, 
>> >>>>               found_key.offset,
>> >>>>                                 dev_extent_len);
>> >>>>
>> >>>> --
>> >>>> 2.35.1
>> >>>>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17  9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
  2022-02-17 11:14 ` Filipe Manana
@ 2022-04-19 20:34 ` David Sterba
  1 sibling, 0 replies; 9+ messages in thread
From: David Sterba @ 2022-04-19 20:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Su Yue

On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
> [BUG]
> Su reported that with his VM running on an M1 mac, it has a high chance
> to trigger the following ASSERT() with scrub and balance test cases:
> 
> 		ASSERT(cache->start == chunk_offset);
> 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> 				  dev_extent_len);
> 
> [CAUSE]
> The ASSERT() is making sure that the block group cache (@cache) and the
> chunk offset from the device extent match.
> 
> But the truth is, block group caches and dev extent items are deleted at
> different timing.
> 
> For block group caches and items, after btrfs_remove_block_group() they
> will be marked deleted, but the device extent and the chunk item is
> still in dev tree and chunk tree respectively.
> 
> The device extents and chunk items are only deleted in
> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
> or btrfs_relocate_chunk().
> 
> This timing difference leaves a window where we can have a mismatch
> between block group cache and device extent tree, and triggering the
> ASSERT().
> 
> [FIX]
> 
> - Remove the ASSERT()
> 
> - Add a quick exit if our found bg doesn't match @chunk_offset
> 
> - Add a comment on the existing checks in scrub_chunk()
>   This is the ultimate safenet, as it will iterate through all the stripes
>   of the found chunk.
>   And only scrub the stripe if it matches both device and physical
>   offset.
> 
>   So even by some how we got a dev extent which is no longer owned
>   by the target chunk, we will just skip this chunk completely, without
>   any consequence.
> 
> Reported-by: Su Yue <l@damenly.su>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Filipe sent a fix that's slightly different, as you haven't sent an
update I'll take his patch.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-04-19 20:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-17  9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
2022-02-17 11:14 ` Filipe Manana
2022-02-17 11:24   ` Qu Wenruo
2022-02-17 12:32     ` Filipe Manana
2022-02-17 13:00       ` Qu Wenruo
2022-02-17 13:02         ` Filipe Manana
2022-02-17 13:25           ` Qu Wenruo
2022-02-18  2:29           ` Su Yue
2022-04-19 20:34 ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.