stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during
@ 2023-04-10  2:22 Qu Wenruo
  2023-04-10  4:20 ` Anand Jain
  2023-04-17 22:22 ` David Sterba
  0 siblings, 2 replies; 7+ messages in thread
From: Qu Wenruo @ 2023-04-10  2:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

This is for pre-6.4 kernels, as scrub code goes through a huge rework.

[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replace and let it
finish just as there is nothing wrong:

 BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
 BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
 BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
 BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
 BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
 BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
 BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
 BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
 BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished

This can lead to unexpected problems for the result fs.

[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.

But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.

And even if we checks the progress, we can not really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).

This behavior is there from the very beginning.

[FIX]
Instead of continue the replace, just error out if we hit an unrepaired
metadata sector.

Now the dev-replace would be rejected with -EIO, to inform the user.
Although it also means, the fs has some metadata error which can not be
repaired, the user would be super upset anyway.

The new dmesg would look like this:

 BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
 BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
 BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
 BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
 BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
 BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
 BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
 BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5

CC: stable@vger.kernel.org
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
I'm not sure how should we merge this patch.

The misc-next is already merging the new scrub code, but the problem is
there for all old kernels thus we need such fixes.

Maybe we can merge this fix before the scrub rework, then the rework,
and finally the better fix using reworked interface?
---
 fs/btrfs/scrub.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ef4046a2572c..71f64b9bcd9f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -195,6 +195,7 @@ struct scrub_ctx {
 	struct mutex            wr_lock;
 	struct btrfs_device     *wr_tgtdev;
 	bool                    flush_all_writes;
+	bool			has_meta_failed;
 
 	/*
 	 * statistics
@@ -1380,6 +1381,8 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		btrfs_err_rl_in_rcu(fs_info,
 			"unable to fixup (regular) error at logical %llu on dev %s",
 			logical, btrfs_dev_name(dev));
+		if (is_metadata)
+			sctx->has_meta_failed = true;
 	}
 
 out:
@@ -3838,6 +3841,12 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 
 	blk_finish_plug(&plug);
 
+	/*
+	 * If we have metadata unable to be repaired, we should error
+	 * out the dev-replace.
+	 */
+	if (sctx->is_dev_replace && sctx->has_meta_failed && ret >= 0)
+		ret = -EIO;
 	if (sctx->is_dev_replace && ret >= 0) {
 		int ret2;
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during
  2023-04-10  2:22 [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during Qu Wenruo
@ 2023-04-10  4:20 ` Anand Jain
  2023-04-10  6:42   ` Qu Wenruo
  2023-04-17 22:22 ` David Sterba
  1 sibling, 1 reply; 7+ messages in thread
From: Anand Jain @ 2023-04-10  4:20 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: stable

On 10/4/23 10:22, Qu Wenruo wrote:
> This is for pre-6.4 kernels, as scrub code goes through a huge rework.
> 
> [BUG]
> Even before the scrub rework, if we have some corrupted metadata failed
> to be repaired during replace, we still continue replace and let it
> finish just as there is nothing wrong:
> 
>   BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>   BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
>   BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
>   BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
>   BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
>   BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
>   BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
> 
> This can lead to unexpected problems for the result fs.
> 
> [CAUSE]
> Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
> 
> But unlike scrub, dev-replace doesn't really bother to check the scrub
> progress, which records all the errors found during replace.
> 
> And even if we checks the progress, we can not really determine which
> errors are minor, which are critical just by the plain numbers.
> (remember we don't treat metadata/data checksum error differently).
> 
> This behavior is there from the very beginning.
> 
> [FIX]
> Instead of continue the replace, just error out if we hit an unrepaired
> metadata sector.
> 
> Now the dev-replace would be rejected with -EIO, to inform the user.
> Although it also means, the fs has some metadata error which can not be
> repaired, the user would be super upset anyway.

IMO, the original design is fair as it does not capture scrub errors
during the replace. Because the purpose of the scrub is different from
the replace from the user POV.
However, after the replace, if scrubbed it will still capture any
errors? No?

Thanks, Anand


> 
> The new dmesg would look like this:
> 
>   BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>   BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
>   BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
>   BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
>   BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
>   BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
> 
> CC: stable@vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> I'm not sure how should we merge this patch.
> 
> The misc-next is already merging the new scrub code, but the problem is
> there for all old kernels thus we need such fixes.
> 
> Maybe we can merge this fix before the scrub rework, then the rework,
> and finally the better fix using reworked interface?
> ---
>   fs/btrfs/scrub.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ef4046a2572c..71f64b9bcd9f 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -195,6 +195,7 @@ struct scrub_ctx {
>   	struct mutex            wr_lock;
>   	struct btrfs_device     *wr_tgtdev;
>   	bool                    flush_all_writes;
> +	bool			has_meta_failed;
>   
>   	/*
>   	 * statistics
> @@ -1380,6 +1381,8 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>   		btrfs_err_rl_in_rcu(fs_info,
>   			"unable to fixup (regular) error at logical %llu on dev %s",
>   			logical, btrfs_dev_name(dev));
> +		if (is_metadata)
> +			sctx->has_meta_failed = true;
>   	}
>   
>   out:
> @@ -3838,6 +3841,12 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
>   
>   	blk_finish_plug(&plug);
>   
> +	/*
> +	 * If we have metadata unable to be repaired, we should error
> +	 * out the dev-replace.
> +	 */
> +	if (sctx->is_dev_replace && sctx->has_meta_failed && ret >= 0)
> +		ret = -EIO;
>   	if (sctx->is_dev_replace && ret >= 0) {
>   		int ret2;
>   


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during
  2023-04-10  4:20 ` Anand Jain
@ 2023-04-10  6:42   ` Qu Wenruo
  2023-04-18 10:40     ` Anand Jain
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2023-04-10  6:42 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, linux-btrfs; +Cc: stable



On 2023/4/10 12:20, Anand Jain wrote:
> On 10/4/23 10:22, Qu Wenruo wrote:
>> This is for pre-6.4 kernels, as scrub code goes through a huge rework.
>>
>> [BUG]
>> Even before the scrub rework, if we have some corrupted metadata failed
>> to be repaired during replace, we still continue replace and let it
>> finish just as there is nothing wrong:
>>
>>   BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 
>> (devid 1) to /dev/mapper/test-scratch2 started
>>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
>> csum, has 0x00000000 want 0xade80ca1
>>   BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad 
>> csum, has 0x00000000 want 0xade80ca1
>>   BTRFS warning (device dm-4): checksum error at logical 5578752 on 
>> dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 
>> 0) in tree 5
>>   BTRFS warning (device dm-4): checksum error at logical 5578752 on 
>> dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 
>> 0) in tree 5
>>   BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 
>> 0, rd 0, flush 0, corrupt 1, gen 0
>>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
>> bytenr, has 0 want 5578752
>>   BTRFS error (device dm-4): unable to fixup (regular) error at 
>> logical 5578752 on dev /dev/mapper/test-scratch1
>>   BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 
>> (devid 1) to /dev/mapper/test-scratch2 finished
>>
>> This can lead to unexpected problems for the result fs.
>>
>> [CAUSE]
>> Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
>>
>> But unlike scrub, dev-replace doesn't really bother to check the scrub
>> progress, which records all the errors found during replace.
>>
>> And even if we checks the progress, we can not really determine which
>> errors are minor, which are critical just by the plain numbers.
>> (remember we don't treat metadata/data checksum error differently).
>>
>> This behavior is there from the very beginning.
>>
>> [FIX]
>> Instead of continue the replace, just error out if we hit an unrepaired
>> metadata sector.
>>
>> Now the dev-replace would be rejected with -EIO, to inform the user.
>> Although it also means, the fs has some metadata error which can not be
>> repaired, the user would be super upset anyway.
> 
> IMO, the original design is fair as it does not capture scrub errors
> during the replace. Because the purpose of the scrub is different from
> the replace from the user POV.

The problem is, after such replace, the corrupted metadata would have 
different content (we just don't do the writeback at all).
Even worse, the end user is not even aware of the problem, unless dmesg 
is manually checked.

This means we changed the result fs during the replace, which removes 
the tiny chance to do a manual repair (aka, manually re-generate the 
checksum).

> However, after the replace, if scrubbed it will still capture any
> errors? No?

It's not about scrub after scrub. Such replace should not even finish.

Thanks,
Qu
> 
> Thanks, Anand
> 
> 
>>
>> The new dmesg would look like this:
>>
>>   BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 
>> (devid 1) to /dev/mapper/test-scratch2 started
>>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
>> csum, has 0x00000000 want 0xade80ca1
>>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
>> csum, has 0x00000000 want 0xade80ca1
>>   BTRFS error (device dm-4): unable to fixup (regular) error at 
>> logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
>>   BTRFS warning (device dm-4): header error at logical 5570560 on dev 
>> /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) 
>> in tree 5
>>   BTRFS warning (device dm-4): header error at logical 5570560 on dev 
>> /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) 
>> in tree 5
>>   BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata 
>> sector at 5578752
>>   BTRFS error (device dm-4): 
>> btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, 
>> /dev/mapper/test-scratch2) failed -5
>>
>> CC: stable@vger.kernel.org
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> I'm not sure how should we merge this patch.
>>
>> The misc-next is already merging the new scrub code, but the problem is
>> there for all old kernels thus we need such fixes.
>>
>> Maybe we can merge this fix before the scrub rework, then the rework,
>> and finally the better fix using reworked interface?
>> ---
>>   fs/btrfs/scrub.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index ef4046a2572c..71f64b9bcd9f 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -195,6 +195,7 @@ struct scrub_ctx {
>>       struct mutex            wr_lock;
>>       struct btrfs_device     *wr_tgtdev;
>>       bool                    flush_all_writes;
>> +    bool            has_meta_failed;
>>       /*
>>        * statistics
>> @@ -1380,6 +1381,8 @@ static int scrub_handle_errored_block(struct 
>> scrub_block *sblock_to_check)
>>           btrfs_err_rl_in_rcu(fs_info,
>>               "unable to fixup (regular) error at logical %llu on dev 
>> %s",
>>               logical, btrfs_dev_name(dev));
>> +        if (is_metadata)
>> +            sctx->has_meta_failed = true;
>>       }
>>   out:
>> @@ -3838,6 +3841,12 @@ static noinline_for_stack int 
>> scrub_stripe(struct scrub_ctx *sctx,
>>       blk_finish_plug(&plug);
>> +    /*
>> +     * If we have metadata unable to be repaired, we should error
>> +     * out the dev-replace.
>> +     */
>> +    if (sctx->is_dev_replace && sctx->has_meta_failed && ret >= 0)
>> +        ret = -EIO;
>>       if (sctx->is_dev_replace && ret >= 0) {
>>           int ret2;
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during
  2023-04-10  2:22 [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during Qu Wenruo
  2023-04-10  4:20 ` Anand Jain
@ 2023-04-17 22:22 ` David Sterba
  1 sibling, 0 replies; 7+ messages in thread
From: David Sterba @ 2023-04-17 22:22 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Mon, Apr 10, 2023 at 10:22:57AM +0800, Qu Wenruo wrote:
> This is for pre-6.4 kernels, as scrub code goes through a huge rework.
> 
> [BUG]
> Even before the scrub rework, if we have some corrupted metadata failed
> to be repaired during replace, we still continue replace and let it
> finish just as there is nothing wrong:
> 
>  BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
>  BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>  BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
>  BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
>  BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
>  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>  BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
>  BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
>  BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
> 
> This can lead to unexpected problems for the result fs.
> 
> [CAUSE]
> Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
> 
> But unlike scrub, dev-replace doesn't really bother to check the scrub
> progress, which records all the errors found during replace.
> 
> And even if we checks the progress, we can not really determine which
> errors are minor, which are critical just by the plain numbers.
> (remember we don't treat metadata/data checksum error differently).
> 
> This behavior is there from the very beginning.
> 
> [FIX]
> Instead of continue the replace, just error out if we hit an unrepaired
> metadata sector.
> 
> Now the dev-replace would be rejected with -EIO, to inform the user.
> Although it also means, the fs has some metadata error which can not be
> repaired, the user would be super upset anyway.
> 
> The new dmesg would look like this:
> 
>  BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
>  BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>  BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>  BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
>  BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
>  BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
>  BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
>  BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
> 
> CC: stable@vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> I'm not sure how should we merge this patch.
> 
> The misc-next is already merging the new scrub code, but the problem is
> there for all old kernels thus we need such fixes.
> 
> Maybe we can merge this fix before the scrub rework, then the rework,
> and finally the better fix using reworked interface?

Rebasing the whole 6.4 queue with the scrub rewrite would be too much
and there's no time left for that before merge window. We'd also need to
retest it after such change.

If we have the fix in master we can do a backport to older stable tree,
in this case it would not be close implementation-wise but the effects
should be the same. Doing two separate fixes will also avoid merge
conflicts.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during
  2023-04-10  6:42   ` Qu Wenruo
@ 2023-04-18 10:40     ` Anand Jain
  0 siblings, 0 replies; 7+ messages in thread
From: Anand Jain @ 2023-04-18 10:40 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs; +Cc: stable

On 10/4/23 14:42, Qu Wenruo wrote:
> 
> 
> On 2023/4/10 12:20, Anand Jain wrote:
>> On 10/4/23 10:22, Qu Wenruo wrote:
>>> This is for pre-6.4 kernels, as scrub code goes through a huge rework.
>>>
>>> [BUG]
>>> Even before the scrub rework, if we have some corrupted metadata failed
>>> to be repaired during replace, we still continue replace and let it
>>> finish just as there is nothing wrong:
>>>
>>>   BTRFS info (device dm-4): dev_replace from 
>>> /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
>>>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
>>> csum, has 0x00000000 want 0xade80ca1
>>>   BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad 
>>> csum, has 0x00000000 want 0xade80ca1
>>>   BTRFS warning (device dm-4): checksum error at logical 5578752 on 
>>> dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 
>>> 0) in tree 5
>>>   BTRFS warning (device dm-4): checksum error at logical 5578752 on 
>>> dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 
>>> 0) in tree 5
>>>   BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 
>>> 0, rd 0, flush 0, corrupt 1, gen 0
>>>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
>>> bytenr, has 0 want 5578752
>>>   BTRFS error (device dm-4): unable to fixup (regular) error at 
>>> logical 5578752 on dev /dev/mapper/test-scratch1
>>>   BTRFS info (device dm-4): dev_replace from 
>>> /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 
>>> finished
>>>
>>> This can lead to unexpected problems for the result fs.
>>>
>>> [CAUSE]
>>> Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
>>>
>>> But unlike scrub, dev-replace doesn't really bother to check the scrub
>>> progress, which records all the errors found during replace.
>>>
>>> And even if we checks the progress, we can not really determine which
>>> errors are minor, which are critical just by the plain numbers.
>>> (remember we don't treat metadata/data checksum error differently).
>>>
>>> This behavior is there from the very beginning.
>>>
>>> [FIX]
>>> Instead of continue the replace, just error out if we hit an unrepaired
>>> metadata sector.
>>>
>>> Now the dev-replace would be rejected with -EIO, to inform the user.
>>> Although it also means, the fs has some metadata error which can not be
>>> repaired, the user would be super upset anyway.
>>
>> IMO, the original design is fair as it does not capture scrub errors
>> during the replace. Because the purpose of the scrub is different from
>> the replace from the user POV.
> 
> The problem is, after such replace, the corrupted metadata would have 
> different content (we just don't do the writeback at all).
> Even worse, the end user is not even aware of the problem, unless dmesg 
> is manually checked.
> 
> This means we changed the result fs during the replace, which removes 
> the tiny chance to do a manual repair (aka, manually re-generate the 
> checksum).
My concern is, following this patch, the user will be able to replace
the device only if the filesystem is healthy (passes scrub). But, I got
it, there isn't any other choice.

Thanks, Anand

> 
>> However, after the replace, if scrubbed it will still capture any
>> errors? No?
> 
> It's not about scrub after scrub. Such replace should not even finish.
> 
> Thanks,
> Qu
>>
>> Thanks, Anand
>>
>>
>>>
>>> The new dmesg would look like this:
>>>
>>>   BTRFS info (device dm-4): dev_replace from 
>>> /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
>>>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
>>> csum, has 0x00000000 want 0xade80ca1
>>>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad 
>>> csum, has 0x00000000 want 0xade80ca1
>>>   BTRFS error (device dm-4): unable to fixup (regular) error at 
>>> logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
>>>   BTRFS warning (device dm-4): header error at logical 5570560 on dev 
>>> /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) 
>>> in tree 5
>>>   BTRFS warning (device dm-4): header error at logical 5570560 on dev 
>>> /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) 
>>> in tree 5
>>>   BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata 
>>> sector at 5578752
>>>   BTRFS error (device dm-4): 
>>> btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, 
>>> /dev/mapper/test-scratch2) failed -5
>>>
>>> CC: stable@vger.kernel.org
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> I'm not sure how should we merge this patch.
>>>
>>> The misc-next is already merging the new scrub code, but the problem is
>>> there for all old kernels thus we need such fixes.
>>>
>>> Maybe we can merge this fix before the scrub rework, then the rework,
>>> and finally the better fix using reworked interface?
>>> ---
>>>   fs/btrfs/scrub.c | 9 +++++++++
>>>   1 file changed, 9 insertions(+)
>>>
>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>> index ef4046a2572c..71f64b9bcd9f 100644
>>> --- a/fs/btrfs/scrub.c
>>> +++ b/fs/btrfs/scrub.c
>>> @@ -195,6 +195,7 @@ struct scrub_ctx {
>>>       struct mutex            wr_lock;
>>>       struct btrfs_device     *wr_tgtdev;
>>>       bool                    flush_all_writes;
>>> +    bool            has_meta_failed;
>>>       /*
>>>        * statistics
>>> @@ -1380,6 +1381,8 @@ static int scrub_handle_errored_block(struct 
>>> scrub_block *sblock_to_check)
>>>           btrfs_err_rl_in_rcu(fs_info,
>>>               "unable to fixup (regular) error at logical %llu on dev 
>>> %s",
>>>               logical, btrfs_dev_name(dev));
>>> +        if (is_metadata)
>>> +            sctx->has_meta_failed = true;
>>>       }
>>>   out:
>>> @@ -3838,6 +3841,12 @@ static noinline_for_stack int 
>>> scrub_stripe(struct scrub_ctx *sctx,
>>>       blk_finish_plug(&plug);
>>> +    /*
>>> +     * If we have metadata unable to be repaired, we should error
>>> +     * out the dev-replace.
>>> +     */
>>> +    if (sctx->is_dev_replace && sctx->has_meta_failed && ret >= 0)
>>> +        ret = -EIO;
>>>       if (sctx->is_dev_replace && ret >= 0) {
>>>           int ret2;
>>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during
  2023-04-19 20:23 Torstein Eide
@ 2023-04-19 23:07 ` Qu Wenruo
  0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2023-04-19 23:07 UTC (permalink / raw)
  To: Torstein Eide, anand.jain; +Cc: linux-btrfs, stable, Qu Wenruo



On 2023/4/20 04:23, Torstein Eide wrote:
>> This is for pre-6.4 kernels, as scrub code goes through a huge rework.
>>
>> [BUG]
>> Even before the scrub rework, if we have some corrupted metadata failed
>> to be repaired during replace, we still continue replace and let it
>> finish just as there is nothing wrong:
>>
>>    BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
>>    BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>>    BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
>>    BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
>>    BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
>>    BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>>    BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
>>    BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
>>    BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
>>
>> This can lead to unexpected problems for the result fs.
>>
>> [CAUSE]
>> Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
>>
>> But unlike scrub, dev-replace doesn't really bother to check the scrub
>> progress, which records all the errors found during replace.
>>
>> And even if we checks the progress, we can not really determine which
>> errors are minor, which are critical just by the plain numbers.
>> (remember we don't treat metadata/data checksum error differently).
>>
>> This behavior is there from the very beginning.
>>
>> [FIX]
>> Instead of continue the replace, just error out if we hit an unrepaired
>> metadata sector.
>>
>> Now the dev-replace would be rejected with -EIO, to inform the user.
>> Although it also means, the fs has some metadata error which can not be
>> repaired, the user would be super upset anyway.
> 
> If one sector is bad in metadata how much secondary data is damaged?

If it's metadata, it's highly possible the metadata has extra copy by 
default.
(Single disk btrfs goes DUP for meta, while multi-disk one goes RAID1 by 
default).

And if it's a metadata corruption, good luck it's not a critical one, or 
this would be your last time to mount the fs.

> 
> As someone who recently had to remove, and currently replace a disk.
> it is upsetting, if it stopped if we stopped because 0,01% of data is
> unrepairable, if we can save the other 99,99%. Can we have it
> continue, print an error message to standard out, and a way for the
> user to delete or copy it (with som option like -force-delete or
> --force-copy) with the error to the new disk?
> "Metadata at block 5578752 is damaged and unrepaired. Skipping. Read
> `man btrfs-replace` for more info. "
> At the end if possible, list files affected by the damaged metadata blocks.
> 
> In man answer:
> How can the user know what files are connected to the metadata?
> How can a user decide what to do with the damaged metadata?
> 
> 
> At minimum,  can there be some useful info to the info to the error output? like
> "Replace has stopped, due to reading unrepaired metadata block, was
> working on block 5578752, se `dmesg` for more details. (\s Sorry but
> you are currently f..k)"

I'd say, if your metadata is already corrupted, you don't want a 
"generic" solution, but ask for a developer to guide you how to recover.

Thanks,
Qu

> 
> 
> 
>>
>> The new dmesg would look like this:
>>
>>    BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
>>    BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>>    BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>>    BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
>>    BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
>>    BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
>>    BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
>>    BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
>>
>> CC: stable@vger.kernel.org
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> I'm not sure how should we merge this patch.
>>
>> The misc-next is already merging the new scrub code, but the problem is
>> there for all old kernels thus we need such fixes.
>>
>> Maybe we can merge this fix before the scrub rework, then the rework,
>> and finally the better fix using reworked interface?
>> ---
>>    fs/btrfs/scrub.c | 9 +++++++++
>>    1 file changed, 9 insertions(+)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index ef4046a2572c..71f64b9bcd9f 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -195,6 +195,7 @@ struct scrub_ctx {
>>    struct mutex            wr_lock;
>>    struct btrfs_device     *wr_tgtdev;
>>    bool                    flush_all_writes;
>> + bool has_meta_failed;
>>
>>    /*
>>    * statistics
>> @@ -1380,6 +1381,8 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>    btrfs_err_rl_in_rcu(fs_info,
>>    "unable to fixup (regular) error at logical %llu on dev %s",
>>    logical, btrfs_dev_name(dev));
>> + if (is_metadata)
>> + sctx->has_meta_failed = true;
>>    }
>>
>>    out:
>> @@ -3838,6 +3841,12 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
>>
>>    blk_finish_plug(&plug);
>>
>> + /*
>> + * If we have metadata unable to be repaired, we should error
>> + * out the dev-replace.
>> + */
>> + if (sctx->is_dev_replace && sctx->has_meta_failed && ret >= 0)
>> + ret = -EIO;
>>    if (sctx->is_dev_replace && ret >= 0) {
>>    int ret2;
>>
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during
@ 2023-04-19 20:23 Torstein Eide
  2023-04-19 23:07 ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Torstein Eide @ 2023-04-19 20:23 UTC (permalink / raw)
  To: anand.jain; +Cc: linux-btrfs, stable, Qu Wenruo

> This is for pre-6.4 kernels, as scrub code goes through a huge rework.
>
> [BUG]
> Even before the scrub rework, if we have some corrupted metadata failed
> to be repaired during replace, we still continue replace and let it
> finish just as there is nothing wrong:
>
>   BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>   BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
>   BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
>   BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
>   BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
>   BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
>   BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
>
> This can lead to unexpected problems for the result fs.
>
> [CAUSE]
> Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
>
> But unlike scrub, dev-replace doesn't really bother to check the scrub
> progress, which records all the errors found during replace.
>
> And even if we checks the progress, we can not really determine which
> errors are minor, which are critical just by the plain numbers.
> (remember we don't treat metadata/data checksum error differently).
>
> This behavior is there from the very beginning.
>
> [FIX]
> Instead of continue the replace, just error out if we hit an unrepaired
> metadata sector.
>
> Now the dev-replace would be rejected with -EIO, to inform the user.
> Although it also means, the fs has some metadata error which can not be
> repaired, the user would be super upset anyway.

If one sector is bad in metadata how much secondary data is damaged?

As someone who recently had to remove, and currently replace a disk.
it is upsetting, if it stopped if we stopped because 0,01% of data is
unrepairable, if we can save the other 99,99%. Can we have it
continue, print an error message to standard out, and a way for the
user to delete or copy it (with som option like -force-delete or
--force-copy) with the error to the new disk?
"Metadata at block 5578752 is damaged and unrepaired. Skipping. Read
`man btrfs-replace` for more info. "
At the end if possible, list files affected by the damaged metadata blocks.

In man answer:
How can the user know what files are connected to the metadata?
How can a user decide what to do with the damaged metadata?


At minimum,  can there be some useful info to the info to the error output? like
"Replace has stopped, due to reading unrepaired metadata block, was
working on block 5578752, se `dmesg` for more details. (\s Sorry but
you are currently f..k)"



>
> The new dmesg would look like this:
>
>   BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>   BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
>   BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
>   BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
>   BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
>   BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
>   BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
>
> CC: stable@vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> I'm not sure how should we merge this patch.
>
> The misc-next is already merging the new scrub code, but the problem is
> there for all old kernels thus we need such fixes.
>
> Maybe we can merge this fix before the scrub rework, then the rework,
> and finally the better fix using reworked interface?
> ---
>   fs/btrfs/scrub.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ef4046a2572c..71f64b9bcd9f 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -195,6 +195,7 @@ struct scrub_ctx {
>   struct mutex            wr_lock;
>   struct btrfs_device     *wr_tgtdev;
>   bool                    flush_all_writes;
> + bool has_meta_failed;
>
>   /*
>   * statistics
> @@ -1380,6 +1381,8 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>   btrfs_err_rl_in_rcu(fs_info,
>   "unable to fixup (regular) error at logical %llu on dev %s",
>   logical, btrfs_dev_name(dev));
> + if (is_metadata)
> + sctx->has_meta_failed = true;
>   }
>
>   out:
> @@ -3838,6 +3841,12 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
>
>   blk_finish_plug(&plug);
>
> + /*
> + * If we have metadata unable to be repaired, we should error
> + * out the dev-replace.
> + */
> + if (sctx->is_dev_replace && sctx->has_meta_failed && ret >= 0)
> + ret = -EIO;
>   if (sctx->is_dev_replace && ret >= 0) {
>   int ret2;
>


-- 
Torstein Eide
Torsteine@gmail.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-04-19 23:08 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-10  2:22 [PATCH pre-6.4] btrfs: dev-replace: error out if we have unrepaired metadata error during Qu Wenruo
2023-04-10  4:20 ` Anand Jain
2023-04-10  6:42   ` Qu Wenruo
2023-04-18 10:40     ` Anand Jain
2023-04-17 22:22 ` David Sterba
2023-04-19 20:23 Torstein Eide
2023-04-19 23:07 ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).