All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
@ 2020-01-23  7:37 Qu Wenruo
  2020-01-23 12:06 ` Filipe Manana
  2020-01-23 16:40 ` David Sterba
  0 siblings, 2 replies; 17+ messages in thread
From: Qu Wenruo @ 2020-01-23  7:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.

The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.

The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
check free space before marking a block group RO")

[CAUSE]
Dev-replace has two source of writes:
- Write duplication
  All writes to source device will also be duplicated to target device.

  Content:	Latest data/meta

- Scrub copy
  Dev-replace reused scrub code to iterate through existing extents, and
  copy the verified data to target device.

  Content:	Data/meta in commit root

The difference in contents makes the following race possible:
	Regular Writer		|	Dev-replace
-----------------------------------------------------------------
  ^                             |
  | Preallocate one data extent |
  | at bytenr X, len 1M		|
  v				|
  ^ Commit transaction		|
  | Now extent [X, X+1M) is in  |
  v commit root			|
 ================== Dev replace starts =========================
  				| ^
				| | Scrub extent [X, X+1M)
				| | Read [X, X+1M)
				| | (The content are mostly garbage
				| |  since it's preallocated)
  ^				| v
  | Write back happens for	|
  | extent [X, X+512K)		|
  | New data writes to both	|
  | source and target dev.	|
  v				|
				| ^
				| | Scrub writes back extent [X, X+1M)
				| | to target device.
				| | This will over write the new data in
				| | [X, X+512K)
				| v

This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous trans
(in commit root).

This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.

*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
		   -f collapse=0 -f punch=0 -f resvsp=0"
   I didn't expect resvsp ioctl will fallback to fallocate in VFS...

[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().

This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.

ENOSPC for dev-replace is still much better than data corruption.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
RFC->v1:
- Remove the RFC tag
  Since the cause is pinned and verified, no need for RFC.

- Only wait for nocow writes for dev-replace
  CoW writes are safe as they will never overwrite extents in commit
  root.

- Put the wait call into proper lock context
  Previous wait happens after scrub_pause_off(), which can cause
  deadlock where we may need to commit transaction in one of the
  wait calls. But since we are in scrub_pause_off() environment,
  transaction commit will wait us to continue, causing a wait-on-self
  deadlock.
---
 fs/btrfs/scrub.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 21de630b0730..5aa486cad298 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 * This can easily boost the amount of SYSTEM chunks if cleaner
 		 * thread can't be triggered fast enough, and use up all space
 		 * of btrfs_super_block::sys_chunk_array
+		 *
+		 * While for dev replace, we need to try our best to mark block
+		 * group RO, to prevent race between:
+		 * - Write duplication
+		 *   Contains latest data
+		 * - Scrub copy
+		 *   Contains data from commit tree
+		 *
+		 * If target block group is not marked RO, nocow writes can
+		 * be overwritten by scrub copy, causing data corruption.
+		 * So for dev-replace, it's not allowed to continue if a block
+		 * group is not RO.
 		 */
-		ret = btrfs_inc_block_group_ro(cache, false);
-		scrub_pause_off(fs_info);
-
+		ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
 		if (ret == 0) {
 			ro_set = 1;
-		} else if (ret == -ENOSPC) {
+		} else if (ret == -ENOSPC && !sctx->is_dev_replace) {
 			/*
 			 * btrfs_inc_block_group_ro return -ENOSPC when it
 			 * failed in creating new chunk for metadata.
-			 * It is not a problem for scrub/replace, because
+			 * It is not a problem for scrub, because
 			 * metadata are always cowed, and our scrub paused
 			 * commit_transactions.
 			 */
@@ -3596,9 +3606,19 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			btrfs_warn(fs_info,
 				   "failed setting block group ro: %d", ret);
 			btrfs_put_block_group(cache);
+			scrub_pause_off(fs_info);
 			break;
 		}
 
+		/*
+		 * Now the target block is marked RO, wait for nocow writes to
+		 * finish before dev-replace.
+		 * COW is fine, as COW never overwrites extents in commit tree.
+		 */
+		if (sctx->is_dev_replace)
+			btrfs_wait_nocow_writers(cache);
+
+		scrub_pause_off(fs_info);
 		down_write(&dev_replace->rwsem);
 		dev_replace->cursor_right = found_key.offset + length;
 		dev_replace->cursor_left = found_key.offset;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-23  7:37 [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace Qu Wenruo
@ 2020-01-23 12:06 ` Filipe Manana
  2020-01-23 12:28   ` Qu WenRuo
  2020-01-23 13:39   ` Qu Wenruo
  2020-01-23 16:40 ` David Sterba
  1 sibling, 2 replies; 17+ messages in thread
From: Filipe Manana @ 2020-01-23 12:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana

On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
> looped runs can lead to random failure, where scrub finds csum error.
>
> The possibility is not high, around 1/20 to 1/100, but it's causing data
> corruption.
>
> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
> check free space before marking a block group RO")
>
> [CAUSE]
> Dev-replace has two source of writes:
> - Write duplication
>   All writes to source device will also be duplicated to target device.
>
>   Content:      Latest data/meta

I find the term "latest" a bit confusing, perhaps "not yet persisted
data and metadata" is more clear.

>
> - Scrub copy
>   Dev-replace reused scrub code to iterate through existing extents, and
>   copy the verified data to target device.
>
>   Content:      Data/meta in commit root

And so here "previously persisted data and metadata".

>
> The difference in contents makes the following race possible:
>         Regular Writer          |       Dev-replace
> -----------------------------------------------------------------
>   ^                             |
>   | Preallocate one data extent |
>   | at bytenr X, len 1M         |
>   v                             |
>   ^ Commit transaction          |
>   | Now extent [X, X+1M) is in  |
>   v commit root                 |
>  ================== Dev replace starts =========================
>                                 | ^
>                                 | | Scrub extent [X, X+1M)
>                                 | | Read [X, X+1M)
>                                 | | (The content are mostly garbage
>                                 | |  since it's preallocated)
>   ^                             | v
>   | Write back happens for      |
>   | extent [X, X+512K)          |
>   | New data writes to both     |
>   | source and target dev.      |
>   v                             |
>                                 | ^
>                                 | | Scrub writes back extent [X, X+1M)
>                                 | | to target device.
>                                 | | This will over write the new data in
>                                 | | [X, X+512K)
>                                 | v
>
> This race can only happen for nocow writes. Thus metadata and data cow
> writes are safe, as COW will never overwrite extents of previous trans
> (in commit root).
>
> This behavior can be confirmed by disabling all fallocate related calls
> in fsstress (*), then all related tests can pass a 2000 run loop.
>
> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
>                    -f collapse=0 -f punch=0 -f resvsp=0"
>    I didn't expect resvsp ioctl will fallback to fallocate in VFS...
>
> [FIX]
> Make dev-replace to require mandatory block group RO, and wait for current
> nocow writes before calling scrub_chunk().
>
> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
> when set_block_ro failed") for dev-replace path.
>
> ENOSPC for dev-replace is still much better than data corruption.

Technically if we flag the block group RO without being able to
persist that due to ENOSPC, it's still ok.
We just want to prevent nocow writes racing with scrub copying
procedure. But that's something for some other time, and this is fine
to me.

>
> Reported-by: Filipe Manana <fdmanana@suse.com>
> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> RFC->v1:
> - Remove the RFC tag
>   Since the cause is pinned and verified, no need for RFC.
>
> - Only wait for nocow writes for dev-replace
>   CoW writes are safe as they will never overwrite extents in commit
>   root.
>
> - Put the wait call into proper lock context
>   Previous wait happens after scrub_pause_off(), which can cause
>   deadlock where we may need to commit transaction in one of the
>   wait calls. But since we are in scrub_pause_off() environment,
>   transaction commit will wait us to continue, causing a wait-on-self
>   deadlock.
> ---
>  fs/btrfs/scrub.c | 30 +++++++++++++++++++++++++-----
>  1 file changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 21de630b0730..5aa486cad298 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                  * This can easily boost the amount of SYSTEM chunks if cleaner
>                  * thread can't be triggered fast enough, and use up all space
>                  * of btrfs_super_block::sys_chunk_array
> +                *
> +                * While for dev replace, we need to try our best to mark block
> +                * group RO, to prevent race between:
> +                * - Write duplication
> +                *   Contains latest data
> +                * - Scrub copy
> +                *   Contains data from commit tree
> +                *
> +                * If target block group is not marked RO, nocow writes can
> +                * be overwritten by scrub copy, causing data corruption.
> +                * So for dev-replace, it's not allowed to continue if a block
> +                * group is not RO.
>                  */
> -               ret = btrfs_inc_block_group_ro(cache, false);
> -               scrub_pause_off(fs_info);
> -
> +               ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
>                 if (ret == 0) {
>                         ro_set = 1;
> -               } else if (ret == -ENOSPC) {
> +               } else if (ret == -ENOSPC && !sctx->is_dev_replace) {
>                         /*
>                          * btrfs_inc_block_group_ro return -ENOSPC when it
>                          * failed in creating new chunk for metadata.
> -                        * It is not a problem for scrub/replace, because
> +                        * It is not a problem for scrub, because
>                          * metadata are always cowed, and our scrub paused
>                          * commit_transactions.
>                          */
> @@ -3596,9 +3606,19 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                         btrfs_warn(fs_info,
>                                    "failed setting block group ro: %d", ret);
>                         btrfs_put_block_group(cache);
> +                       scrub_pause_off(fs_info);
>                         break;
>                 }
>
> +               /*
> +                * Now the target block is marked RO, wait for nocow writes to
> +                * finish before dev-replace.
> +                * COW is fine, as COW never overwrites extents in commit tree.
> +                */
> +               if (sctx->is_dev_replace)
> +                       btrfs_wait_nocow_writers(cache);

So this only waits for nocow ordered extents to be created - it
doesn't wait for them to complete.
After that you still need to call:

btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);

Other than that, looks good to me.

Thanks.

> +
> +               scrub_pause_off(fs_info);
>                 down_write(&dev_replace->rwsem);
>                 dev_replace->cursor_right = found_key.offset + length;
>                 dev_replace->cursor_left = found_key.offset;
> --
> 2.25.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-23 12:06 ` Filipe Manana
@ 2020-01-23 12:28   ` Qu WenRuo
  2020-01-23 13:39   ` Qu Wenruo
  1 sibling, 0 replies; 17+ messages in thread
From: Qu WenRuo @ 2020-01-23 12:28 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, Filipe Manana



On 2020/1/23 下午8:06, Filipe Manana wrote:
> On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
>> looped runs can lead to random failure, where scrub finds csum error.
>>
>> The possibility is not high, around 1/20 to 1/100, but it's causing data
>> corruption.
>>
>> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
>> check free space before marking a block group RO")
>>
>> [CAUSE]
>> Dev-replace has two source of writes:
>> - Write duplication
>>   All writes to source device will also be duplicated to target device.
>>
>>   Content:      Latest data/meta
> 
> I find the term "latest" a bit confusing, perhaps "not yet persisted
> data and metadata" is more clear.
> 
>>
>> - Scrub copy
>>   Dev-replace reused scrub code to iterate through existing extents, and
>>   copy the verified data to target device.
>>
>>   Content:      Data/meta in commit root
> 
> And so here "previously persisted data and metadata".
> 
>>
>> The difference in contents makes the following race possible:
>>         Regular Writer          |       Dev-replace
>> -----------------------------------------------------------------
>>   ^                             |
>>   | Preallocate one data extent |
>>   | at bytenr X, len 1M         |
>>   v                             |
>>   ^ Commit transaction          |
>>   | Now extent [X, X+1M) is in  |
>>   v commit root                 |
>>  ================== Dev replace starts =========================
>>                                 | ^
>>                                 | | Scrub extent [X, X+1M)
>>                                 | | Read [X, X+1M)
>>                                 | | (The content are mostly garbage
>>                                 | |  since it's preallocated)
>>   ^                             | v
>>   | Write back happens for      |
>>   | extent [X, X+512K)          |
>>   | New data writes to both     |
>>   | source and target dev.      |
>>   v                             |
>>                                 | ^
>>                                 | | Scrub writes back extent [X, X+1M)
>>                                 | | to target device.
>>                                 | | This will over write the new data in
>>                                 | | [X, X+512K)
>>                                 | v
>>
>> This race can only happen for nocow writes. Thus metadata and data cow
>> writes are safe, as COW will never overwrite extents of previous trans
>> (in commit root).
>>
>> This behavior can be confirmed by disabling all fallocate related calls
>> in fsstress (*), then all related tests can pass a 2000 run loop.
>>
>> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
>>                    -f collapse=0 -f punch=0 -f resvsp=0"
>>    I didn't expect resvsp ioctl will fallback to fallocate in VFS...
>>
>> [FIX]
>> Make dev-replace to require mandatory block group RO, and wait for current
>> nocow writes before calling scrub_chunk().
>>
>> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
>> when set_block_ro failed") for dev-replace path.
>>
>> ENOSPC for dev-replace is still much better than data corruption.
> 
> Technically if we flag the block group RO without being able to
> persist that due to ENOSPC, it's still ok.

Right, I will change the words, since it only slightly increase the
possibility of ENOSPC, not ensured to cause ENOSPC and abort replace.

> We just want to prevent nocow writes racing with scrub copying
> procedure. But that's something for some other time, and this is fine
> to me.
> 
>>
>> Reported-by: Filipe Manana <fdmanana@suse.com>
>> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
>> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> RFC->v1:
>> - Remove the RFC tag
>>   Since the cause is pinned and verified, no need for RFC.
>>
>> - Only wait for nocow writes for dev-replace
>>   CoW writes are safe as they will never overwrite extents in commit
>>   root.
>>
[...]
>> +               /*
>> +                * Now the target block is marked RO, wait for nocow writes to
>> +                * finish before dev-replace.
>> +                * COW is fine, as COW never overwrites extents in commit tree.
>> +                */
>> +               if (sctx->is_dev_replace)
>> +                       btrfs_wait_nocow_writers(cache);
> 
> So this only waits for nocow ordered extents to be created - it
> doesn't wait for them to complete.
> After that you still need to call:
> 
> btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);

Forgot that, although 1000 runs doesn't expose any problem you are
completely right.

I'll update the patch to address all mentioned comments.

Thanks,
Qu

> 
> Other than that, looks good to me.
> 
> Thanks.
> 
>> +
>> +               scrub_pause_off(fs_info);
>>                 down_write(&dev_replace->rwsem);
>>                 dev_replace->cursor_right = found_key.offset + length;
>>                 dev_replace->cursor_left = found_key.offset;
>> --
>> 2.25.0
>>
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-23 12:06 ` Filipe Manana
  2020-01-23 12:28   ` Qu WenRuo
@ 2020-01-23 13:39   ` Qu Wenruo
  2020-01-23 13:49     ` Filipe Manana
  1 sibling, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2020-01-23 13:39 UTC (permalink / raw)
  To: fdmanana, Qu Wenruo; +Cc: linux-btrfs, Filipe Manana


[-- Attachment #1.1: Type: text/plain, Size: 8819 bytes --]



On 2020/1/23 下午8:06, Filipe Manana wrote:
> On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
>> looped runs can lead to random failure, where scrub finds csum error.
>>
>> The possibility is not high, around 1/20 to 1/100, but it's causing data
>> corruption.
>>
>> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
>> check free space before marking a block group RO")
>>
>> [CAUSE]
>> Dev-replace has two source of writes:
>> - Write duplication
>>   All writes to source device will also be duplicated to target device.
>>
>>   Content:      Latest data/meta
> 
> I find the term "latest" a bit confusing, perhaps "not yet persisted
> data and metadata" is more clear.
> 
>>
>> - Scrub copy
>>   Dev-replace reused scrub code to iterate through existing extents, and
>>   copy the verified data to target device.
>>
>>   Content:      Data/meta in commit root
> 
> And so here "previously persisted data and metadata".
> 
>>
>> The difference in contents makes the following race possible:
>>         Regular Writer          |       Dev-replace
>> -----------------------------------------------------------------
>>   ^                             |
>>   | Preallocate one data extent |
>>   | at bytenr X, len 1M         |
>>   v                             |
>>   ^ Commit transaction          |
>>   | Now extent [X, X+1M) is in  |
>>   v commit root                 |
>>  ================== Dev replace starts =========================
>>                                 | ^
>>                                 | | Scrub extent [X, X+1M)
>>                                 | | Read [X, X+1M)
>>                                 | | (The content are mostly garbage
>>                                 | |  since it's preallocated)
>>   ^                             | v
>>   | Write back happens for      |
>>   | extent [X, X+512K)          |
>>   | New data writes to both     |
>>   | source and target dev.      |
>>   v                             |
>>                                 | ^
>>                                 | | Scrub writes back extent [X, X+1M)
>>                                 | | to target device.
>>                                 | | This will over write the new data in
>>                                 | | [X, X+512K)
>>                                 | v
>>
>> This race can only happen for nocow writes. Thus metadata and data cow
>> writes are safe, as COW will never overwrite extents of previous trans
>> (in commit root).
>>
>> This behavior can be confirmed by disabling all fallocate related calls
>> in fsstress (*), then all related tests can pass a 2000 run loop.
>>
>> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
>>                    -f collapse=0 -f punch=0 -f resvsp=0"
>>    I didn't expect resvsp ioctl will fallback to fallocate in VFS...
>>
>> [FIX]
>> Make dev-replace to require mandatory block group RO, and wait for current
>> nocow writes before calling scrub_chunk().
>>
>> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
>> when set_block_ro failed") for dev-replace path.
>>
>> ENOSPC for dev-replace is still much better than data corruption.
> 
> Technically if we flag the block group RO without being able to
> persist that due to ENOSPC, it's still ok.
> We just want to prevent nocow writes racing with scrub copying
> procedure. But that's something for some other time, and this is fine
> to me.
> 
>>
>> Reported-by: Filipe Manana <fdmanana@suse.com>
>> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
>> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> RFC->v1:
>> - Remove the RFC tag
>>   Since the cause is pinned and verified, no need for RFC.
>>
>> - Only wait for nocow writes for dev-replace
>>   CoW writes are safe as they will never overwrite extents in commit
>>   root.
>>
>> - Put the wait call into proper lock context
>>   Previous wait happens after scrub_pause_off(), which can cause
>>   deadlock where we may need to commit transaction in one of the
>>   wait calls. But since we are in scrub_pause_off() environment,
>>   transaction commit will wait us to continue, causing a wait-on-self
>>   deadlock.
>> ---
>>  fs/btrfs/scrub.c | 30 +++++++++++++++++++++++++-----
>>  1 file changed, 25 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index 21de630b0730..5aa486cad298 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>                  * This can easily boost the amount of SYSTEM chunks if cleaner
>>                  * thread can't be triggered fast enough, and use up all space
>>                  * of btrfs_super_block::sys_chunk_array
>> +                *
>> +                * While for dev replace, we need to try our best to mark block
>> +                * group RO, to prevent race between:
>> +                * - Write duplication
>> +                *   Contains latest data
>> +                * - Scrub copy
>> +                *   Contains data from commit tree
>> +                *
>> +                * If target block group is not marked RO, nocow writes can
>> +                * be overwritten by scrub copy, causing data corruption.
>> +                * So for dev-replace, it's not allowed to continue if a block
>> +                * group is not RO.
>>                  */
>> -               ret = btrfs_inc_block_group_ro(cache, false);
>> -               scrub_pause_off(fs_info);
>> -
>> +               ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
>>                 if (ret == 0) {
>>                         ro_set = 1;
>> -               } else if (ret == -ENOSPC) {
>> +               } else if (ret == -ENOSPC && !sctx->is_dev_replace) {
>>                         /*
>>                          * btrfs_inc_block_group_ro return -ENOSPC when it
>>                          * failed in creating new chunk for metadata.
>> -                        * It is not a problem for scrub/replace, because
>> +                        * It is not a problem for scrub, because
>>                          * metadata are always cowed, and our scrub paused
>>                          * commit_transactions.
>>                          */
>> @@ -3596,9 +3606,19 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>                         btrfs_warn(fs_info,
>>                                    "failed setting block group ro: %d", ret);
>>                         btrfs_put_block_group(cache);
>> +                       scrub_pause_off(fs_info);
>>                         break;
>>                 }
>>
>> +               /*
>> +                * Now the target block is marked RO, wait for nocow writes to
>> +                * finish before dev-replace.
>> +                * COW is fine, as COW never overwrites extents in commit tree.
>> +                */
>> +               if (sctx->is_dev_replace)
>> +                       btrfs_wait_nocow_writers(cache);
> 
> So this only waits for nocow ordered extents to be created - it
> doesn't wait for them to complete.

Wait for minute.

This btrfs_wait_nocow_writers() is not just triggering ordered extents
for nocow.
It waits for the nocow_writers count decreased to 0 for that block group.

Since we have already marked the block group RO, no new nocow writers
can happen, thus that counter can only decrease, no way to increase.

There are several cases involved:
- NoCOW Write back happens before bg RO
  It will increase cache->nocow_writers counter, and decrease the
  counter after finish_oredered_io().
  btrfs_wait_nocow_writers() will wait for such writes

- Writeback happens after bg RO
  Then the write will fallback to COW.

- Writeback happens after bg RO cleared (reverted back to RW)
  No need to bother at all.

Thus this should be enough, no need for btrfs_wait_ordered_roots().

Thanks,
Qu


> After that you still need to call:
> 
> btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
> 
> Other than that, looks good to me.
> 
> Thanks.
> 
>> +
>> +               scrub_pause_off(fs_info);
>>                 down_write(&dev_replace->rwsem);
>>                 dev_replace->cursor_right = found_key.offset + length;
>>                 dev_replace->cursor_left = found_key.offset;
>> --
>> 2.25.0
>>
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-23 13:39   ` Qu Wenruo
@ 2020-01-23 13:49     ` Filipe Manana
  2020-01-23 13:57       ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Filipe Manana @ 2020-01-23 13:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Filipe Manana

On Thu, Jan 23, 2020 at 1:39 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/1/23 下午8:06, Filipe Manana wrote:
> > On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> [BUG]
> >> For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
> >> looped runs can lead to random failure, where scrub finds csum error.
> >>
> >> The possibility is not high, around 1/20 to 1/100, but it's causing data
> >> corruption.
> >>
> >> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
> >> check free space before marking a block group RO")
> >>
> >> [CAUSE]
> >> Dev-replace has two source of writes:
> >> - Write duplication
> >>   All writes to source device will also be duplicated to target device.
> >>
> >>   Content:      Latest data/meta
> >
> > I find the term "latest" a bit confusing, perhaps "not yet persisted
> > data and metadata" is more clear.
> >
> >>
> >> - Scrub copy
> >>   Dev-replace reused scrub code to iterate through existing extents, and
> >>   copy the verified data to target device.
> >>
> >>   Content:      Data/meta in commit root
> >
> > And so here "previously persisted data and metadata".
> >
> >>
> >> The difference in contents makes the following race possible:
> >>         Regular Writer          |       Dev-replace
> >> -----------------------------------------------------------------
> >>   ^                             |
> >>   | Preallocate one data extent |
> >>   | at bytenr X, len 1M         |
> >>   v                             |
> >>   ^ Commit transaction          |
> >>   | Now extent [X, X+1M) is in  |
> >>   v commit root                 |
> >>  ================== Dev replace starts =========================
> >>                                 | ^
> >>                                 | | Scrub extent [X, X+1M)
> >>                                 | | Read [X, X+1M)
> >>                                 | | (The content are mostly garbage
> >>                                 | |  since it's preallocated)
> >>   ^                             | v
> >>   | Write back happens for      |
> >>   | extent [X, X+512K)          |
> >>   | New data writes to both     |
> >>   | source and target dev.      |
> >>   v                             |
> >>                                 | ^
> >>                                 | | Scrub writes back extent [X, X+1M)
> >>                                 | | to target device.
> >>                                 | | This will over write the new data in
> >>                                 | | [X, X+512K)
> >>                                 | v
> >>
> >> This race can only happen for nocow writes. Thus metadata and data cow
> >> writes are safe, as COW will never overwrite extents of previous trans
> >> (in commit root).
> >>
> >> This behavior can be confirmed by disabling all fallocate related calls
> >> in fsstress (*), then all related tests can pass a 2000 run loop.
> >>
> >> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
> >>                    -f collapse=0 -f punch=0 -f resvsp=0"
> >>    I didn't expect resvsp ioctl will fallback to fallocate in VFS...
> >>
> >> [FIX]
> >> Make dev-replace to require mandatory block group RO, and wait for current
> >> nocow writes before calling scrub_chunk().
> >>
> >> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
> >> when set_block_ro failed") for dev-replace path.
> >>
> >> ENOSPC for dev-replace is still much better than data corruption.
> >
> > Technically if we flag the block group RO without being able to
> > persist that due to ENOSPC, it's still ok.
> > We just want to prevent nocow writes racing with scrub copying
> > procedure. But that's something for some other time, and this is fine
> > to me.
> >
> >>
> >> Reported-by: Filipe Manana <fdmanana@suse.com>
> >> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> >> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >> Changelog:
> >> RFC->v1:
> >> - Remove the RFC tag
> >>   Since the cause is pinned and verified, no need for RFC.
> >>
> >> - Only wait for nocow writes for dev-replace
> >>   CoW writes are safe as they will never overwrite extents in commit
> >>   root.
> >>
> >> - Put the wait call into proper lock context
> >>   Previous wait happens after scrub_pause_off(), which can cause
> >>   deadlock where we may need to commit transaction in one of the
> >>   wait calls. But since we are in scrub_pause_off() environment,
> >>   transaction commit will wait us to continue, causing a wait-on-self
> >>   deadlock.
> >> ---
> >>  fs/btrfs/scrub.c | 30 +++++++++++++++++++++++++-----
> >>  1 file changed, 25 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> >> index 21de630b0730..5aa486cad298 100644
> >> --- a/fs/btrfs/scrub.c
> >> +++ b/fs/btrfs/scrub.c
> >> @@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>                  * This can easily boost the amount of SYSTEM chunks if cleaner
> >>                  * thread can't be triggered fast enough, and use up all space
> >>                  * of btrfs_super_block::sys_chunk_array
> >> +                *
> >> +                * While for dev replace, we need to try our best to mark block
> >> +                * group RO, to prevent race between:
> >> +                * - Write duplication
> >> +                *   Contains latest data
> >> +                * - Scrub copy
> >> +                *   Contains data from commit tree
> >> +                *
> >> +                * If target block group is not marked RO, nocow writes can
> >> +                * be overwritten by scrub copy, causing data corruption.
> >> +                * So for dev-replace, it's not allowed to continue if a block
> >> +                * group is not RO.
> >>                  */
> >> -               ret = btrfs_inc_block_group_ro(cache, false);
> >> -               scrub_pause_off(fs_info);
> >> -
> >> +               ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
> >>                 if (ret == 0) {
> >>                         ro_set = 1;
> >> -               } else if (ret == -ENOSPC) {
> >> +               } else if (ret == -ENOSPC && !sctx->is_dev_replace) {
> >>                         /*
> >>                          * btrfs_inc_block_group_ro return -ENOSPC when it
> >>                          * failed in creating new chunk for metadata.
> >> -                        * It is not a problem for scrub/replace, because
> >> +                        * It is not a problem for scrub, because
> >>                          * metadata are always cowed, and our scrub paused
> >>                          * commit_transactions.
> >>                          */
> >> @@ -3596,9 +3606,19 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>                         btrfs_warn(fs_info,
> >>                                    "failed setting block group ro: %d", ret);
> >>                         btrfs_put_block_group(cache);
> >> +                       scrub_pause_off(fs_info);
> >>                         break;
> >>                 }
> >>
> >> +               /*
> >> +                * Now the target block is marked RO, wait for nocow writes to
> >> +                * finish before dev-replace.
> >> +                * COW is fine, as COW never overwrites extents in commit tree.
> >> +                */
> >> +               if (sctx->is_dev_replace)
> >> +                       btrfs_wait_nocow_writers(cache);
> >
> > So this only waits for nocow ordered extents to be created - it
> > doesn't wait for them to complete.
>
> Wait for minute.
>
> This btrfs_wait_nocow_writers() is not just triggering ordered extents
> for nocow.
> It waits for the nocow_writers count decreased to 0 for that block group.
>
> Since we have already marked the block group RO, no new nocow writers
> can happen, thus that counter can only decrease, no way to increase.
>
> There are several cases involved:
> - NoCOW Write back happens before bg RO
>   It will increase cache->nocow_writers counter, and decrease the
>   counter after finish_oredered_io().

Nop. nocow_writers is decremented after creating the ordered extent
when starting writeback (at run_delalloc_nocow) - not when completing
the ordered extent (at btrfs_finish_ordered_io()).
Same applies direct IO.

Thanks


>   btrfs_wait_nocow_writers() will wait for such writes
>
> - Writeback happens after bg RO
>   Then the write will fallback to COW.
>
> - Writeback happens after bg RO cleared (reverted back to RW)
>   No need to bother at all.
>
> Thus this should be enough, no need for btrfs_wait_ordered_roots().
>
> Thanks,
> Qu
>
>
> > After that you still need to call:
> >
> > btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
> >
> > Other than that, looks good to me.
> >
> > Thanks.
> >
> >> +
> >> +               scrub_pause_off(fs_info);
> >>                 down_write(&dev_replace->rwsem);
> >>                 dev_replace->cursor_right = found_key.offset + length;
> >>                 dev_replace->cursor_left = found_key.offset;
> >> --
> >> 2.25.0
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-23 13:49     ` Filipe Manana
@ 2020-01-23 13:57       ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2020-01-23 13:57 UTC (permalink / raw)
  To: fdmanana; +Cc: Qu Wenruo, linux-btrfs, Filipe Manana


[-- Attachment #1.1: Type: text/plain, Size: 977 bytes --]



On 2020/1/23 下午9:49, Filipe Manana wrote:
[...]
>> This btrfs_wait_nocow_writers() is not just triggering ordered extents
>> for nocow.
>> It waits for the nocow_writers count decreased to 0 for that block group.
>>
>> Since we have already marked the block group RO, no new nocow writers
>> can happen, thus that counter can only decrease, no way to increase.
>>
>> There are several cases involved:
>> - NoCOW Write back happens before bg RO
>>   It will increase cache->nocow_writers counter, and decrease the
>>   counter after finish_oredered_io().
> 
> Nop. nocow_writers is decremented after creating the ordered extent
> when starting writeback (at run_delalloc_nocow) - not when completing
> the ordered extent (at btrfs_finish_ordered_io()).

Oh, right. The dec part still happens in run_delalloc_nocow().
So the wait_ordered_roots() call is still needed.

Thanks,
Qu

> Same applies direct IO.
> 
> Thanks
> 

>>>
>>
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-23  7:37 [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace Qu Wenruo
  2020-01-23 12:06 ` Filipe Manana
@ 2020-01-23 16:40 ` David Sterba
  2020-01-23 23:58   ` Qu Wenruo
  1 sibling, 1 reply; 17+ messages in thread
From: David Sterba @ 2020-01-23 16:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana

On Thu, Jan 23, 2020 at 03:37:59PM +0800, Qu Wenruo wrote:
> 
> Reported-by: Filipe Manana <fdmanana@suse.com>
> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")

This one is in the 5.5-rc, so I'd like to get it to the final release. I
haven't read the discussion properly so please let me know if the patch
needs another round or fixups I can do. Time for the pull request is in
a day (2 at most as it's too close to the release) but given the type of
fix it's justified. Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-23 16:40 ` David Sterba
@ 2020-01-23 23:58   ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2020-01-23 23:58 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Filipe Manana


[-- Attachment #1.1: Type: text/plain, Size: 753 bytes --]



On 2020/1/24 上午12:40, David Sterba wrote:
> On Thu, Jan 23, 2020 at 03:37:59PM +0800, Qu Wenruo wrote:
>>
>> Reported-by: Filipe Manana <fdmanana@suse.com>
>> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
>> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> 
> This one is in the 5.5-rc, so I'd like to get it to the final release. I
> haven't read the discussion properly so please let me know if the patch
> needs another round or fixups I can do. Time for the pull request is in
> a day (2 at most as it's too close to the release) but given the type of
> fix it's justified. Thanks.
> 
The v2 version has Cced to you (and the mail list).

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-25 11:35       ` David Sterba
@ 2020-01-25 12:09         ` Qu WenRuo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu WenRuo @ 2020-01-25 12:09 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, fdmanana, linux-btrfs, Filipe Manana



On 2020/1/25 下午7:35, David Sterba wrote:
> On Sat, Jan 25, 2020 at 08:36:14AM +0800, Qu Wenruo wrote:
>>>> The purpose seems to be to catch generic error codes other than
>>>> EINPROGRESS and ECNACELED, I don't see much point printing a warning in
>>>> that case. But it' a new ENOSPC problem, likely caused by the
>>>> read-only status switching.
>>>>
>>>> My test devices are 12G, there's full log of the test to see at which
>>>> phase it happened.
>>>
>>> It passes for me on 20G devices, haven't tested with 12G however
>>> (can't afford to reboot any of my vms now).
>>
>> 5G for all scratch devices, and failed to reproduce it.
>> (The full run before submitting the patch also failed to reproduce it)
> 
> 5G is not actually enough to run some of the tests that require at least
> 10G of free space (so the block device needs to be a bit larger).

BTW, that ENOSPC sounds like a metadata over-commit bug, there if we can
allocate 2 or more meta chunks, and meta rsv already needs 2 new chunks,
then btrfs_inc_block_group_ro() will always fail no matter whatever the
parameter we're using for any metadata chunk.
(Since btrfs_block_group_ro() can only pre-alloc one chunk, while our
current meta over-commit allows way more meta rsv than just one chunk)

That one can be solved by my previous per-profile avail space patchset.

> 
>>> I think that happens because before this patch we ignored ENOSPC
>>> errors, when trying to set a block group to RO, for device replace and
>>> scrub.
>>> But with this patch we don't ignore ENOSPC for device replace anymore
>>> - this is actually correct because if we ignore we can corrupt nocow
>>> writes (including writes into prealloc extents).
>>>
>>> Now if it's a real ENOSPC situation or just a bug in the space
>>> management code, it's a different thing to look at.
>>
>> I tend to take a middle land of the problem.
>>
>> For current stage, the WARN_ON() is indeed overkilled, at least for ENOSPC.
>>
>> But on the other handle, the full RO of a block group for scrub/replace
>> is also a little overkilled.
>> Just as Filipe mentioned, we only want to kill nocow writes into a block
>> group, but still allow COW writes.
>>
>> It looks like something like mark_block_group_nocow_ro() in the future
>> could reduce the possibility if not fully kill it.
> 
> Yeah this sounds doable.

But it may need more consideration since the rw -> nocow_ro -> ro state
machine can be more complex than my initial thought.

> 
>> It looks like changing the WARN_ON(ret) to WARN_ON(ret != -ENOSPC) would
>> be needed for this patch as a quick fix.
> 
> I'll remove the warning completely, as a separate patch.

That's great.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-25  0:36     ` Qu Wenruo
  2020-01-25  0:48       ` Qu Wenruo
@ 2020-01-25 11:35       ` David Sterba
  2020-01-25 12:09         ` Qu WenRuo
  1 sibling, 1 reply; 17+ messages in thread
From: David Sterba @ 2020-01-25 11:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: fdmanana, dsterba, Qu Wenruo, linux-btrfs, Filipe Manana

On Sat, Jan 25, 2020 at 08:36:14AM +0800, Qu Wenruo wrote:
> >> The purpose seems to be to catch generic error codes other than
> >> EINPROGRESS and ECNACELED, I don't see much point printing a warning in
> >> that case. But it' a new ENOSPC problem, likely caused by the
> >> read-only status switching.
> >>
> >> My test devices are 12G, there's full log of the test to see at which
> >> phase it happened.
> > 
> > It passes for me on 20G devices, haven't tested with 12G however
> > (can't afford to reboot any of my vms now).
> 
> 5G for all scratch devices, and failed to reproduce it.
> (The full run before submitting the patch also failed to reproduce it)

5G is not actually enough to run some of the tests that require at least
10G of free space (so the block device needs to be a bit larger).

> > I think that happens because before this patch we ignored ENOSPC
> > errors, when trying to set a block group to RO, for device replace and
> > scrub.
> > But with this patch we don't ignore ENOSPC for device replace anymore
> > - this is actually correct because if we ignore we can corrupt nocow
> > writes (including writes into prealloc extents).
> > 
> > Now if it's a real ENOSPC situation or just a bug in the space
> > management code, it's a different thing to look at.
> 
> I tend to take a middle land of the problem.
> 
> For current stage, the WARN_ON() is indeed overkilled, at least for ENOSPC.
> 
> But on the other handle, the full RO of a block group for scrub/replace
> is also a little overkilled.
> Just as Filipe mentioned, we only want to kill nocow writes into a block
> group, but still allow COW writes.
> 
> It looks like something like mark_block_group_nocow_ro() in the future
> could reduce the possibility if not fully kill it.

Yeah this sounds doable.

> It looks like changing the WARN_ON(ret) to WARN_ON(ret != -ENOSPC) would
> be needed for this patch as a quick fix.

I'll remove the warning completely, as a separate patch.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-25  0:36     ` Qu Wenruo
@ 2020-01-25  0:48       ` Qu Wenruo
  2020-01-25 11:35       ` David Sterba
  1 sibling, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2020-01-25  0:48 UTC (permalink / raw)
  To: fdmanana, dsterba, Qu Wenruo, linux-btrfs, Filipe Manana


[-- Attachment #1.1: Type: text/plain, Size: 14995 bytes --]



On 2020/1/25 上午8:36, Qu Wenruo wrote:
> 
> 
> On 2020/1/25 上午12:28, Filipe Manana wrote:
>> On Fri, Jan 24, 2020 at 3:12 PM David Sterba <dsterba@suse.cz> wrote:
>>>
>>> On current master branch (4703d9119972bf58) with this patch btrfs/011
>>> prints a warning from this code:
>>>
>>>  502         ret = btrfs_dev_replace_finishing(fs_info, ret);
>>>  503         if (ret == -EINPROGRESS) {
>>>  504                 ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
>>>  505         } else if (ret != -ECANCELED) {
>>>  506                 WARN_ON(ret);
>>>  507         }
>>>
>>> with ret == -ENOSPC
>>>
>>> The purpose seems to be to catch generic error codes other than
>>> EINPROGRESS and ECNACELED, I don't see much point printing a warning in
>>> that case. But it' a new ENOSPC problem, likely caused by the
>>> read-only status switching.
>>>
>>> My test devices are 12G, there's full log of the test to see at which
>>> phase it happened.
>>
>> It passes for me on 20G devices, haven't tested with 12G however
>> (can't afford to reboot any of my vms now).
> 
> 5G for all scratch devices, and failed to reproduce it.
> (The full run before submitting the patch also failed to reproduce it)
> 
>>
>> I think that happens because before this patch we ignored ENOSPC
>> errors, when trying to set a block group to RO, for device replace and
>> scrub.
>> But with this patch we don't ignore ENOSPC for device replace anymore
>> - this is actually correct because if we ignore we can corrupt nocow
>> writes (including writes into prealloc extents).
>>
>> Now if it's a real ENOSPC situation or just a bug in the space
>> management code, it's a different thing to look at.
> 
> I tend to take a middle land of the problem.
> 
> For current stage, the WARN_ON() is indeed overkilled, at least for ENOSPC.
> 
> But on the other handle, the full RO of a block group for scrub/replace
> is also a little overkilled.
> Just as Filipe mentioned, we only want to kill nocow writes into a block
> group, but still allow COW writes.
> 
> It looks like something like mark_block_group_nocow_ro() in the future
> could reduce the possibility if not fully kill it.
> 
> 
> It looks like changing the WARN_ON(ret) to WARN_ON(ret != -ENOSPC) would
> be needed for this patch as a quick fix.

Well, it should be WARN_ON(ret && ret != -ENOSPC);

> 
> Thanks,
> Qu
> 
>>
>> Thanks.
>>
>>>
>>> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
>>> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
>>> [ 1892.059993] ------------[ cut here ]------------
>>> [ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
>>> [ 1892.068628] Modules linked in: xxhash_generic btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
>>> [ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
>>> [ 1892.076559] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
>>> [ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
>>> [ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
>>> [ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
>>> [ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
>>> [ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
>>> [ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
>>> [ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
>>> [ 1892.108642] FS:  00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
>>> [ 1892.111558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
>>> [ 1892.115814] Call Trace:
>>> [ 1892.116896]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
>>> [ 1892.118962]  btrfs_ioctl+0x1d62/0x2550 [btrfs]
>>> [ 1892.120633]  ? __lock_acquire+0x272/0x1320
>>> [ 1892.122177]  ? kvm_sched_clock_read+0x14/0x30
>>> [ 1892.123629]  ? do_sigaction+0xf8/0x240
>>> [ 1892.124919]  ? kvm_sched_clock_read+0x14/0x30
>>> [ 1892.126418]  ? sched_clock+0x5/0x10
>>> [ 1892.127534]  ? sched_clock_cpu+0x10/0x120
>>> [ 1892.129096]  ? do_sigaction+0xf8/0x240
>>> [ 1892.130525]  ? do_vfs_ioctl+0x56e/0x770
>>> [ 1892.131818]  do_vfs_ioctl+0x56e/0x770
>>> [ 1892.133012]  ? do_sigaction+0xf8/0x240
>>> [ 1892.134228]  ksys_ioctl+0x3a/0x70
>>> [ 1892.135447]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>>> [ 1892.137148]  __x64_sys_ioctl+0x16/0x20
>>> [ 1892.138702]  do_syscall_64+0x50/0x210
>>> [ 1892.140095]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> [ 1892.141911] RIP: 0033:0x7fcaed61e387
>>> [ 1892.149247] RSP: 002b:00007ffcb47fc2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [ 1892.151839] RAX: ffffffffffffffda RBX: 00007ffcb47ff12d RCX: 00007fcaed61e387
>>> [ 1892.154060] RDX: 00007ffcb47fd160 RSI: 00000000ca289435 RDI: 0000000000000003
>>> [ 1892.156191] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
>>> [ 1892.158317] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000003
>>> [ 1892.160564] R13: 0000000000000001 R14: 0000000000000000 R15: 00005624ff1602e0
>>> [ 1892.162736] irq event stamp: 244626
>>> [ 1892.164191] hardirqs last  enabled at (244625): [<ffffffff9178860e>] _raw_spin_unlock_irqrestore+0x3e/0x50
>>> [ 1892.167531] hardirqs last disabled at (244626): [<ffffffff91002aeb>] trace_hardirqs_off_thunk+0x1a/0x1c
>>> [ 1892.170875] softirqs last  enabled at (244482): [<ffffffff91a00358>] __do_softirq+0x358/0x52b
>>> [ 1892.174005] softirqs last disabled at (244471): [<ffffffff9108ac1d>] irq_exit+0x9d/0xb0
>>> [ 1892.176555] ---[ end trace d72b653b61eb7d20 ]---
>>> [failed, exit status 1] [14:22:59]- output mismatch (see /tmp/fstests/results//btrfs/011.out.bad)
>>>     --- tests/btrfs/011.out     2018-04-12 16:57:00.608225550 +0000
>>>     +++ /tmp/fstests/results//btrfs/011.out.bad 2020-01-24 14:22:59.248000000 +0000
>>>     @@ -1,3 +1,4 @@
>>>      QA output created by 011
>>>      *** test btrfs replace
>>>     -*** done
>>>     +failed: '/sbin/btrfs replace start -Bf -r /dev/vdd /dev/vdb /tmp/scratch'
>>>     +(see /tmp/fstests/results//btrfs/011.full for details)
>>>     ...
>>>     (Run 'diff -u /tmp/fstests/tests/btrfs/011.out /tmp/fstests/results//btrfs/011.out.bad'  to see the entire diff)
>>>
>>> btrfs/011               [13:59:50][  504.570047] run fstests btrfs/011 at 2020-01-24 13:59:50
>>> [  505.710501] BTRFS: device fsid cd48459b-2332-425b-9d4e-324021eb0f2a devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20824)
>>> [  505.740305] BTRFS info (device vdb): turning on discard
>>> [  505.742954] BTRFS info (device vdb): disk space caching is enabled
>>> [  505.745007] BTRFS info (device vdb): has skinny extents
>>> [  505.747096] BTRFS info (device vdb): flagging fs with big metadata feature
>>> [  505.755093] BTRFS info (device vdb): checking UUID tree
>>> [  521.548385] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>>> [  525.294200] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>>> [  526.798086] BTRFS info (device vdb): scrub: started on devid 1
>>> [  528.104425] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  528.393736] BTRFS info (device vdc): turning on discard
>>> [  528.396009] BTRFS info (device vdc): disk space caching is enabled
>>> [  528.398144] BTRFS info (device vdc): has skinny extents
>>> [  528.597381] BTRFS: device fsid 20fd7216-ce75-4761-bb61-a5819aef05b6 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (27027)
>>> [  528.626562] BTRFS info (device vdb): turning on discard
>>> [  528.629852] BTRFS info (device vdb): disk space caching is enabled
>>> [  528.633442] BTRFS info (device vdb): has skinny extents
>>> [  528.640914] BTRFS info (device vdb): checking UUID tree
>>> [  548.785255] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>>> [  551.256550] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>>> [  557.620409] BTRFS info (device vdb): scrub: started on devid 1
>>> [  559.258036] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  559.646237] BTRFS info (device vdc): turning on discard
>>> [  559.649237] BTRFS info (device vdc): disk space caching is enabled
>>> [  559.652122] BTRFS info (device vdc): has skinny extents
>>> [  561.076431] BTRFS: device fsid e8c0f848-fd83-4401-95ff-503d3bba3c08 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (750)
>>> [  561.103870] BTRFS info (device vdb): turning on discard
>>> [  561.106896] BTRFS info (device vdb): disk space caching is enabled
>>> [  561.110432] BTRFS info (device vdb): has skinny extents
>>> [  561.113446] BTRFS info (device vdb): flagging fs with big metadata feature
>>> [  561.121762] BTRFS info (device vdb): checking UUID tree
>>> [  576.690406] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>>> [  581.249317] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>>> [  584.252317] BTRFS info (device vdb): scrub: started on devid 1
>>> [  585.192576] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  586.225905] BTRFS info (device vdc): turning on discard
>>> [  586.229054] BTRFS info (device vdc): disk space caching is enabled
>>> [  586.231819] BTRFS info (device vdc): has skinny extents
>>> [  586.599308] BTRFS: device fsid dd789d35-4d2f-428c-aa03-88ffd5a734cb devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (6931)
>>> [  586.628325] BTRFS info (device vdb): turning on discard
>>> [  586.630954] BTRFS info (device vdb): disk space caching is enabled
>>> [  586.634146] BTRFS info (device vdb): has skinny extents
>>> [  586.636859] BTRFS info (device vdb): flagging fs with big metadata feature
>>> [  586.645020] BTRFS info (device vdb): checking UUID tree
>>> [  608.342513] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>>> [  611.869495] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc canceled
>>> [  646.226962] BTRFS info (device vdb): scrub: started on devid 1
>>> [  652.537808] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  654.148332] BTRFS info (device vdb): turning on discard
>>> [  654.152563] BTRFS info (device vdb): disk space caching is enabled
>>> [  654.157499] BTRFS info (device vdb): has skinny extents
>>> [  655.318709] BTRFS: device fsid cc2a56c5-5371-4b14-88af-527515683491 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (14080)
>>> [  655.341656] BTRFS info (device vdb): turning on discard
>>> [  655.344075] BTRFS info (device vdb): disk space caching is enabled
>>> [  655.346723] BTRFS info (device vdb): has skinny extents
>>> [  655.355368] BTRFS info (device vdb): checking UUID tree
>>> [  688.140556] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>>> [  696.723022] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>>> [  707.940962] BTRFS info (device vdb): scrub: started on devid 1
>>> [  711.456414] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  719.263299] BTRFS info (device vdc): turning on discard
>>> [  719.266339] BTRFS info (device vdc): disk space caching is enabled
>>> [  719.269694] BTRFS info (device vdc): has skinny extents
>>> [  722.816437] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20268)
>>> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
>>> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
>>> [  722.850081] BTRFS info (device vdb): turning on discard
>>> [  722.853321] BTRFS info (device vdb): disk space caching is enabled
>>> [  722.855710] BTRFS info (device vdb): has skinny extents
>>> [  722.857894] BTRFS info (device vdb): flagging fs with big metadata feature
>>> [  722.866879] BTRFS info (device vdb): checking UUID tree
>>> [  738.622392] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
>>> [  739.728325] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
>>> [  740.017266] BTRFS info (device vdb): scrub: started on devid 1
>>> [  740.017277] BTRFS info (device vdb): scrub: started on devid 2
>>> [  740.714789] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
>>> [  740.714965] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  741.234158] BTRFS info (device vdd): turning on discard
>>> [  741.237294] BTRFS info (device vdd): disk space caching is enabled
>>> [  741.240088] BTRFS info (device vdd): has skinny extents
>>> [  741.485130] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (26509)
>>> [  741.489012] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (26509)
>>> [  741.517997] BTRFS info (device vdb): turning on discard
>>> [  741.522388] BTRFS info (device vdb): disk space caching is enabled
>>> [  741.527084] BTRFS info (device vdb): has skinny extents
>>> [  741.528865] BTRFS info (device vdb): flagging fs with big metadata feature
>>> [  741.536809] BTRFS info (device vdb): checking UUID tree
>>> [  803.614447] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
>>> [ 1561.888757] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
>>> [ 1563.133038] BTRFS info (device vdb): scrub: started on devid 2
>>> [ 1563.133043] BTRFS info (device vdb): scrub: started on devid 1
>>> [ 1851.421233] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
>>> [ 1857.639294] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [ 1869.454605] BTRFS info (device vdd): turning on discard
>>> [ 1869.458972] BTRFS info (device vdd): disk space caching is enabled
>>> [ 1869.461062] BTRFS info (device vdd): has skinny extents
>>> [ 1891.972499] BTRFS info (device vdd): dev_replace from /dev/vdd (devid 1) to /dev/vdb started
>>> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
>>> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
>>> (stacktrace)
>>
>>
>>
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-24 16:28   ` Filipe Manana
@ 2020-01-25  0:36     ` Qu Wenruo
  2020-01-25  0:48       ` Qu Wenruo
  2020-01-25 11:35       ` David Sterba
  0 siblings, 2 replies; 17+ messages in thread
From: Qu Wenruo @ 2020-01-25  0:36 UTC (permalink / raw)
  To: fdmanana, dsterba, Qu Wenruo, linux-btrfs, Filipe Manana


[-- Attachment #1.1: Type: text/plain, Size: 14640 bytes --]



On 2020/1/25 上午12:28, Filipe Manana wrote:
> On Fri, Jan 24, 2020 at 3:12 PM David Sterba <dsterba@suse.cz> wrote:
>>
>> On current master branch (4703d9119972bf58) with this patch btrfs/011
>> prints a warning from this code:
>>
>>  502         ret = btrfs_dev_replace_finishing(fs_info, ret);
>>  503         if (ret == -EINPROGRESS) {
>>  504                 ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
>>  505         } else if (ret != -ECANCELED) {
>>  506                 WARN_ON(ret);
>>  507         }
>>
>> with ret == -ENOSPC
>>
>> The purpose seems to be to catch generic error codes other than
>> EINPROGRESS and ECNACELED, I don't see much point printing a warning in
>> that case. But it' a new ENOSPC problem, likely caused by the
>> read-only status switching.
>>
>> My test devices are 12G, there's full log of the test to see at which
>> phase it happened.
> 
> It passes for me on 20G devices, haven't tested with 12G however
> (can't afford to reboot any of my vms now).

5G for all scratch devices, and failed to reproduce it.
(The full run before submitting the patch also failed to reproduce it)

> 
> I think that happens because before this patch we ignored ENOSPC
> errors, when trying to set a block group to RO, for device replace and
> scrub.
> But with this patch we don't ignore ENOSPC for device replace anymore
> - this is actually correct because if we ignore we can corrupt nocow
> writes (including writes into prealloc extents).
> 
> Now if it's a real ENOSPC situation or just a bug in the space
> management code, it's a different thing to look at.

I tend to take a middle land of the problem.

For current stage, the WARN_ON() is indeed overkilled, at least for ENOSPC.

But on the other handle, the full RO of a block group for scrub/replace
is also a little overkilled.
Just as Filipe mentioned, we only want to kill nocow writes into a block
group, but still allow COW writes.

It looks like something like mark_block_group_nocow_ro() in the future
could reduce the possibility if not fully kill it.


It looks like changing the WARN_ON(ret) to WARN_ON(ret != -ENOSPC) would
be needed for this patch as a quick fix.

Thanks,
Qu

> 
> Thanks.
> 
>>
>> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
>> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
>> [ 1892.059993] ------------[ cut here ]------------
>> [ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
>> [ 1892.068628] Modules linked in: xxhash_generic btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
>> [ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
>> [ 1892.076559] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
>> [ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
>> [ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
>> [ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
>> [ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
>> [ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
>> [ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
>> [ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
>> [ 1892.108642] FS:  00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
>> [ 1892.111558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
>> [ 1892.115814] Call Trace:
>> [ 1892.116896]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
>> [ 1892.118962]  btrfs_ioctl+0x1d62/0x2550 [btrfs]
>> [ 1892.120633]  ? __lock_acquire+0x272/0x1320
>> [ 1892.122177]  ? kvm_sched_clock_read+0x14/0x30
>> [ 1892.123629]  ? do_sigaction+0xf8/0x240
>> [ 1892.124919]  ? kvm_sched_clock_read+0x14/0x30
>> [ 1892.126418]  ? sched_clock+0x5/0x10
>> [ 1892.127534]  ? sched_clock_cpu+0x10/0x120
>> [ 1892.129096]  ? do_sigaction+0xf8/0x240
>> [ 1892.130525]  ? do_vfs_ioctl+0x56e/0x770
>> [ 1892.131818]  do_vfs_ioctl+0x56e/0x770
>> [ 1892.133012]  ? do_sigaction+0xf8/0x240
>> [ 1892.134228]  ksys_ioctl+0x3a/0x70
>> [ 1892.135447]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>> [ 1892.137148]  __x64_sys_ioctl+0x16/0x20
>> [ 1892.138702]  do_syscall_64+0x50/0x210
>> [ 1892.140095]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [ 1892.141911] RIP: 0033:0x7fcaed61e387
>> [ 1892.149247] RSP: 002b:00007ffcb47fc2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>> [ 1892.151839] RAX: ffffffffffffffda RBX: 00007ffcb47ff12d RCX: 00007fcaed61e387
>> [ 1892.154060] RDX: 00007ffcb47fd160 RSI: 00000000ca289435 RDI: 0000000000000003
>> [ 1892.156191] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
>> [ 1892.158317] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000003
>> [ 1892.160564] R13: 0000000000000001 R14: 0000000000000000 R15: 00005624ff1602e0
>> [ 1892.162736] irq event stamp: 244626
>> [ 1892.164191] hardirqs last  enabled at (244625): [<ffffffff9178860e>] _raw_spin_unlock_irqrestore+0x3e/0x50
>> [ 1892.167531] hardirqs last disabled at (244626): [<ffffffff91002aeb>] trace_hardirqs_off_thunk+0x1a/0x1c
>> [ 1892.170875] softirqs last  enabled at (244482): [<ffffffff91a00358>] __do_softirq+0x358/0x52b
>> [ 1892.174005] softirqs last disabled at (244471): [<ffffffff9108ac1d>] irq_exit+0x9d/0xb0
>> [ 1892.176555] ---[ end trace d72b653b61eb7d20 ]---
>> [failed, exit status 1] [14:22:59]- output mismatch (see /tmp/fstests/results//btrfs/011.out.bad)
>>     --- tests/btrfs/011.out     2018-04-12 16:57:00.608225550 +0000
>>     +++ /tmp/fstests/results//btrfs/011.out.bad 2020-01-24 14:22:59.248000000 +0000
>>     @@ -1,3 +1,4 @@
>>      QA output created by 011
>>      *** test btrfs replace
>>     -*** done
>>     +failed: '/sbin/btrfs replace start -Bf -r /dev/vdd /dev/vdb /tmp/scratch'
>>     +(see /tmp/fstests/results//btrfs/011.full for details)
>>     ...
>>     (Run 'diff -u /tmp/fstests/tests/btrfs/011.out /tmp/fstests/results//btrfs/011.out.bad'  to see the entire diff)
>>
>> btrfs/011               [13:59:50][  504.570047] run fstests btrfs/011 at 2020-01-24 13:59:50
>> [  505.710501] BTRFS: device fsid cd48459b-2332-425b-9d4e-324021eb0f2a devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20824)
>> [  505.740305] BTRFS info (device vdb): turning on discard
>> [  505.742954] BTRFS info (device vdb): disk space caching is enabled
>> [  505.745007] BTRFS info (device vdb): has skinny extents
>> [  505.747096] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  505.755093] BTRFS info (device vdb): checking UUID tree
>> [  521.548385] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>> [  525.294200] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>> [  526.798086] BTRFS info (device vdb): scrub: started on devid 1
>> [  528.104425] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  528.393736] BTRFS info (device vdc): turning on discard
>> [  528.396009] BTRFS info (device vdc): disk space caching is enabled
>> [  528.398144] BTRFS info (device vdc): has skinny extents
>> [  528.597381] BTRFS: device fsid 20fd7216-ce75-4761-bb61-a5819aef05b6 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (27027)
>> [  528.626562] BTRFS info (device vdb): turning on discard
>> [  528.629852] BTRFS info (device vdb): disk space caching is enabled
>> [  528.633442] BTRFS info (device vdb): has skinny extents
>> [  528.640914] BTRFS info (device vdb): checking UUID tree
>> [  548.785255] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>> [  551.256550] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>> [  557.620409] BTRFS info (device vdb): scrub: started on devid 1
>> [  559.258036] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  559.646237] BTRFS info (device vdc): turning on discard
>> [  559.649237] BTRFS info (device vdc): disk space caching is enabled
>> [  559.652122] BTRFS info (device vdc): has skinny extents
>> [  561.076431] BTRFS: device fsid e8c0f848-fd83-4401-95ff-503d3bba3c08 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (750)
>> [  561.103870] BTRFS info (device vdb): turning on discard
>> [  561.106896] BTRFS info (device vdb): disk space caching is enabled
>> [  561.110432] BTRFS info (device vdb): has skinny extents
>> [  561.113446] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  561.121762] BTRFS info (device vdb): checking UUID tree
>> [  576.690406] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>> [  581.249317] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>> [  584.252317] BTRFS info (device vdb): scrub: started on devid 1
>> [  585.192576] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  586.225905] BTRFS info (device vdc): turning on discard
>> [  586.229054] BTRFS info (device vdc): disk space caching is enabled
>> [  586.231819] BTRFS info (device vdc): has skinny extents
>> [  586.599308] BTRFS: device fsid dd789d35-4d2f-428c-aa03-88ffd5a734cb devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (6931)
>> [  586.628325] BTRFS info (device vdb): turning on discard
>> [  586.630954] BTRFS info (device vdb): disk space caching is enabled
>> [  586.634146] BTRFS info (device vdb): has skinny extents
>> [  586.636859] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  586.645020] BTRFS info (device vdb): checking UUID tree
>> [  608.342513] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>> [  611.869495] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc canceled
>> [  646.226962] BTRFS info (device vdb): scrub: started on devid 1
>> [  652.537808] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  654.148332] BTRFS info (device vdb): turning on discard
>> [  654.152563] BTRFS info (device vdb): disk space caching is enabled
>> [  654.157499] BTRFS info (device vdb): has skinny extents
>> [  655.318709] BTRFS: device fsid cc2a56c5-5371-4b14-88af-527515683491 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (14080)
>> [  655.341656] BTRFS info (device vdb): turning on discard
>> [  655.344075] BTRFS info (device vdb): disk space caching is enabled
>> [  655.346723] BTRFS info (device vdb): has skinny extents
>> [  655.355368] BTRFS info (device vdb): checking UUID tree
>> [  688.140556] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>> [  696.723022] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>> [  707.940962] BTRFS info (device vdb): scrub: started on devid 1
>> [  711.456414] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  719.263299] BTRFS info (device vdc): turning on discard
>> [  719.266339] BTRFS info (device vdc): disk space caching is enabled
>> [  719.269694] BTRFS info (device vdc): has skinny extents
>> [  722.816437] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20268)
>> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
>> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
>> [  722.850081] BTRFS info (device vdb): turning on discard
>> [  722.853321] BTRFS info (device vdb): disk space caching is enabled
>> [  722.855710] BTRFS info (device vdb): has skinny extents
>> [  722.857894] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  722.866879] BTRFS info (device vdb): checking UUID tree
>> [  738.622392] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
>> [  739.728325] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
>> [  740.017266] BTRFS info (device vdb): scrub: started on devid 1
>> [  740.017277] BTRFS info (device vdb): scrub: started on devid 2
>> [  740.714789] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
>> [  740.714965] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  741.234158] BTRFS info (device vdd): turning on discard
>> [  741.237294] BTRFS info (device vdd): disk space caching is enabled
>> [  741.240088] BTRFS info (device vdd): has skinny extents
>> [  741.485130] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (26509)
>> [  741.489012] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (26509)
>> [  741.517997] BTRFS info (device vdb): turning on discard
>> [  741.522388] BTRFS info (device vdb): disk space caching is enabled
>> [  741.527084] BTRFS info (device vdb): has skinny extents
>> [  741.528865] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  741.536809] BTRFS info (device vdb): checking UUID tree
>> [  803.614447] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
>> [ 1561.888757] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
>> [ 1563.133038] BTRFS info (device vdb): scrub: started on devid 2
>> [ 1563.133043] BTRFS info (device vdb): scrub: started on devid 1
>> [ 1851.421233] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
>> [ 1857.639294] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [ 1869.454605] BTRFS info (device vdd): turning on discard
>> [ 1869.458972] BTRFS info (device vdd): disk space caching is enabled
>> [ 1869.461062] BTRFS info (device vdd): has skinny extents
>> [ 1891.972499] BTRFS info (device vdd): dev_replace from /dev/vdd (devid 1) to /dev/vdb started
>> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
>> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
>> (stacktrace)
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-24 14:44 ` David Sterba
@ 2020-01-24 16:28   ` Filipe Manana
  2020-01-25  0:36     ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Filipe Manana @ 2020-01-24 16:28 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Filipe Manana

On Fri, Jan 24, 2020 at 3:12 PM David Sterba <dsterba@suse.cz> wrote:
>
> On current master branch (4703d9119972bf58) with this patch btrfs/011
> prints a warning from this code:
>
>  502         ret = btrfs_dev_replace_finishing(fs_info, ret);
>  503         if (ret == -EINPROGRESS) {
>  504                 ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
>  505         } else if (ret != -ECANCELED) {
>  506                 WARN_ON(ret);
>  507         }
>
> with ret == -ENOSPC
>
> The purpose seems to be to catch generic error codes other than
> EINPROGRESS and ECNACELED, I don't see much point printing a warning in
> that case. But it' a new ENOSPC problem, likely caused by the
> read-only status switching.
>
> My test devices are 12G, there's full log of the test to see at which
> phase it happened.

It passes for me on 20G devices, haven't tested with 12G however
(can't afford to reboot any of my vms now).

I think that happens because before this patch we ignored ENOSPC
errors, when trying to set a block group to RO, for device replace and
scrub.
But with this patch we don't ignore ENOSPC for device replace anymore
- this is actually correct because if we ignore we can corrupt nocow
writes (including writes into prealloc extents).

Now if it's a real ENOSPC situation or just a bug in the space
management code, it's a different thing to look at.

Thanks.

>
> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
> [ 1892.059993] ------------[ cut here ]------------
> [ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
> [ 1892.068628] Modules linked in: xxhash_generic btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
> [ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
> [ 1892.076559] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> [ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
> [ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
> [ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
> [ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
> [ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
> [ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
> [ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
> [ 1892.108642] FS:  00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
> [ 1892.111558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
> [ 1892.115814] Call Trace:
> [ 1892.116896]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
> [ 1892.118962]  btrfs_ioctl+0x1d62/0x2550 [btrfs]
> [ 1892.120633]  ? __lock_acquire+0x272/0x1320
> [ 1892.122177]  ? kvm_sched_clock_read+0x14/0x30
> [ 1892.123629]  ? do_sigaction+0xf8/0x240
> [ 1892.124919]  ? kvm_sched_clock_read+0x14/0x30
> [ 1892.126418]  ? sched_clock+0x5/0x10
> [ 1892.127534]  ? sched_clock_cpu+0x10/0x120
> [ 1892.129096]  ? do_sigaction+0xf8/0x240
> [ 1892.130525]  ? do_vfs_ioctl+0x56e/0x770
> [ 1892.131818]  do_vfs_ioctl+0x56e/0x770
> [ 1892.133012]  ? do_sigaction+0xf8/0x240
> [ 1892.134228]  ksys_ioctl+0x3a/0x70
> [ 1892.135447]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 1892.137148]  __x64_sys_ioctl+0x16/0x20
> [ 1892.138702]  do_syscall_64+0x50/0x210
> [ 1892.140095]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 1892.141911] RIP: 0033:0x7fcaed61e387
> [ 1892.149247] RSP: 002b:00007ffcb47fc2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1892.151839] RAX: ffffffffffffffda RBX: 00007ffcb47ff12d RCX: 00007fcaed61e387
> [ 1892.154060] RDX: 00007ffcb47fd160 RSI: 00000000ca289435 RDI: 0000000000000003
> [ 1892.156191] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
> [ 1892.158317] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000003
> [ 1892.160564] R13: 0000000000000001 R14: 0000000000000000 R15: 00005624ff1602e0
> [ 1892.162736] irq event stamp: 244626
> [ 1892.164191] hardirqs last  enabled at (244625): [<ffffffff9178860e>] _raw_spin_unlock_irqrestore+0x3e/0x50
> [ 1892.167531] hardirqs last disabled at (244626): [<ffffffff91002aeb>] trace_hardirqs_off_thunk+0x1a/0x1c
> [ 1892.170875] softirqs last  enabled at (244482): [<ffffffff91a00358>] __do_softirq+0x358/0x52b
> [ 1892.174005] softirqs last disabled at (244471): [<ffffffff9108ac1d>] irq_exit+0x9d/0xb0
> [ 1892.176555] ---[ end trace d72b653b61eb7d20 ]---
> [failed, exit status 1] [14:22:59]- output mismatch (see /tmp/fstests/results//btrfs/011.out.bad)
>     --- tests/btrfs/011.out     2018-04-12 16:57:00.608225550 +0000
>     +++ /tmp/fstests/results//btrfs/011.out.bad 2020-01-24 14:22:59.248000000 +0000
>     @@ -1,3 +1,4 @@
>      QA output created by 011
>      *** test btrfs replace
>     -*** done
>     +failed: '/sbin/btrfs replace start -Bf -r /dev/vdd /dev/vdb /tmp/scratch'
>     +(see /tmp/fstests/results//btrfs/011.full for details)
>     ...
>     (Run 'diff -u /tmp/fstests/tests/btrfs/011.out /tmp/fstests/results//btrfs/011.out.bad'  to see the entire diff)
>
> btrfs/011               [13:59:50][  504.570047] run fstests btrfs/011 at 2020-01-24 13:59:50
> [  505.710501] BTRFS: device fsid cd48459b-2332-425b-9d4e-324021eb0f2a devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20824)
> [  505.740305] BTRFS info (device vdb): turning on discard
> [  505.742954] BTRFS info (device vdb): disk space caching is enabled
> [  505.745007] BTRFS info (device vdb): has skinny extents
> [  505.747096] BTRFS info (device vdb): flagging fs with big metadata feature
> [  505.755093] BTRFS info (device vdb): checking UUID tree
> [  521.548385] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
> [  525.294200] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
> [  526.798086] BTRFS info (device vdb): scrub: started on devid 1
> [  528.104425] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  528.393736] BTRFS info (device vdc): turning on discard
> [  528.396009] BTRFS info (device vdc): disk space caching is enabled
> [  528.398144] BTRFS info (device vdc): has skinny extents
> [  528.597381] BTRFS: device fsid 20fd7216-ce75-4761-bb61-a5819aef05b6 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (27027)
> [  528.626562] BTRFS info (device vdb): turning on discard
> [  528.629852] BTRFS info (device vdb): disk space caching is enabled
> [  528.633442] BTRFS info (device vdb): has skinny extents
> [  528.640914] BTRFS info (device vdb): checking UUID tree
> [  548.785255] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
> [  551.256550] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
> [  557.620409] BTRFS info (device vdb): scrub: started on devid 1
> [  559.258036] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  559.646237] BTRFS info (device vdc): turning on discard
> [  559.649237] BTRFS info (device vdc): disk space caching is enabled
> [  559.652122] BTRFS info (device vdc): has skinny extents
> [  561.076431] BTRFS: device fsid e8c0f848-fd83-4401-95ff-503d3bba3c08 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (750)
> [  561.103870] BTRFS info (device vdb): turning on discard
> [  561.106896] BTRFS info (device vdb): disk space caching is enabled
> [  561.110432] BTRFS info (device vdb): has skinny extents
> [  561.113446] BTRFS info (device vdb): flagging fs with big metadata feature
> [  561.121762] BTRFS info (device vdb): checking UUID tree
> [  576.690406] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
> [  581.249317] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
> [  584.252317] BTRFS info (device vdb): scrub: started on devid 1
> [  585.192576] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  586.225905] BTRFS info (device vdc): turning on discard
> [  586.229054] BTRFS info (device vdc): disk space caching is enabled
> [  586.231819] BTRFS info (device vdc): has skinny extents
> [  586.599308] BTRFS: device fsid dd789d35-4d2f-428c-aa03-88ffd5a734cb devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (6931)
> [  586.628325] BTRFS info (device vdb): turning on discard
> [  586.630954] BTRFS info (device vdb): disk space caching is enabled
> [  586.634146] BTRFS info (device vdb): has skinny extents
> [  586.636859] BTRFS info (device vdb): flagging fs with big metadata feature
> [  586.645020] BTRFS info (device vdb): checking UUID tree
> [  608.342513] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
> [  611.869495] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc canceled
> [  646.226962] BTRFS info (device vdb): scrub: started on devid 1
> [  652.537808] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  654.148332] BTRFS info (device vdb): turning on discard
> [  654.152563] BTRFS info (device vdb): disk space caching is enabled
> [  654.157499] BTRFS info (device vdb): has skinny extents
> [  655.318709] BTRFS: device fsid cc2a56c5-5371-4b14-88af-527515683491 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (14080)
> [  655.341656] BTRFS info (device vdb): turning on discard
> [  655.344075] BTRFS info (device vdb): disk space caching is enabled
> [  655.346723] BTRFS info (device vdb): has skinny extents
> [  655.355368] BTRFS info (device vdb): checking UUID tree
> [  688.140556] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
> [  696.723022] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
> [  707.940962] BTRFS info (device vdb): scrub: started on devid 1
> [  711.456414] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  719.263299] BTRFS info (device vdc): turning on discard
> [  719.266339] BTRFS info (device vdc): disk space caching is enabled
> [  719.269694] BTRFS info (device vdc): has skinny extents
> [  722.816437] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20268)
> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
> [  722.850081] BTRFS info (device vdb): turning on discard
> [  722.853321] BTRFS info (device vdb): disk space caching is enabled
> [  722.855710] BTRFS info (device vdb): has skinny extents
> [  722.857894] BTRFS info (device vdb): flagging fs with big metadata feature
> [  722.866879] BTRFS info (device vdb): checking UUID tree
> [  738.622392] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
> [  739.728325] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
> [  740.017266] BTRFS info (device vdb): scrub: started on devid 1
> [  740.017277] BTRFS info (device vdb): scrub: started on devid 2
> [  740.714789] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
> [  740.714965] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  741.234158] BTRFS info (device vdd): turning on discard
> [  741.237294] BTRFS info (device vdd): disk space caching is enabled
> [  741.240088] BTRFS info (device vdd): has skinny extents
> [  741.485130] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (26509)
> [  741.489012] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (26509)
> [  741.517997] BTRFS info (device vdb): turning on discard
> [  741.522388] BTRFS info (device vdb): disk space caching is enabled
> [  741.527084] BTRFS info (device vdb): has skinny extents
> [  741.528865] BTRFS info (device vdb): flagging fs with big metadata feature
> [  741.536809] BTRFS info (device vdb): checking UUID tree
> [  803.614447] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
> [ 1561.888757] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
> [ 1563.133038] BTRFS info (device vdb): scrub: started on devid 2
> [ 1563.133043] BTRFS info (device vdb): scrub: started on devid 1
> [ 1851.421233] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
> [ 1857.639294] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [ 1869.454605] BTRFS info (device vdd): turning on discard
> [ 1869.458972] BTRFS info (device vdd): disk space caching is enabled
> [ 1869.461062] BTRFS info (device vdd): has skinny extents
> [ 1891.972499] BTRFS info (device vdd): dev_replace from /dev/vdd (devid 1) to /dev/vdb started
> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
> (stacktrace)



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-23 23:58 Qu Wenruo
  2020-01-24  9:24 ` Filipe Manana
@ 2020-01-24 14:44 ` David Sterba
  2020-01-24 16:28   ` Filipe Manana
  1 sibling, 1 reply; 17+ messages in thread
From: David Sterba @ 2020-01-24 14:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba, Filipe Manana

On current master branch (4703d9119972bf58) with this patch btrfs/011
prints a warning from this code:

 502         ret = btrfs_dev_replace_finishing(fs_info, ret);
 503         if (ret == -EINPROGRESS) {
 504                 ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
 505         } else if (ret != -ECANCELED) {
 506                 WARN_ON(ret);
 507         }

with ret == -ENOSPC

The purpose seems to be to catch generic error codes other than
EINPROGRESS and ECNACELED, I don't see much point printing a warning in
that case. But it' a new ENOSPC problem, likely caused by the
read-only status switching.

My test devices are 12G, there's full log of the test to see at which
phase it happened.

[ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
[ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
[ 1892.059993] ------------[ cut here ]------------
[ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
[ 1892.068628] Modules linked in: xxhash_generic btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
[ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
[ 1892.076559] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
[ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
[ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
[ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
[ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
[ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
[ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
[ 1892.108642] FS:  00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
[ 1892.111558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
[ 1892.115814] Call Trace:
[ 1892.116896]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
[ 1892.118962]  btrfs_ioctl+0x1d62/0x2550 [btrfs]
[ 1892.120633]  ? __lock_acquire+0x272/0x1320
[ 1892.122177]  ? kvm_sched_clock_read+0x14/0x30
[ 1892.123629]  ? do_sigaction+0xf8/0x240
[ 1892.124919]  ? kvm_sched_clock_read+0x14/0x30
[ 1892.126418]  ? sched_clock+0x5/0x10
[ 1892.127534]  ? sched_clock_cpu+0x10/0x120
[ 1892.129096]  ? do_sigaction+0xf8/0x240
[ 1892.130525]  ? do_vfs_ioctl+0x56e/0x770
[ 1892.131818]  do_vfs_ioctl+0x56e/0x770
[ 1892.133012]  ? do_sigaction+0xf8/0x240
[ 1892.134228]  ksys_ioctl+0x3a/0x70
[ 1892.135447]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1892.137148]  __x64_sys_ioctl+0x16/0x20
[ 1892.138702]  do_syscall_64+0x50/0x210
[ 1892.140095]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1892.141911] RIP: 0033:0x7fcaed61e387
[ 1892.149247] RSP: 002b:00007ffcb47fc2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1892.151839] RAX: ffffffffffffffda RBX: 00007ffcb47ff12d RCX: 00007fcaed61e387
[ 1892.154060] RDX: 00007ffcb47fd160 RSI: 00000000ca289435 RDI: 0000000000000003
[ 1892.156191] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[ 1892.158317] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000003
[ 1892.160564] R13: 0000000000000001 R14: 0000000000000000 R15: 00005624ff1602e0
[ 1892.162736] irq event stamp: 244626
[ 1892.164191] hardirqs last  enabled at (244625): [<ffffffff9178860e>] _raw_spin_unlock_irqrestore+0x3e/0x50
[ 1892.167531] hardirqs last disabled at (244626): [<ffffffff91002aeb>] trace_hardirqs_off_thunk+0x1a/0x1c
[ 1892.170875] softirqs last  enabled at (244482): [<ffffffff91a00358>] __do_softirq+0x358/0x52b
[ 1892.174005] softirqs last disabled at (244471): [<ffffffff9108ac1d>] irq_exit+0x9d/0xb0
[ 1892.176555] ---[ end trace d72b653b61eb7d20 ]---
[failed, exit status 1] [14:22:59]- output mismatch (see /tmp/fstests/results//btrfs/011.out.bad)
    --- tests/btrfs/011.out     2018-04-12 16:57:00.608225550 +0000
    +++ /tmp/fstests/results//btrfs/011.out.bad 2020-01-24 14:22:59.248000000 +0000
    @@ -1,3 +1,4 @@
     QA output created by 011
     *** test btrfs replace
    -*** done
    +failed: '/sbin/btrfs replace start -Bf -r /dev/vdd /dev/vdb /tmp/scratch'
    +(see /tmp/fstests/results//btrfs/011.full for details)
    ...
    (Run 'diff -u /tmp/fstests/tests/btrfs/011.out /tmp/fstests/results//btrfs/011.out.bad'  to see the entire diff)

btrfs/011               [13:59:50][  504.570047] run fstests btrfs/011 at 2020-01-24 13:59:50
[  505.710501] BTRFS: device fsid cd48459b-2332-425b-9d4e-324021eb0f2a devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20824)
[  505.740305] BTRFS info (device vdb): turning on discard
[  505.742954] BTRFS info (device vdb): disk space caching is enabled
[  505.745007] BTRFS info (device vdb): has skinny extents
[  505.747096] BTRFS info (device vdb): flagging fs with big metadata feature
[  505.755093] BTRFS info (device vdb): checking UUID tree
[  521.548385] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
[  525.294200] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
[  526.798086] BTRFS info (device vdb): scrub: started on devid 1
[  528.104425] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  528.393736] BTRFS info (device vdc): turning on discard
[  528.396009] BTRFS info (device vdc): disk space caching is enabled
[  528.398144] BTRFS info (device vdc): has skinny extents
[  528.597381] BTRFS: device fsid 20fd7216-ce75-4761-bb61-a5819aef05b6 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (27027)
[  528.626562] BTRFS info (device vdb): turning on discard
[  528.629852] BTRFS info (device vdb): disk space caching is enabled
[  528.633442] BTRFS info (device vdb): has skinny extents
[  528.640914] BTRFS info (device vdb): checking UUID tree
[  548.785255] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
[  551.256550] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
[  557.620409] BTRFS info (device vdb): scrub: started on devid 1
[  559.258036] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  559.646237] BTRFS info (device vdc): turning on discard
[  559.649237] BTRFS info (device vdc): disk space caching is enabled
[  559.652122] BTRFS info (device vdc): has skinny extents
[  561.076431] BTRFS: device fsid e8c0f848-fd83-4401-95ff-503d3bba3c08 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (750)
[  561.103870] BTRFS info (device vdb): turning on discard
[  561.106896] BTRFS info (device vdb): disk space caching is enabled
[  561.110432] BTRFS info (device vdb): has skinny extents
[  561.113446] BTRFS info (device vdb): flagging fs with big metadata feature
[  561.121762] BTRFS info (device vdb): checking UUID tree
[  576.690406] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
[  581.249317] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
[  584.252317] BTRFS info (device vdb): scrub: started on devid 1
[  585.192576] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  586.225905] BTRFS info (device vdc): turning on discard
[  586.229054] BTRFS info (device vdc): disk space caching is enabled
[  586.231819] BTRFS info (device vdc): has skinny extents
[  586.599308] BTRFS: device fsid dd789d35-4d2f-428c-aa03-88ffd5a734cb devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (6931)
[  586.628325] BTRFS info (device vdb): turning on discard
[  586.630954] BTRFS info (device vdb): disk space caching is enabled
[  586.634146] BTRFS info (device vdb): has skinny extents
[  586.636859] BTRFS info (device vdb): flagging fs with big metadata feature
[  586.645020] BTRFS info (device vdb): checking UUID tree
[  608.342513] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
[  611.869495] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc canceled
[  646.226962] BTRFS info (device vdb): scrub: started on devid 1
[  652.537808] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  654.148332] BTRFS info (device vdb): turning on discard
[  654.152563] BTRFS info (device vdb): disk space caching is enabled
[  654.157499] BTRFS info (device vdb): has skinny extents
[  655.318709] BTRFS: device fsid cc2a56c5-5371-4b14-88af-527515683491 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (14080)
[  655.341656] BTRFS info (device vdb): turning on discard
[  655.344075] BTRFS info (device vdb): disk space caching is enabled
[  655.346723] BTRFS info (device vdb): has skinny extents
[  655.355368] BTRFS info (device vdb): checking UUID tree
[  688.140556] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
[  696.723022] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
[  707.940962] BTRFS info (device vdb): scrub: started on devid 1
[  711.456414] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  719.263299] BTRFS info (device vdc): turning on discard
[  719.266339] BTRFS info (device vdc): disk space caching is enabled
[  719.269694] BTRFS info (device vdc): has skinny extents
[  722.816437] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20268)
[  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
[  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
[  722.850081] BTRFS info (device vdb): turning on discard
[  722.853321] BTRFS info (device vdb): disk space caching is enabled
[  722.855710] BTRFS info (device vdb): has skinny extents
[  722.857894] BTRFS info (device vdb): flagging fs with big metadata feature
[  722.866879] BTRFS info (device vdb): checking UUID tree
[  738.622392] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
[  739.728325] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
[  740.017266] BTRFS info (device vdb): scrub: started on devid 1
[  740.017277] BTRFS info (device vdb): scrub: started on devid 2
[  740.714789] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
[  740.714965] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  741.234158] BTRFS info (device vdd): turning on discard
[  741.237294] BTRFS info (device vdd): disk space caching is enabled
[  741.240088] BTRFS info (device vdd): has skinny extents
[  741.485130] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (26509)
[  741.489012] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (26509)
[  741.517997] BTRFS info (device vdb): turning on discard
[  741.522388] BTRFS info (device vdb): disk space caching is enabled
[  741.527084] BTRFS info (device vdb): has skinny extents
[  741.528865] BTRFS info (device vdb): flagging fs with big metadata feature
[  741.536809] BTRFS info (device vdb): checking UUID tree
[  803.614447] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
[ 1561.888757] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
[ 1563.133038] BTRFS info (device vdb): scrub: started on devid 2
[ 1563.133043] BTRFS info (device vdb): scrub: started on devid 1
[ 1851.421233] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
[ 1857.639294] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[ 1869.454605] BTRFS info (device vdd): turning on discard
[ 1869.458972] BTRFS info (device vdd): disk space caching is enabled
[ 1869.461062] BTRFS info (device vdd): has skinny extents
[ 1891.972499] BTRFS info (device vdd): dev_replace from /dev/vdd (devid 1) to /dev/vdb started
[ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
[ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
(stacktrace)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-24  9:24 ` Filipe Manana
@ 2020-01-24 13:40   ` David Sterba
  0 siblings, 0 replies; 17+ messages in thread
From: David Sterba @ 2020-01-24 13:40 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs, dsterba, Filipe Manana

On Fri, Jan 24, 2020 at 09:24:50AM +0000, Filipe Manana wrote:
> On Fri, Jan 24, 2020 at 12:07 AM Qu Wenruo <wqu@suse.com> wrote:
> > Reported-by: Filipe Manana <fdmanana@suse.com>
> > Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> > Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> Now it looks good, thanks!

Thank you both, patch on the way to 5.5-rc8 or 5.6 final.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
  2020-01-23 23:58 Qu Wenruo
@ 2020-01-24  9:24 ` Filipe Manana
  2020-01-24 13:40   ` David Sterba
  2020-01-24 14:44 ` David Sterba
  1 sibling, 1 reply; 17+ messages in thread
From: Filipe Manana @ 2020-01-24  9:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba, Filipe Manana

On Fri, Jan 24, 2020 at 12:07 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
> looped runs can lead to random failure, where scrub finds csum error.
>
> The possibility is not high, around 1/20 to 1/100, but it's causing data
> corruption.
>
> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
> check free space before marking a block group RO")
>
> [CAUSE]
> Dev-replace has two source of writes:
> - Write duplication
>   All writes to source device will also be duplicated to target device.
>
>   Content:      Not yet persisted data/meta
>
> - Scrub copy
>   Dev-replace reused scrub code to iterate through existing extents, and
>   copy the verified data to target device.
>
>   Content:      Previously persisted data and metadata
>
> The difference in contents makes the following race possible:
>         Regular Writer          |       Dev-replace
> -----------------------------------------------------------------
>   ^                             |
>   | Preallocate one data extent |
>   | at bytenr X, len 1M         |
>   v                             |
>   ^ Commit transaction          |
>   | Now extent [X, X+1M) is in  |
>   v commit root                 |
>  ================== Dev replace starts =========================
>                                 | ^
>                                 | | Scrub extent [X, X+1M)
>                                 | | Read [X, X+1M)
>                                 | | (The content are mostly garbage
>                                 | |  since it's preallocated)
>   ^                             | v
>   | Write back happens for      |
>   | extent [X, X+512K)          |
>   | New data writes to both     |
>   | source and target dev.      |
>   v                             |
>                                 | ^
>                                 | | Scrub writes back extent [X, X+1M)
>                                 | | to target device.
>                                 | | This will over write the new data in
>                                 | | [X, X+512K)
>                                 | v
>
> This race can only happen for nocow writes. Thus metadata and data cow
> writes are safe, as COW will never overwrite extents of previous trans
> (in commit root).
>
> This behavior can be confirmed by disabling all fallocate related calls
> in fsstress (*), then all related tests can pass a 2000 run loop.
>
> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
>                    -f collapse=0 -f punch=0 -f resvsp=0"
>    I didn't expect resvsp ioctl will fallback to fallocate in VFS...
>
> [FIX]
> Make dev-replace to require mandatory block group RO, and wait for current
> nocow writes before calling scrub_chunk().
>
> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
> when set_block_ro failed") for dev-replace path.
>
> The side effect is, dev-replace can be more strict on avaialble space, but
> definitely worthy to avoid data corruption.
>
> Reported-by: Filipe Manana <fdmanana@suse.com>
> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Now it looks good, thanks!

> ---
> Changelog:
> RFC->v1:
> - Remove the RFC tag
>   Since the cause is pinned and verified, no need for RFC.
>
> - Only wait for nocow writes for dev-replace
>   CoW writes are safe as they will never overwrite extents in commit
>   root.
>
> - Put the wait call into proper lock context
>   Previous wait happens after scrub_pause_off(), which can cause
>   deadlock where we may need to commit transaction in one of the
>   wait calls. But since we are in scrub_pause_off() environment,
>   transaction commit will wait us to continue, causing a wait-on-self
>   deadlock.
>
> v2:
> - Add btrfs_wait_ordered_roots() call before scrub_chunk().
> - Commit message change to avoid confusion.
> ---
>  fs/btrfs/scrub.c | 33 ++++++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 21de630b0730..fd266a2d15ec 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                  * This can easily boost the amount of SYSTEM chunks if cleaner
>                  * thread can't be triggered fast enough, and use up all space
>                  * of btrfs_super_block::sys_chunk_array
> +                *
> +                * While for dev replace, we need to try our best to mark block
> +                * group RO, to prevent race between:
> +                * - Write duplication
> +                *   Contains latest data
> +                * - Scrub copy
> +                *   Contains data from commit tree
> +                *
> +                * If target block group is not marked RO, nocow writes can
> +                * be overwritten by scrub copy, causing data corruption.
> +                * So for dev-replace, it's not allowed to continue if a block
> +                * group is not RO.
>                  */
> -               ret = btrfs_inc_block_group_ro(cache, false);
> -               scrub_pause_off(fs_info);
> -
> +               ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
>                 if (ret == 0) {
>                         ro_set = 1;
> -               } else if (ret == -ENOSPC) {
> +               } else if (ret == -ENOSPC && !sctx->is_dev_replace) {
>                         /*
>                          * btrfs_inc_block_group_ro return -ENOSPC when it
>                          * failed in creating new chunk for metadata.
> -                        * It is not a problem for scrub/replace, because
> +                        * It is not a problem for scrub, because
>                          * metadata are always cowed, and our scrub paused
>                          * commit_transactions.
>                          */
> @@ -3596,9 +3606,22 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                         btrfs_warn(fs_info,
>                                    "failed setting block group ro: %d", ret);
>                         btrfs_put_block_group(cache);
> +                       scrub_pause_off(fs_info);
>                         break;
>                 }
>
> +               /*
> +                * Now the target block is marked RO, wait for nocow writes to
> +                * finish before dev-replace.
> +                * COW is fine, as COW never overwrites extents in commit tree.
> +                */
> +               if (sctx->is_dev_replace) {
> +                       btrfs_wait_nocow_writers(cache);
> +                       btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start,
> +                                       cache->length);
> +               }
> +
> +               scrub_pause_off(fs_info);
>                 down_write(&dev_replace->rwsem);
>                 dev_replace->cursor_right = found_key.offset + length;
>                 dev_replace->cursor_left = found_key.offset;
> --
> 2.25.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace
@ 2020-01-23 23:58 Qu Wenruo
  2020-01-24  9:24 ` Filipe Manana
  2020-01-24 14:44 ` David Sterba
  0 siblings, 2 replies; 17+ messages in thread
From: Qu Wenruo @ 2020-01-23 23:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Filipe Manana

[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.

The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.

The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
check free space before marking a block group RO")

[CAUSE]
Dev-replace has two source of writes:
- Write duplication
  All writes to source device will also be duplicated to target device.

  Content:	Not yet persisted data/meta

- Scrub copy
  Dev-replace reused scrub code to iterate through existing extents, and
  copy the verified data to target device.

  Content:	Previously persisted data and metadata

The difference in contents makes the following race possible:
	Regular Writer		|	Dev-replace
-----------------------------------------------------------------
  ^                             |
  | Preallocate one data extent |
  | at bytenr X, len 1M		|
  v				|
  ^ Commit transaction		|
  | Now extent [X, X+1M) is in  |
  v commit root			|
 ================== Dev replace starts =========================
  				| ^
				| | Scrub extent [X, X+1M)
				| | Read [X, X+1M)
				| | (The content are mostly garbage
				| |  since it's preallocated)
  ^				| v
  | Write back happens for	|
  | extent [X, X+512K)		|
  | New data writes to both	|
  | source and target dev.	|
  v				|
				| ^
				| | Scrub writes back extent [X, X+1M)
				| | to target device.
				| | This will over write the new data in
				| | [X, X+512K)
				| v

This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous trans
(in commit root).

This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.

*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
		   -f collapse=0 -f punch=0 -f resvsp=0"
   I didn't expect resvsp ioctl will fallback to fallocate in VFS...

[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().

This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.

The side effect is, dev-replace can be more strict on avaialble space, but
definitely worthy to avoid data corruption.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
RFC->v1:
- Remove the RFC tag
  Since the cause is pinned and verified, no need for RFC.

- Only wait for nocow writes for dev-replace
  CoW writes are safe as they will never overwrite extents in commit
  root.

- Put the wait call into proper lock context
  Previous wait happens after scrub_pause_off(), which can cause
  deadlock where we may need to commit transaction in one of the
  wait calls. But since we are in scrub_pause_off() environment,
  transaction commit will wait us to continue, causing a wait-on-self
  deadlock.

v2:
- Add btrfs_wait_ordered_roots() call before scrub_chunk().
- Commit message change to avoid confusion.
---
 fs/btrfs/scrub.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 21de630b0730..fd266a2d15ec 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 * This can easily boost the amount of SYSTEM chunks if cleaner
 		 * thread can't be triggered fast enough, and use up all space
 		 * of btrfs_super_block::sys_chunk_array
+		 *
+		 * While for dev replace, we need to try our best to mark block
+		 * group RO, to prevent race between:
+		 * - Write duplication
+		 *   Contains latest data
+		 * - Scrub copy
+		 *   Contains data from commit tree
+		 *
+		 * If target block group is not marked RO, nocow writes can
+		 * be overwritten by scrub copy, causing data corruption.
+		 * So for dev-replace, it's not allowed to continue if a block
+		 * group is not RO.
 		 */
-		ret = btrfs_inc_block_group_ro(cache, false);
-		scrub_pause_off(fs_info);
-
+		ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
 		if (ret == 0) {
 			ro_set = 1;
-		} else if (ret == -ENOSPC) {
+		} else if (ret == -ENOSPC && !sctx->is_dev_replace) {
 			/*
 			 * btrfs_inc_block_group_ro return -ENOSPC when it
 			 * failed in creating new chunk for metadata.
-			 * It is not a problem for scrub/replace, because
+			 * It is not a problem for scrub, because
 			 * metadata are always cowed, and our scrub paused
 			 * commit_transactions.
 			 */
@@ -3596,9 +3606,22 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			btrfs_warn(fs_info,
 				   "failed setting block group ro: %d", ret);
 			btrfs_put_block_group(cache);
+			scrub_pause_off(fs_info);
 			break;
 		}
 
+		/*
+		 * Now the target block is marked RO, wait for nocow writes to
+		 * finish before dev-replace.
+		 * COW is fine, as COW never overwrites extents in commit tree.
+		 */
+		if (sctx->is_dev_replace) {
+			btrfs_wait_nocow_writers(cache);
+			btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start,
+					cache->length);
+		}
+
+		scrub_pause_off(fs_info);
 		down_write(&dev_replace->rwsem);
 		dev_replace->cursor_right = found_key.offset + length;
 		dev_replace->cursor_left = found_key.offset;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-01-25 12:10 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-23  7:37 [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace Qu Wenruo
2020-01-23 12:06 ` Filipe Manana
2020-01-23 12:28   ` Qu WenRuo
2020-01-23 13:39   ` Qu Wenruo
2020-01-23 13:49     ` Filipe Manana
2020-01-23 13:57       ` Qu Wenruo
2020-01-23 16:40 ` David Sterba
2020-01-23 23:58   ` Qu Wenruo
2020-01-23 23:58 Qu Wenruo
2020-01-24  9:24 ` Filipe Manana
2020-01-24 13:40   ` David Sterba
2020-01-24 14:44 ` David Sterba
2020-01-24 16:28   ` Filipe Manana
2020-01-25  0:36     ` Qu Wenruo
2020-01-25  0:48       ` Qu Wenruo
2020-01-25 11:35       ` David Sterba
2020-01-25 12:09         ` Qu WenRuo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.