* [PATCH] btrfs: scrub: avoid unnecessary extent tree search for simple stripes
@ 2023-01-09  6:11 Qu Wenruo
  2023-01-09  6:11 ` [PATCH] btrfs: fix resolving backrefs for inline extent followed by prealloc Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Qu Wenruo @ 2023-01-09  6:11 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When scrubbing an empty fs with RAID0, we will call scrub_simple_mirror()
again and again on ranges which have no extents at all.

This is especially obvious if we have both RAID0 and SINGLE.

 # mkfs.btrfs -f -m single -d raid0 $dev
 # mount $dev $mnt
 # xfs_io -f -c "pwrite 0 4k" $mnt/file
 # sync
 # btrfs scrub start -B $mnt

With extra call trace on scrub_simple_mirror(), we got the following
trace:

  256.028473: scrub_simple_mirror: logical=1048576 len=4194304 bg=1048576 bg_len=4194304
  256.028930: scrub_simple_mirror: logical=5242880 len=8388608 bg=5242880 bg_len=8388608
  256.029891: scrub_simple_mirror: logical=22020096 len=65536 bg=22020096 bg_len=1073741824
  256.029892: scrub_simple_mirror: logical=22085632 len=65536 bg=22020096 bg_len=1073741824
  256.029893: scrub_simple_mirror: logical=22151168 len=65536 bg=22020096 bg_len=1073741824
  ... 16K lines skipped ...
  256.048777: scrub_simple_mirror: logical=1095630848 len=65536 bg=22020096 bg_len=1073741824
  256.048778: scrub_simple_mirror: logical=1095696384 len=65536 bg=22020096 bg_len=1073741824

The first two lines show that we call scrub_simple_mirror() only once
each for the metadata and system chunks.

But the later 16K lines are all scrub_simple_mirror() calls for the
almost empty RAID0 data block group.

Most of those calls exit very quickly since there is no extent in that
data chunk.

[CAUSE]
For RAID0/RAID10 we go through scrub_simple_stripe() to handle the
scrub of the block group. And since inside each stripe it's just plain
SINGLE/RAID1, we reuse scrub_simple_mirror().

But there is a pitfall: inside scrub_simple_mirror() we do at least one
extent tree search to find the extent in the range.

Just like in the above case, we can have a huge gap with no extents in
it at all.
In that case, we do the extent tree search again and again, even though
we already know there is no more extent in the block group.

[FIX]
To fix the super inefficient extent tree search, introduce a
@found_next parameter for the following functions:

- find_first_extent_item()
- scrub_simple_mirror()

If find_first_extent_item() returns 1 and a @found_next pointer is
provided, it will store the bytenr of the next extent there (or
U64_MAX if we are at the end of the extent tree).

So for scrub_simple_stripe(), after scrubbing the current stripe and
increasing the logical bytenr, we check whether our next range has
reached @found_next.

If not, we increase @cur_logical by the increment until we reach
@found_next.

This way, even for an almost empty RAID0 block group, we just execute
"cur_logical += logical_increment;" 16K times (1G block group / 64K
stripe length), instead of doing 16K tree searches.
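
In condensed form, the resulting loop in scrub_simple_stripe() looks
like this (a sketch of the diff below, not the literal code):

  while (cur_logical < bg->start + bg->length) {
          u64 found_next = 0;

          /* On hitting a gap, scrub_simple_mirror() fills @found_next. */
          ret = scrub_simple_mirror(sctx, extent_root, csum_root, bg, map,
                                    cur_logical, map->stripe_len, device,
                                    cur_physical, mirror_num, &found_next);
          if (ret)
                  return ret;
          cur_logical += logical_increment;
          cur_physical += map->stripe_len;

          /* Skip stripes before @found_next without any tree search. */
          while (cur_logical + map->stripe_len < found_next &&
                 cur_logical < bg->start + bg->length) {
                  cur_logical += logical_increment;
                  cur_physical += map->stripe_len;
          }
  }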

With the optimization, the same trace looks like this now:

  1283.376212: scrub_simple_mirror: logical=1048576 len=4194304 bg=1048576 bg_len=4194304
  1283.376754: scrub_simple_mirror: logical=5242880 len=8388608 bg=5242880 bg_len=8388608
  1283.377623: scrub_simple_mirror: logical=22020096 len=65536 bg=22020096 bg_len=1073741824
  1283.377625: scrub_simple_mirror: logical=67108864 len=65536 bg=22020096 bg_len=1073741824
  1283.377627: scrub_simple_mirror: logical=67174400 len=65536 bg=22020096 bg_len=1073741824

Note the scrub at logical 67108864: that's because the 4K write only
lands there, not at the beginning of the data chunk (the superblock
reserved space splits the 1G chunk into two parts).

And the time spent on the chunk at 22020096 is much shorter
(18887us before vs 4us after).

Unfortunately this optimization only helps RAID0/RAID10 block groups
with big holes.

For real world cases it's much harder to find such huge gaps (although
we can still skip several stripes).
And even in the huge gap cases, the time saved is hardly observable
(less than 1 second even for an almost empty 10G block group).

Also unfortunately, for RAID5 data stripes we can not apply the same
optimization as for RAID0/RAID10, due to the extra rotation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 46 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 37 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 52b346795f66..c60cd4fd9355 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3066,7 +3066,8 @@ static int compare_extent_item_range(struct btrfs_path *path,
  */
 static int find_first_extent_item(struct btrfs_root *extent_root,
 				  struct btrfs_path *path,
-				  u64 search_start, u64 search_len)
+				  u64 search_start, u64 search_len,
+				  u64 *found_next)
 {
 	struct btrfs_fs_info *fs_info = extent_root->fs_info;
 	struct btrfs_key key;
@@ -3102,8 +3103,11 @@ static int find_first_extent_item(struct btrfs_root *extent_root,
 search_forward:
 	while (true) {
 		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-		if (key.objectid >= search_start + search_len)
+		if (key.objectid >= search_start + search_len) {
+			if (found_next)
+				*found_next = key.objectid;
 			break;
+		}
 		if (key.type != BTRFS_METADATA_ITEM_KEY &&
 		    key.type != BTRFS_EXTENT_ITEM_KEY)
 			goto next;
@@ -3111,8 +3115,11 @@ static int find_first_extent_item(struct btrfs_root *extent_root,
 		ret = compare_extent_item_range(path, search_start, search_len);
 		if (ret == 0)
 			return ret;
-		if (ret > 0)
+		if (ret > 0) {
+			if (found_next)
+				*found_next = key.objectid;
 			break;
+		}
 next:
 		path->slots[0]++;
 		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
@@ -3120,6 +3127,13 @@ static int find_first_extent_item(struct btrfs_root *extent_root,
 			if (ret) {
 				/* Either no more item or fatal error */
 				btrfs_release_path(path);
+
+				/*
+				 * No more extent tree items, set *found_next
+				 * directly to U64_MAX.
+				 */
+				if (ret > 0 && found_next)
+					*found_next = U64_MAX;
 				return ret;
 			}
 		}
@@ -3186,7 +3200,8 @@ static int scrub_raid56_data_stripe_for_parity(struct scrub_ctx *sctx,
 		u64 extent_mirror_num;
 
 		ret = find_first_extent_item(extent_root, path, cur_logical,
-					     logical + map->stripe_len - cur_logical);
+					     logical + map->stripe_len - cur_logical,
+					     NULL);
 		/* No more extent item in this data stripe */
 		if (ret > 0) {
 			ret = 0;
@@ -3385,7 +3400,8 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
 			       struct map_lookup *map,
 			       u64 logical_start, u64 logical_length,
 			       struct btrfs_device *device,
-			       u64 physical, int mirror_num)
+			       u64 physical, int mirror_num,
+			       u64 *found_next)
 {
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	const u64 logical_end = logical_start + logical_length;
@@ -3437,7 +3453,8 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
 		spin_unlock(&bg->lock);
 
 		ret = find_first_extent_item(extent_root, &path, cur_logical,
-					     logical_end - cur_logical);
+					     logical_end - cur_logical,
+					     found_next);
 		if (ret > 0) {
 			/* No more extent, just update the accounting */
 			sctx->stat.last_physical = physical + logical_length;
@@ -3552,6 +3569,7 @@ static int scrub_simple_stripe(struct scrub_ctx *sctx,
 	int ret = 0;
 
 	while (cur_logical < bg->start + bg->length) {
+		u64 found_next = 0;
 		/*
 		 * Inside each stripe, RAID0 is just SINGLE, and RAID10 is
 		 * just RAID1, so we can reuse scrub_simple_mirror() to scrub
@@ -3559,13 +3577,23 @@ static int scrub_simple_stripe(struct scrub_ctx *sctx,
 		 */
 		ret = scrub_simple_mirror(sctx, extent_root, csum_root, bg, map,
 					  cur_logical, map->stripe_len, device,
-					  cur_physical, mirror_num);
+					  cur_physical, mirror_num, &found_next);
 		if (ret)
 			return ret;
 		/* Skip to next stripe which belongs to the target device */
 		cur_logical += logical_increment;
 		/* For physical offset, we just go to next stripe */
 		cur_physical += map->stripe_len;
+
+		/*
+		 * If the next extent is still beyond our current range, we
+		 * can skip stripes until we reach @found_next.
+		 */
+		while (cur_logical + map->stripe_len < found_next &&
+		       cur_logical < bg->start + bg->length) {
+			cur_logical += logical_increment;
+			cur_physical += map->stripe_len;
+		}
 	}
 	return ret;
 }
@@ -3652,7 +3680,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		ret = scrub_simple_mirror(sctx, root, csum_root, bg, map,
 				bg->start, bg->length, scrub_dev,
 				map->stripes[stripe_index].physical,
-				stripe_index + 1);
+				stripe_index + 1, NULL);
 		offset = 0;
 		goto out;
 	}
@@ -3706,7 +3734,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		 */
 		ret = scrub_simple_mirror(sctx, root, csum_root, bg, map,
 					  logical, map->stripe_len,
-					  scrub_dev, physical, 1);
+					  scrub_dev, physical, 1, NULL);
 		if (ret < 0)
 			goto out;
 next:
-- 
2.39.0



* [PATCH] btrfs: fix resolving backrefs for inline extent followed by prealloc
  2023-01-09  6:11 [PATCH] btrfs: scrub: avoid unnecessary extent tree search for simple stripes Qu Wenruo
@ 2023-01-09  6:11 ` Qu Wenruo
  2023-01-09  6:13   ` Qu Wenruo
  2023-01-27 20:56 ` [PATCH] btrfs: scrub: avoid unnecessary extent tree search for simple stripes Boris Burkov
  2023-03-01 20:13 ` David Sterba
  2 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2023-01-09  6:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Boris Burkov, stable, David Sterba

From: Boris Burkov <boris@bur.io>

If a file consists of an inline extent followed by a regular or prealloc
extent, then a legitimate attempt to resolve a logical address in the
non-inline region will result in add_all_parents reading the invalid
offset field of the inline extent. If the inline extent item is placed
in the leaf eb such that it is the first item, attempting to access the
offset field is not only meaningless, it reads past the end of the eb
and causes this panic:

  [17.626048] BTRFS warning (device dm-2): bad eb member end: ptr 0x3fd4 start 30834688 member offset 16377 size 8
  [17.631693] general protection fault, probably for non-canonical address 0x5088000000000: 0000 [#1] SMP PTI
  [17.635041] CPU: 2 PID: 1267 Comm: btrfs Not tainted 5.12.0-07246-g75175d5adc74-dirty #199
  [17.637969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [17.641995] RIP: 0010:btrfs_get_64+0xe7/0x110
  [17.649890] RSP: 0018:ffffc90001f73a08 EFLAGS: 00010202
  [17.651652] RAX: 0000000000000001 RBX: ffff88810c42d000 RCX: 0000000000000000
  [17.653921] RDX: 0005088000000000 RSI: ffffc90001f73a0f RDI: 0000000000000001
  [17.656174] RBP: 0000000000000ff9 R08: 0000000000000007 R09: c0000000fffeffff
  [17.658441] R10: ffffc90001f73790 R11: ffffc90001f73788 R12: ffff888106afe918
  [17.661070] R13: 0000000000003fd4 R14: 0000000000003f6f R15: cdcdcdcdcdcdcdcd
  [17.663617] FS:  00007f64e7627d80(0000) GS:ffff888237c80000(0000) knlGS:0000000000000000
  [17.666525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [17.668664] CR2: 000055d4a39152e8 CR3: 000000010c596002 CR4: 0000000000770ee0
  [17.671253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [17.673634] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [17.676034] PKRU: 55555554
  [17.677004] Call Trace:
  [17.677877]  add_all_parents+0x276/0x480
  [17.679325]  find_parent_nodes+0xfae/0x1590
  [17.680771]  btrfs_find_all_leafs+0x5e/0xa0
  [17.682217]  iterate_extent_inodes+0xce/0x260
  [17.683809]  ? btrfs_inode_flags_to_xflags+0x50/0x50
  [17.685597]  ? iterate_inodes_from_logical+0xa1/0xd0
  [17.687404]  iterate_inodes_from_logical+0xa1/0xd0
  [17.689121]  ? btrfs_inode_flags_to_xflags+0x50/0x50
  [17.691010]  btrfs_ioctl_logical_to_ino+0x131/0x190
  [17.692946]  btrfs_ioctl+0x104a/0x2f60
  [17.694384]  ? selinux_file_ioctl+0x182/0x220
  [17.695995]  ? __x64_sys_ioctl+0x84/0xc0
  [17.697394]  __x64_sys_ioctl+0x84/0xc0
  [17.698697]  do_syscall_64+0x33/0x40
  [17.700017]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [17.701753] RIP: 0033:0x7f64e72761b7
  [17.709355] RSP: 002b:00007ffefb067f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [17.712088] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f64e72761b7
  [17.714667] RDX: 00007ffefb067fb0 RSI: 00000000c0389424 RDI: 0000000000000003
  [17.717386] RBP: 00007ffefb06d188 R08: 000055d4a390d2b0 R09: 00007f64e7340a60
  [17.719938] R10: 0000000000000231 R11: 0000000000000246 R12: 0000000000000001
  [17.722383] R13: 0000000000000000 R14: 00000000c0389424 R15: 000055d4a38fd2a0
  [17.724839] Modules linked in:

Fix the bug by detecting the inline extent item in add_all_parents and
skipping to the next extent item.
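
For reference on why reading the offset is invalid: in the on-disk
format, the disk_bytenr/disk_num_bytes/offset/num_bytes fields only
exist for regular/prealloc extents; for an inline extent the file data
starts at that position instead, so "reading the offset" interprets
inline data, or memory past the end of the item, as the field. Layout
sketch (btrfs_file_extent_item, trimmed for illustration):

  struct btrfs_file_extent_item {
          __le64 generation;
          __le64 ram_bytes;
          u8 compression;
          u8 encryption;
          __le16 other_encoding;
          u8 type;        /* inline/regular/prealloc */
          /*
           * The fields below only exist for regular/prealloc extents;
           * for an inline extent the inline file data starts here.
           */
          __le64 disk_bytenr;
          __le64 disk_num_bytes;
          __le64 offset;
          __le64 num_bytes;
  } __attribute__ ((__packed__));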

CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/backref.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 21c92c74bf71..46851511b661 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -484,6 +484,7 @@ static int add_all_parents(struct btrfs_backref_walk_ctx *ctx,
 	u64 wanted_disk_byte = ref->wanted_disk_byte;
 	u64 count = 0;
 	u64 data_offset;
+	u8 type;
 
 	if (level != 0) {
 		eb = path->nodes[level];
@@ -538,6 +539,9 @@ static int add_all_parents(struct btrfs_backref_walk_ctx *ctx,
 			continue;
 		}
 		fi = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
+		type = btrfs_file_extent_type(eb, fi);
+		if (type == BTRFS_FILE_EXTENT_INLINE)
+			goto next;
 		disk_byte = btrfs_file_extent_disk_bytenr(eb, fi);
 		data_offset = btrfs_file_extent_offset(eb, fi);
 
-- 
2.39.0



* Re: [PATCH] btrfs: fix resolving backrefs for inline extent followed by prealloc
  2023-01-09  6:11 ` [PATCH] btrfs: fix resolving backrefs for inline extent followed by prealloc Qu Wenruo
@ 2023-01-09  6:13   ` Qu Wenruo
  0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2023-01-09  6:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Boris Burkov, stable, David Sterba

Sorry, this is an internal backport.

Please ignore it.

Thanks,
Qu

On 2023/1/9 14:11, Qu Wenruo wrote:
> [...]


* Re: [PATCH] btrfs: scrub: avoid unnecessary extent tree search for simple stripes
  2023-01-09  6:11 [PATCH] btrfs: scrub: avoid unnecessary extent tree search for simple stripes Qu Wenruo
  2023-01-09  6:11 ` [PATCH] btrfs: fix resolving backrefs for inline extent followed by prealloc Qu Wenruo
@ 2023-01-27 20:56 ` Boris Burkov
  2023-01-28  0:11   ` Qu Wenruo
  2023-03-01 20:13 ` David Sterba
  2 siblings, 1 reply; 6+ messages in thread
From: Boris Burkov @ 2023-01-27 20:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jan 09, 2023 at 02:11:15PM +0800, Qu Wenruo wrote:
> [...]

Nice! The optimization makes sense, and LGTM.

> [...]
> @@ -3066,7 +3066,8 @@ static int compare_extent_item_range(struct btrfs_path *path,
>   */
>  static int find_first_extent_item(struct btrfs_root *extent_root,
>  				  struct btrfs_path *path,
> -				  u64 search_start, u64 search_len)
> +				  u64 search_start, u64 search_len,
> +				  u64 *found_next)

I think at the very least, it would be nice to document the found_next
parameter in the function documentation.

Going one step further, I think the semantics could probably be
streamlined as well. I'm thinking something along the lines of always
using the path parameter to return the extent, and then the caller can
decide whether to grab the "found_next" from that before releasing the
path.
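
E.g. roughly (an untested sketch, assuming find_first_extent_item() is
changed to leave the path pointing at the next item, instead of
releasing it, when nothing is found in the range):

  ret = find_first_extent_item(extent_root, &path, cur_logical,
                               logical_end - cur_logical);
  if (ret > 0) {
          /* No extent in range, peek at the next item's bytenr. */
          if (path.nodes[0]) {
                  btrfs_item_key_to_cpu(path.nodes[0], &key,
                                        path.slots[0]);
                  found_next = key.objectid;
          } else {
                  found_next = U64_MAX;
          }
          btrfs_release_path(&path);
  }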

I don't see much harm in always filling in the "next" return, even if
RAID5 wants to ignore it.

Thanks,
Boris


* Re: [PATCH] btrfs: scrub: avoid unnecessary extent tree search for simple stripes
  2023-01-27 20:56 ` [PATCH] btrfs: scrub: avoid unnecessary extent tree search for simple stripes Boris Burkov
@ 2023-01-28  0:11   ` Qu Wenruo
  0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2023-01-28  0:11 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs



On 2023/1/28 04:56, Boris Burkov wrote:
> On Mon, Jan 09, 2023 at 02:11:15PM +0800, Qu Wenruo wrote:
>> [...]
>> @@ -3066,7 +3066,8 @@ static int compare_extent_item_range(struct btrfs_path *path,
>>    */
>>   static int find_first_extent_item(struct btrfs_root *extent_root,
>>   				  struct btrfs_path *path,
>> -				  u64 search_start, u64 search_len)
>> +				  u64 search_start, u64 search_len,
>> +				  u64 *found_next)
> 
> I think at the very least, it would be nice to document the found_next
> parameter in the function documentation.
> 
> Going one step further, I think the semantics could probably be
> streamlined as well. I'm thinking something along the lines of always
> using the path parameter to return the extent, and then the caller can
> decide whether to grab the "found_next" from that before releasing the
> path.

That sounds pretty reasonable.

Let me explore this idea more; one less argument is always a good thing.

Thanks,
Qu
> 
> I don't see much harm in always filling in the "next" return, even if
> RAID5 wants to ignore it.
> 
> Thanks,
> Boris
> 


* Re: [PATCH] btrfs: scrub: avoid unnecessary extent tree search for simple stripes
  2023-01-09  6:11 [PATCH] btrfs: scrub: avoid unnecessary extent tree search for simple stripes Qu Wenruo
  2023-01-09  6:11 ` [PATCH] btrfs: fix resolving backrefs for inline extent followed by prealloc Qu Wenruo
  2023-01-27 20:56 ` [PATCH] btrfs: scrub: avoid unnecessary extent tree search for simple stripes Boris Burkov
@ 2023-03-01 20:13 ` David Sterba
  2 siblings, 0 replies; 6+ messages in thread
From: David Sterba @ 2023-03-01 20:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jan 09, 2023 at 02:11:15PM +0800, Qu Wenruo wrote:
> [...]

This patch does not apply cleanly and I'm not sure if you were planning
a follow-up or not. If yes, please resend, thanks.

