* [PATCH 0/5] btrfs: scrub: improve the scrub performance
@ 2023-07-28 11:14 Qu Wenruo
  2023-07-28 11:14 ` [PATCH 1/5] btrfs: scrub: avoid unnecessary extent tree search preparing stripes Qu Wenruo
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-07-28 11:14 UTC (permalink / raw)
  To: linux-btrfs

[REPO]
https://github.com/adam900710/linux/tree/scrub_testing

[CHANGELOG]
v1:
- Rebased to latest misc-next

- Rework the read IO grouping patch
  David found some crashes, mostly related to the scrub performance
  fixes; meanwhile the original grouping patch had one extra flag,
  SCRUB_FLAG_READ_SUBMITTED, to avoid double submission.

  But this flag is unnecessary, as we can easily avoid double submission
  just by properly checking the sctx->nr_stripe variable.

  This reworked read IO grouping patch should be safer than the initial
  version, with a better code structure.

  Unfortunately, the final performance is worse than the initial version
  (2.2GiB/s vs 2.5GiB/s), but it should be less racy thus safer.

- Re-order the patches
  The first 3 patches are the main fixes, and I put the safer patches
  first, so even if David still finds a crash at a certain patch, the
  remaining ones can be dropped if needed.

There is a huge scrub performance drop introduced by the v6.4 kernel:
the scrub performance is only around 1/3 of the previous speed for
large data extents.

There are several causes:

- Missing blk plug
  This means read requests won't be merged by the block layer, which can
  hugely reduce the read performance.  (See the sketch after this list.)

- Extra time spent on extent/csum tree searches
  This includes extra path allocation/freeing and tree searches.
  It is especially obvious for large data extents, as previously we
  only did one csum search per 512K, but now we do one csum search per
  64K, an 8 times increase in csum tree searches.

- Less concurrency
  Mostly due to the fact that we're doing submit-and-wait, which gives a
  much lower queue depth, hurting devices like NVMe that benefit a lot
  from high concurrency.
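
As a reference, here is a minimal sketch of the plugging pattern that
patch 3 re-introduces.  It is illustrative only: struct scrub_stripe is
the one from scrub.c, and submit_one_stripe_read() is a hypothetical
stand-in for the real scrub_submit_initial_read() call.

  #include <linux/blkdev.h>

  struct scrub_stripe;
  static void submit_one_stripe_read(struct scrub_stripe *stripe);

  static void submit_group_reads(struct scrub_stripe **stripes, int nr_stripes)
  {
          struct blk_plug plug;

          /* Reads queued between plug/unplug can be merged by the block layer. */
          blk_start_plug(&plug);
          for (int i = 0; i < nr_stripes; i++)
                  submit_one_stripe_read(stripes[i]);
          blk_finish_plug(&plug);
  }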

The first 3 patches greatly improve the scrub read performance, but
unfortunately it's still not as fast as the pre-6.4 kernels
(2.2GiB/s vs 3.0GiB/s).  It is still much better than the 6.4 kernels
(2.2GiB/s vs 1.0GiB/s).

Qu Wenruo (5):
  btrfs: scrub: avoid unnecessary extent tree search preparing stripes
  btrfs: scrub: avoid unnecessary csum tree search preparing stripes
  btrfs: scrub: fix grouping of read IO
  btrfs: scrub: don't go ordered workqueue for dev-replace
  btrfs: scrub: move write back of repaired sectors into
    scrub_stripe_read_repair_worker()

 fs/btrfs/file-item.c |  33 +++---
 fs/btrfs/file-item.h |   6 +-
 fs/btrfs/raid56.c    |   4 +-
 fs/btrfs/scrub.c     | 234 ++++++++++++++++++++++++++-----------------
 4 files changed, 169 insertions(+), 108 deletions(-)

-- 
2.41.0



* [PATCH 1/5] btrfs: scrub: avoid unnecessary extent tree search preparing stripes
  2023-07-28 11:14 [PATCH 0/5] btrfs: scrub: improve the scrub performance Qu Wenruo
@ 2023-07-28 11:14 ` Qu Wenruo
  2023-07-28 11:14 ` [PATCH 2/5] btrfs: scrub: avoid unnecessary csum " Qu Wenruo
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-07-28 11:14 UTC (permalink / raw)
  To: linux-btrfs

Since commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer re-uses the same path
for extent tree searches.

This can lead to unnecessary extent tree searches, especially for the new
stripe-based scrub, as we have way more stripes to prepare.

This patch re-introduces a shared path for the extent tree search, and
properly releases it when the block group is scrubbed.

This change alone can improve scrub performance slightly by reducing the
time spent preparing the stripes, thus improving the queue depth.

Before (with regression):

 Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
 nvme0n1p3 15578.00  993616.00     5.00   0.03    0.09    63.78   1.32 100.00

After (with this patch):

 nvme0n1p3 15875.00 1013328.00    12.00   0.08    0.08    63.83   1.35 100.00

Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 41 +++++++++++++++++++++++++++++------------
 1 file changed, 29 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index db076e12f442..d10afe94096f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -175,6 +175,7 @@ struct scrub_ctx {
 	struct scrub_stripe	stripes[SCRUB_STRIPES_PER_SCTX];
 	struct scrub_stripe	*raid56_data_stripes;
 	struct btrfs_fs_info	*fs_info;
+	struct btrfs_path	extent_path;
 	int			first_free;
 	int			cur_stripe;
 	atomic_t		cancel_req;
@@ -339,6 +340,8 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
 	refcount_set(&sctx->refs, 1);
 	sctx->is_dev_replace = is_dev_replace;
 	sctx->fs_info = fs_info;
+	sctx->extent_path.search_commit_root = 1;
+	sctx->extent_path.skip_locking = 1;
 	for (i = 0; i < SCRUB_STRIPES_PER_SCTX; i++) {
 		int ret;
 
@@ -1466,6 +1469,7 @@ static void scrub_stripe_reset_bitmaps(struct scrub_stripe *stripe)
  * Return <0 for error.
  */
 static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
+					struct btrfs_path *extent_path,
 					struct btrfs_device *dev, u64 physical,
 					int mirror_num, u64 logical_start,
 					u32 logical_len,
@@ -1475,7 +1479,6 @@ static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
 	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, bg->start);
 	struct btrfs_root *csum_root = btrfs_csum_root(fs_info, bg->start);
 	const u64 logical_end = logical_start + logical_len;
-	struct btrfs_path path = { 0 };
 	u64 cur_logical = logical_start;
 	u64 stripe_end;
 	u64 extent_start;
@@ -1491,14 +1494,13 @@ static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
 	/* The range must be inside the bg. */
 	ASSERT(logical_start >= bg->start && logical_end <= bg->start + bg->length);
 
-	path.search_commit_root = 1;
-	path.skip_locking = 1;
-
-	ret = find_first_extent_item(extent_root, &path, logical_start, logical_len);
+	ret = find_first_extent_item(extent_root, extent_path, logical_start,
+				     logical_len);
 	/* Either error or not found. */
 	if (ret)
 		goto out;
-	get_extent_info(&path, &extent_start, &extent_len, &extent_flags, &extent_gen);
+	get_extent_info(extent_path, &extent_start, &extent_len, &extent_flags,
+			&extent_gen);
 	if (extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)
 		stripe->nr_meta_extents++;
 	if (extent_flags & BTRFS_EXTENT_FLAG_DATA)
@@ -1526,7 +1528,7 @@ static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
 
 	/* Fill the extent info for the remaining sectors. */
 	while (cur_logical <= stripe_end) {
-		ret = find_first_extent_item(extent_root, &path, cur_logical,
+		ret = find_first_extent_item(extent_root, extent_path, cur_logical,
 					     stripe_end - cur_logical + 1);
 		if (ret < 0)
 			goto out;
@@ -1534,7 +1536,7 @@ static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
 			ret = 0;
 			break;
 		}
-		get_extent_info(&path, &extent_start, &extent_len,
+		get_extent_info(extent_path, &extent_start, &extent_len,
 				&extent_flags, &extent_gen);
 		if (extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)
 			stripe->nr_meta_extents++;
@@ -1574,7 +1576,6 @@ static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
 	}
 	set_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state);
 out:
-	btrfs_release_path(&path);
 	return ret;
 }
 
@@ -1764,8 +1765,9 @@ static int queue_scrub_stripe(struct scrub_ctx *sctx, struct btrfs_block_group *
 
 	/* We can queue one stripe using the remaining slot. */
 	scrub_reset_stripe(stripe);
-	ret = scrub_find_fill_first_stripe(bg, dev, physical, mirror_num,
-					   logical, length, stripe);
+	ret = scrub_find_fill_first_stripe(bg, &sctx->extent_path, dev,
+					   physical, mirror_num, logical,
+					   length, stripe);
 	/* Either >0 as no more extents or <0 for error. */
 	if (ret)
 		return ret;
@@ -1783,6 +1785,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	struct btrfs_raid_bio *rbio;
 	struct btrfs_io_context *bioc = NULL;
+	struct btrfs_path extent_path = { 0 };
 	struct bio *bio;
 	struct scrub_stripe *stripe;
 	bool all_empty = true;
@@ -1793,6 +1796,14 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 
 	ASSERT(sctx->raid56_data_stripes);
 
+	/*
+	 * For data stripe search, we can not re-use the same extent path, as
+	 * the data stripe bytenr may be smaller than previous extent.
+	 * Thus we have to use our own extent path.
+	 */
+	extent_path.search_commit_root = 1;
+	extent_path.skip_locking = 1;
+
 	for (int i = 0; i < data_stripes; i++) {
 		int stripe_index;
 		int rot;
@@ -1807,7 +1818,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 
 		scrub_reset_stripe(stripe);
 		set_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state);
-		ret = scrub_find_fill_first_stripe(bg,
+		ret = scrub_find_fill_first_stripe(bg, &extent_path,
 				map->stripes[stripe_index].dev, physical, 1,
 				full_stripe_start + btrfs_stripe_nr_to_offset(i),
 				BTRFS_STRIPE_LEN, stripe);
@@ -1935,6 +1946,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 	bio_put(bio);
 	btrfs_bio_counter_dec(fs_info);
 
+	btrfs_release_path(&extent_path);
 out:
 	return ret;
 }
@@ -2102,6 +2114,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	u64 stripe_logical;
 	int stop_loop = 0;
 
+	/* Extent_path should be released by now. */
+	ASSERT(sctx->extent_path.nodes[0] == NULL);
+
 	scrub_blocked_if_needed(fs_info);
 
 	if (sctx->is_dev_replace &&
@@ -2220,6 +2235,8 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	ret2 = flush_scrub_stripes(sctx);
 	if (!ret)
 		ret = ret2;
+	btrfs_release_path(&sctx->extent_path);
+
 	if (sctx->raid56_data_stripes) {
 		for (int i = 0; i < nr_data_stripes(map); i++)
 			release_scrub_stripe(&sctx->raid56_data_stripes[i]);
-- 
2.41.0



* [PATCH 2/5] btrfs: scrub: avoid unnecessary csum tree search preparing stripes
  2023-07-28 11:14 [PATCH 0/5] btrfs: scrub: improve the scrub performance Qu Wenruo
  2023-07-28 11:14 ` [PATCH 1/5] btrfs: scrub: avoid unnecessary extent tree search preparing stripes Qu Wenruo
@ 2023-07-28 11:14 ` Qu Wenruo
  2023-07-28 11:14 ` [PATCH 3/5] btrfs: scrub: fix grouping of read IO Qu Wenruo
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-07-28 11:14 UTC (permalink / raw)
  To: linux-btrfs

One of the bottlenecks of the new scrub code is the extra csum tree
searches.

The old code would only do a csum tree search for each scrub bio, which
can be as large as 512KiB, thus it could afford to allocate a new path
each time.

But the new scrub code is doing a csum tree search for each stripe, which
is only 64KiB; this means we'd better re-use the same csum path between
searches.

This patch introduces a per-sctx path for the csum tree search, so we
don't need to re-allocate the path every time we need to do a csum tree
search.

With this change we can further improve the queue depth and the scrub
read performance:

Before (with regression and cached extent tree path):

 Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
 nvme0n1p3 15875.00 1013328.00    12.00   0.08    0.08    63.83   1.35 100.00

After (with both cached extent/csum tree path):

 nvme0n1p3 17759.00 1133280.00    10.00   0.06    0.08    63.81   1.50 100.00

Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file-item.c | 33 +++++++++++++++++++++------------
 fs/btrfs/file-item.h |  6 +++---
 fs/btrfs/raid56.c    |  4 ++--
 fs/btrfs/scrub.c     | 29 +++++++++++++++++++----------
 4 files changed, 45 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 696bf695d8eb..21ee77d26c23 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -597,29 +597,36 @@ int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
  * Each bit represents a sector. Thus caller should ensure @csum_buf passed
  * in is large enough to contain all csums.
  */
-int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end,
-			      u8 *csum_buf, unsigned long *csum_bitmap,
-			      bool search_commit)
+int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
+			      u64 start, u64 end, u8 *csum_buf,
+			      unsigned long *csum_bitmap)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_key key;
-	struct btrfs_path *path;
 	struct extent_buffer *leaf;
 	struct btrfs_csum_item *item;
 	const u64 orig_start = start;
+	bool free_path = false;
 	int ret;
 
 	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
 	       IS_ALIGNED(end + 1, fs_info->sectorsize));
 
-	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
+	if (!path) {
+		path = btrfs_alloc_path();
+		if (!path)
+			return -ENOMEM;
+		free_path = true;
+	}
 
-	if (search_commit) {
-		path->skip_locking = 1;
-		path->reada = READA_FORWARD;
-		path->search_commit_root = 1;
+	/* Check if we can reuse the previous path. */
+	if (path->nodes[0]) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
+		    key.type == BTRFS_EXTENT_CSUM_KEY && key.offset <= start)
+			goto search_forward;
+		btrfs_release_path(path);
 	}
 
 	key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
@@ -656,6 +663,7 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end,
 		}
 	}
 
+search_forward:
 	while (start <= end) {
 		u64 csum_end;
 
@@ -712,7 +720,8 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end,
 	}
 	ret = 0;
 fail:
-	btrfs_free_path(path);
+	if (free_path)
+		btrfs_free_path(path);
 	return ret;
 }
 
diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
index 4ec669b69008..04bd2d34efb1 100644
--- a/fs/btrfs/file-item.h
+++ b/fs/btrfs/file-item.h
@@ -57,9 +57,9 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
 			    struct list_head *list, int search_commit,
 			    bool nowait);
-int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end,
-			      u8 *csum_buf, unsigned long *csum_bitmap,
-			      bool search_commit);
+int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
+			      u64 start, u64 end, u8 *csum_buf,
+			      unsigned long *csum_bitmap);
 void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 				     const struct btrfs_path *path,
 				     struct btrfs_file_extent_item *fi,
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index f97fc62fbab2..3e014b9370a3 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2105,8 +2105,8 @@ static void fill_data_csums(struct btrfs_raid_bio *rbio)
 		goto error;
 	}
 
-	ret = btrfs_lookup_csums_bitmap(csum_root, start, start + len - 1,
-					rbio->csum_buf, rbio->csum_bitmap, false);
+	ret = btrfs_lookup_csums_bitmap(csum_root, NULL, start, start + len - 1,
+					rbio->csum_buf, rbio->csum_bitmap);
 	if (ret < 0)
 		goto error;
 	if (bitmap_empty(rbio->csum_bitmap, len >> fs_info->sectorsize_bits))
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index d10afe94096f..1748735f72c6 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -176,6 +176,7 @@ struct scrub_ctx {
 	struct scrub_stripe	*raid56_data_stripes;
 	struct btrfs_fs_info	*fs_info;
 	struct btrfs_path	extent_path;
+	struct btrfs_path	csum_path;
 	int			first_free;
 	int			cur_stripe;
 	atomic_t		cancel_req;
@@ -342,6 +343,8 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
 	sctx->fs_info = fs_info;
 	sctx->extent_path.search_commit_root = 1;
 	sctx->extent_path.skip_locking = 1;
+	sctx->csum_path.search_commit_root = 1;
+	sctx->csum_path.skip_locking = 1;
 	for (i = 0; i < SCRUB_STRIPES_PER_SCTX; i++) {
 		int ret;
 
@@ -1470,6 +1473,7 @@ static void scrub_stripe_reset_bitmaps(struct scrub_stripe *stripe)
  */
 static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
 					struct btrfs_path *extent_path,
+					struct btrfs_path *csum_path,
 					struct btrfs_device *dev, u64 physical,
 					int mirror_num, u64 logical_start,
 					u32 logical_len,
@@ -1561,9 +1565,9 @@ static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
 		 */
 		ASSERT(BITS_PER_LONG >= BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits);
 
-		ret = btrfs_lookup_csums_bitmap(csum_root, stripe->logical,
-						stripe_end, stripe->csums,
-						&csum_bitmap, true);
+		ret = btrfs_lookup_csums_bitmap(csum_root, csum_path,
+						stripe->logical, stripe_end,
+						stripe->csums, &csum_bitmap);
 		if (ret < 0)
 			goto out;
 		if (ret > 0)
@@ -1765,9 +1769,9 @@ static int queue_scrub_stripe(struct scrub_ctx *sctx, struct btrfs_block_group *
 
 	/* We can queue one stripe using the remaining slot. */
 	scrub_reset_stripe(stripe);
-	ret = scrub_find_fill_first_stripe(bg, &sctx->extent_path, dev,
-					   physical, mirror_num, logical,
-					   length, stripe);
+	ret = scrub_find_fill_first_stripe(bg, &sctx->extent_path,
+			&sctx->csum_path, dev, physical, mirror_num, logical,
+			length, stripe);
 	/* Either >0 as no more extents or <0 for error. */
 	if (ret)
 		return ret;
@@ -1786,6 +1790,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 	struct btrfs_raid_bio *rbio;
 	struct btrfs_io_context *bioc = NULL;
 	struct btrfs_path extent_path = { 0 };
+	struct btrfs_path csum_path = { 0 };
 	struct bio *bio;
 	struct scrub_stripe *stripe;
 	bool all_empty = true;
@@ -1797,12 +1802,14 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 	ASSERT(sctx->raid56_data_stripes);
 
 	/*
-	 * For data stripe search, we can not re-use the same extent path, as
-	 * the data stripe bytenr may be smaller than previous extent.
-	 * Thus we have to use our own extent path.
+	 * For data stripe search, we can not re-use the same extent/csum paths,
+	 * as the data stripe bytenr may be smaller than previous extent.
+	 * Thus we have to use our own extent/csum paths.
 	 */
 	extent_path.search_commit_root = 1;
 	extent_path.skip_locking = 1;
+	csum_path.search_commit_root = 1;
+	csum_path.skip_locking = 1;
 
 	for (int i = 0; i < data_stripes; i++) {
 		int stripe_index;
@@ -1818,7 +1825,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 
 		scrub_reset_stripe(stripe);
 		set_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state);
-		ret = scrub_find_fill_first_stripe(bg, &extent_path,
+		ret = scrub_find_fill_first_stripe(bg, &extent_path, &csum_path,
 				map->stripes[stripe_index].dev, physical, 1,
 				full_stripe_start + btrfs_stripe_nr_to_offset(i),
 				BTRFS_STRIPE_LEN, stripe);
@@ -1947,6 +1954,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 	btrfs_bio_counter_dec(fs_info);
 
 	btrfs_release_path(&extent_path);
+	btrfs_release_path(&csum_path);
 out:
 	return ret;
 }
@@ -2236,6 +2244,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	if (!ret)
 		ret = ret2;
 	btrfs_release_path(&sctx->extent_path);
+	btrfs_release_path(&sctx->csum_path);
 
 	if (sctx->raid56_data_stripes) {
 		for (int i = 0; i < nr_data_stripes(map); i++)
-- 
2.41.0



* [PATCH 3/5] btrfs: scrub: fix grouping of read IO
  2023-07-28 11:14 [PATCH 0/5] btrfs: scrub: improve the scrub performance Qu Wenruo
  2023-07-28 11:14 ` [PATCH 1/5] btrfs: scrub: avoid unnecessary extent tree search preparing stripes Qu Wenruo
  2023-07-28 11:14 ` [PATCH 2/5] btrfs: scrub: avoid unnecessary csum " Qu Wenruo
@ 2023-07-28 11:14 ` Qu Wenruo
  2023-07-28 11:14 ` [PATCH 4/5] btrfs: scrub: don't go ordered workqueue for dev-replace Qu Wenruo
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-07-28 11:14 UTC (permalink / raw)
  To: linux-btrfs

[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.

On a PCIe 3.0 device, the old v6.3 kernel can reach a 3GB/s scrub speed,
but v6.4 can only reach 1GB/s, an obvious 66% performance drop.

[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:

 Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
 nvme0n1p3  9731.00 3425544.00 17237.00  63.92    2.18   352.02  21.18 100.00
 nvme0n1p3 15578.00  993616.00     5.00   0.03    0.09    63.78   1.32 100.00

The upper one is v6.3 while the lower one is v6.4.

There are several obvious differences:

- Very few read merges
  This turns out to be a behavior change: we no longer do bio
  plug/unplug.

- Very low aqu-sz
  This is due to the submit-and-wait behavior of flush_scrub_stripes(),
  and the extra extent/csum tree searches.

Neither behavior is that obvious on SATA SSDs, as SATA SSDs have NCQ to
merge the reads, while they can not handle a high queue depth that well
either.

[FIX]
For now this patch focuses on the read speed fix.  Dev-replace speed
needs more work.

For the read part, we go in two directions to fix the problems:

- Re-introduce blk plug/unplug to merge read requests
  This is pretty simple, and the behavior is pretty easy to observe.

  This would enlarge the average read request size to 512K.

- Introduce multi-group reads and no longer wait for each group
  Instead of the old behavior, which submits 8 stripes and waits for
  them, here we enlarge the total number of stripes to 16 * 8.
  That is 8MiB per device, the same as the old scrub in-flight bio
  size limit.

  Now every time we fill a group (8 stripes), we submit them and
  continue to the next stripes.

  Only when the full 16 * 8 stripes are all filled do we submit the
  remaining ones (the last group), and wait for all groups to finish.
  Then we submit the repair writes and dev-replace writes.

  This should enlarge the queue depth.  (See the condensed sketch at the
  end of this message.)

This would greatly improve the merge rate (thus read block size) and
queue depth:

Before (with regression, and cached extent/csum path):

 Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
 nvme0n1p3 20666.00 1318240.00    10.00   0.05    0.08    63.79   1.63 100.00

After (with all patches applied):

 nvme0n1p3  5165.00 2278304.00 30557.00  85.54    0.55   441.10   2.81 100.00
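
As a reading aid, here is a condensed sketch of the queueing logic the
hunks below implement.  Names follow the patch, but error handling and
most details are elided, so treat it as illustrative only:

  /* Called for each stripe found inside the block group. */
  static int queue_one_stripe(struct scrub_ctx *sctx)
  {
          /* Fill the next free slot out of SCRUB_TOTAL_STRIPES (16 * 8). */
          sctx->cur_stripe++;

          /* A group of 8 stripes is full: submit it, but do not wait. */
          if (sctx->cur_stripe % SCRUB_STRIPES_PER_GROUP == 0)
                  submit_initial_group_read(sctx,
                          sctx->cur_stripe - SCRUB_STRIPES_PER_GROUP,
                          SCRUB_STRIPES_PER_GROUP);

          /* All 16 groups are filled: submit the rest and wait for all. */
          if (sctx->cur_stripe == SCRUB_TOTAL_STRIPES)
                  return flush_scrub_stripes(sctx);
          return 0;
  }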
---
 fs/btrfs/scrub.c | 95 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 72 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 1748735f72c6..bfa8feee257f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -43,9 +43,21 @@ struct scrub_ctx;
 /*
  * The following value only influences the performance.
  *
- * This determines the batch size for stripe submitted in one go.
+ * This determines how many stripes would be submitted in one go,
+ * which would be 512KiB (BTRFS_STRIPE_LEN * SCRUB_STRIPES_PER_GROUP).
  */
-#define SCRUB_STRIPES_PER_SCTX	8	/* That would be 8 64K stripe per-device. */
+#define SCRUB_STRIPES_PER_GROUP		8
+
+/*
+ * How many groups we have for each sctx.
+ *
+ * This would be 8M per device, the same value as the old scrub in-flight bios
+ * size limit.
+ */
+#define SCRUB_GROUPS_PER_SCTX		16
+
+#define SCRUB_TOTAL_STRIPES		(SCRUB_GROUPS_PER_SCTX * \
+					 SCRUB_STRIPES_PER_GROUP)
 
 /*
  * The following value times PAGE_SIZE needs to be large enough to match the
@@ -172,7 +184,7 @@ struct scrub_stripe {
 };
 
 struct scrub_ctx {
-	struct scrub_stripe	stripes[SCRUB_STRIPES_PER_SCTX];
+	struct scrub_stripe	stripes[SCRUB_TOTAL_STRIPES];
 	struct scrub_stripe	*raid56_data_stripes;
 	struct btrfs_fs_info	*fs_info;
 	struct btrfs_path	extent_path;
@@ -317,7 +329,7 @@ static noinline_for_stack void scrub_free_ctx(struct scrub_ctx *sctx)
 	if (!sctx)
 		return;
 
-	for (i = 0; i < SCRUB_STRIPES_PER_SCTX; i++)
+	for (i = 0; i < SCRUB_TOTAL_STRIPES; i++)
 		release_scrub_stripe(&sctx->stripes[i]);
 
 	kfree(sctx);
@@ -345,7 +357,7 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
 	sctx->extent_path.skip_locking = 1;
 	sctx->csum_path.search_commit_root = 1;
 	sctx->csum_path.skip_locking = 1;
-	for (i = 0; i < SCRUB_STRIPES_PER_SCTX; i++) {
+	for (i = 0; i < SCRUB_TOTAL_STRIPES; i++) {
 		int ret;
 
 		ret = init_scrub_stripe(fs_info, &sctx->stripes[i]);
@@ -1657,6 +1669,29 @@ static bool stripe_has_metadata_error(struct scrub_stripe *stripe)
 	return false;
 }
 
+static void submit_initial_group_read(struct scrub_ctx *sctx,
+				      unsigned int first_slot,
+				      unsigned int nr_stripes)
+{
+	struct blk_plug plug;
+
+	ASSERT(first_slot < SCRUB_TOTAL_STRIPES);
+	ASSERT(first_slot + nr_stripes <= SCRUB_TOTAL_STRIPES);
+
+	scrub_throttle_dev_io(sctx, sctx->stripes[0].dev,
+			      btrfs_stripe_nr_to_offset(nr_stripes));
+	blk_start_plug(&plug);
+	for (int i = 0; i < nr_stripes; i++) {
+		struct scrub_stripe *stripe = &sctx->stripes[first_slot + i];
+
+		/* Those stripes should be initialized. */
+		ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
+
+		scrub_submit_initial_read(sctx, stripe);
+	}
+	blk_finish_plug(&plug);
+}
+
 static int flush_scrub_stripes(struct scrub_ctx *sctx)
 {
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
@@ -1669,11 +1704,13 @@ static int flush_scrub_stripes(struct scrub_ctx *sctx)
 
 	ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &sctx->stripes[0].state));
 
-	scrub_throttle_dev_io(sctx, sctx->stripes[0].dev,
-			      btrfs_stripe_nr_to_offset(nr_stripes));
-	for (int i = 0; i < nr_stripes; i++) {
-		stripe = &sctx->stripes[i];
-		scrub_submit_initial_read(sctx, stripe);
+	/* Submit the stripes which are populated but not submitted. */
+	if (nr_stripes % SCRUB_STRIPES_PER_GROUP) {
+		const int first_slot = round_down(nr_stripes,
+						  SCRUB_STRIPES_PER_GROUP);
+
+		submit_initial_group_read(sctx, first_slot,
+					  nr_stripes - first_slot);
 	}
 
 	for (int i = 0; i < nr_stripes; i++) {
@@ -1753,21 +1790,19 @@ static void raid56_scrub_wait_endio(struct bio *bio)
 
 static int queue_scrub_stripe(struct scrub_ctx *sctx, struct btrfs_block_group *bg,
 			      struct btrfs_device *dev, int mirror_num,
-			      u64 logical, u32 length, u64 physical)
+			      u64 logical, u32 length, u64 physical,
+			      u64 *found_logical_ret)
 {
 	struct scrub_stripe *stripe;
 	int ret;
 
-	/* No available slot, submit all stripes and wait for them. */
-	if (sctx->cur_stripe >= SCRUB_STRIPES_PER_SCTX) {
-		ret = flush_scrub_stripes(sctx);
-		if (ret < 0)
-			return ret;
-	}
+	/*
+	 * There should always be one slot left, as caller filling the last
+	 * slot should flush them all.
+	 */
+	ASSERT(sctx->cur_stripe < SCRUB_TOTAL_STRIPES);
 
 	stripe = &sctx->stripes[sctx->cur_stripe];
-
-	/* We can queue one stripe using the remaining slot. */
 	scrub_reset_stripe(stripe);
 	ret = scrub_find_fill_first_stripe(bg, &sctx->extent_path,
 			&sctx->csum_path, dev, physical, mirror_num, logical,
@@ -1775,7 +1810,22 @@ static int queue_scrub_stripe(struct scrub_ctx *sctx, struct btrfs_block_group *
 	/* Either >0 as no more extents or <0 for error. */
 	if (ret)
 		return ret;
+	if (found_logical_ret)
+		*found_logical_ret = stripe->logical;
 	sctx->cur_stripe++;
+
+	/* We filled one group, submit them. */
+	if (sctx->cur_stripe % SCRUB_STRIPES_PER_GROUP == 0) {
+		const int first_slot = sctx->cur_stripe -
+				       SCRUB_STRIPES_PER_GROUP;
+
+		submit_initial_group_read(sctx, first_slot,
+					  SCRUB_STRIPES_PER_GROUP);
+	}
+
+	/* Last slot used, flush them all. */
+	if (sctx->cur_stripe == SCRUB_TOTAL_STRIPES)
+		return flush_scrub_stripes(sctx);
 	return 0;
 }
 
@@ -1984,6 +2034,7 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
 
 	/* Go through each extent items inside the logical range */
 	while (cur_logical < logical_end) {
+		u64 found_logical;
 		u64 cur_physical = physical + cur_logical - logical_start;
 
 		/* Canceled? */
@@ -2008,7 +2059,7 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
 
 		ret = queue_scrub_stripe(sctx, bg, device, mirror_num,
 					 cur_logical, logical_end - cur_logical,
-					 cur_physical);
+					 cur_physical, &found_logical);
 		if (ret > 0) {
 			/* No more extent, just update the accounting */
 			sctx->stat.last_physical = physical + logical_length;
@@ -2018,9 +2069,7 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
 		if (ret < 0)
 			break;
 
-		ASSERT(sctx->cur_stripe > 0);
-		cur_logical = sctx->stripes[sctx->cur_stripe - 1].logical
-			      + BTRFS_STRIPE_LEN;
+		cur_logical = found_logical + BTRFS_STRIPE_LEN;
 
 		/* Don't hold CPU for too long time */
 		cond_resched();
-- 
2.41.0



* [PATCH 4/5] btrfs: scrub: don't go ordered workqueue for dev-replace
  2023-07-28 11:14 [PATCH 0/5] btrfs: scrub: improve the scrub performance Qu Wenruo
                   ` (2 preceding siblings ...)
  2023-07-28 11:14 ` [PATCH 3/5] btrfs: scrub: fix grouping of read IO Qu Wenruo
@ 2023-07-28 11:14 ` Qu Wenruo
  2023-07-28 11:14 ` [PATCH 5/5] btrfs: scrub: move write back of repaired sectors into scrub_stripe_read_repair_worker() Qu Wenruo
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-07-28 11:14 UTC (permalink / raw)
  To: linux-btrfs

The workqueue fs_info->scrub_workers is allocated as an ordered
workqueue if it's a device replace operation.

However, scrub relies on multiple workers to do data csum verification,
and we always submit several read requests in a row.

Thus there is no need to use an ordered workqueue just for dev-replace.
We already have extra synchronization (the main thread always submits
and waits for dev-replace writes) to handle zoned devices.  (A minimal
sketch of the ordered vs. regular workqueue difference follows.)
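
Purely as an illustration of the difference this relies on (not part of
the patch): an ordered workqueue behaves like max_active == 1, so csum
verification work items are serialized, while a regular workqueue with a
larger max_active lets several stripes be verified in parallel.
alloc_scrub_wq() below is a hypothetical helper, not a function from the
patch.

  #include <linux/workqueue.h>

  static struct workqueue_struct *alloc_scrub_wq(bool ordered, int max_active)
  {
          unsigned int flags = WQ_FREEZABLE | WQ_UNBOUND;

          if (ordered)
                  /* At most one work item executes at any time. */
                  return alloc_ordered_workqueue("btrfs-scrub", flags);

          /* Up to max_active work items can execute concurrently. */
          return alloc_workqueue("btrfs-scrub", flags, max_active);
  }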

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index bfa8feee257f..1001c7724e4c 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2779,8 +2779,7 @@ static void scrub_workers_put(struct btrfs_fs_info *fs_info)
 /*
  * get a reference count on fs_info->scrub_workers. start worker if necessary
  */
-static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
-						int is_dev_replace)
+static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info)
 {
 	struct workqueue_struct *scrub_workers = NULL;
 	unsigned int flags = WQ_FREEZABLE | WQ_UNBOUND;
@@ -2790,10 +2789,7 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
 	if (refcount_inc_not_zero(&fs_info->scrub_workers_refcnt))
 		return 0;
 
-	if (is_dev_replace)
-		scrub_workers = alloc_ordered_workqueue("btrfs-scrub", flags);
-	else
-		scrub_workers = alloc_workqueue("btrfs-scrub", flags, max_active);
+	scrub_workers = alloc_workqueue("btrfs-scrub", flags, max_active);
 	if (!scrub_workers)
 		return -ENOMEM;
 
@@ -2845,7 +2841,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	if (IS_ERR(sctx))
 		return PTR_ERR(sctx);
 
-	ret = scrub_workers_get(fs_info, is_dev_replace);
+	ret = scrub_workers_get(fs_info);
 	if (ret)
 		goto out_free_ctx;
 
-- 
2.41.0



* [PATCH 5/5] btrfs: scrub: move write back of repaired sectors into scrub_stripe_read_repair_worker()
  2023-07-28 11:14 [PATCH 0/5] btrfs: scrub: improve the scrub performance Qu Wenruo
                   ` (3 preceding siblings ...)
  2023-07-28 11:14 ` [PATCH 4/5] btrfs: scrub: don't go ordered workqueue for dev-replace Qu Wenruo
@ 2023-07-28 11:14 ` Qu Wenruo
  2023-07-28 12:38 ` [PATCH 0/5] btrfs: scrub: improve the scrub performance Martin Steigerwald
  2023-08-01 20:14 ` Jani Partanen
  6 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-07-28 11:14 UTC (permalink / raw)
  To: linux-btrfs

Currently scrub_stripe_read_repair_worker() only does reads to rebuild
the corrupted sectors; it doesn't do any writeback.

The design mostly puts writeback into a more ordered manner, to
co-operate with dev-replace on zoned mode, which requires every write
to be submitted in bytenr order.

However the writeback of repaired sectors into the original mirror
doesn't need such a strong sync requirement, as it can only happen for
non-zoned devices.

This patch moves the writeback of repaired sectors into
scrub_stripe_read_repair_worker(), which removes two call sites for
repaired sector writeback (one in flush_scrub_stripes(), one in
scrub_raid56_parity_stripe()).

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 73 ++++++++++++++++++------------------------------
 1 file changed, 27 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 1001c7724e4c..2c436cbcce83 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -987,6 +987,9 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 	spin_unlock(&sctx->stat_lock);
 }
 
+static void scrub_write_sectors(struct scrub_ctx *sctx, struct scrub_stripe *stripe,
+				unsigned long write_bitmap, bool dev_replace);
+
 /*
  * The main entrance for all read related scrub work, including:
  *
@@ -995,13 +998,16 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
  * - Go through the remaining mirrors and try to read as large blocksize as
  *   possible
  * - Go through all mirrors (including the failed mirror) sector-by-sector
+ * - Submit writeback for repaired sectors
  *
- * Writeback does not happen here, it needs extra synchronization.
+ * Writeback for dev-replace does not happen here, it needs extra
+ * synchronization for zoned devices.
  */
 static void scrub_stripe_read_repair_worker(struct work_struct *work)
 {
 	struct scrub_stripe *stripe = container_of(work, struct scrub_stripe, work);
-	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
+	struct scrub_ctx *sctx = stripe->sctx;
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	int num_copies = btrfs_num_copies(fs_info, stripe->bg->start,
 					  stripe->bg->length);
 	int mirror;
@@ -1066,6 +1072,25 @@ static void scrub_stripe_read_repair_worker(struct work_struct *work)
 			goto out;
 	}
 out:
+	/*
+	 * Submit the repaired sectors.  For zoned case, we cannot do repair
+	 * in-place, but queue the bg to be relocated.
+	 */
+	if (btrfs_is_zoned(fs_info)) {
+		if (!bitmap_empty(&stripe->error_bitmap, stripe->nr_sectors)) {
+			btrfs_repair_one_zone(fs_info,
+					      sctx->stripes[0].bg->start);
+		}
+	} else if (!sctx->readonly) {
+		unsigned long repaired;
+
+		bitmap_andnot(&repaired, &stripe->init_error_bitmap,
+			      &stripe->error_bitmap, stripe->nr_sectors);
+		scrub_write_sectors(sctx, stripe, repaired, false);
+		wait_scrub_stripe_io(stripe);
+	}
+
+	scrub_stripe_report_errors(sctx, stripe);
 	scrub_stripe_report_errors(stripe->sctx, stripe);
 	set_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state);
 	wake_up(&stripe->repair_wait);
@@ -1720,32 +1745,6 @@ static int flush_scrub_stripes(struct scrub_ctx *sctx)
 			   test_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state));
 	}
 
-	/*
-	 * Submit the repaired sectors.  For zoned case, we cannot do repair
-	 * in-place, but queue the bg to be relocated.
-	 */
-	if (btrfs_is_zoned(fs_info)) {
-		for (int i = 0; i < nr_stripes; i++) {
-			stripe = &sctx->stripes[i];
-
-			if (!bitmap_empty(&stripe->error_bitmap, stripe->nr_sectors)) {
-				btrfs_repair_one_zone(fs_info,
-						      sctx->stripes[0].bg->start);
-				break;
-			}
-		}
-	} else if (!sctx->readonly) {
-		for (int i = 0; i < nr_stripes; i++) {
-			unsigned long repaired;
-
-			stripe = &sctx->stripes[i];
-
-			bitmap_andnot(&repaired, &stripe->init_error_bitmap,
-				      &stripe->error_bitmap, stripe->nr_sectors);
-			scrub_write_sectors(sctx, stripe, repaired, false);
-		}
-	}
-
 	/* Submit for dev-replace. */
 	if (sctx->is_dev_replace) {
 		/*
@@ -1920,24 +1919,6 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 	/* For now, no zoned support for RAID56. */
 	ASSERT(!btrfs_is_zoned(sctx->fs_info));
 
-	/* Writeback for the repaired sectors. */
-	for (int i = 0; i < data_stripes; i++) {
-		unsigned long repaired;
-
-		stripe = &sctx->raid56_data_stripes[i];
-
-		bitmap_andnot(&repaired, &stripe->init_error_bitmap,
-			      &stripe->error_bitmap, stripe->nr_sectors);
-		scrub_write_sectors(sctx, stripe, repaired, false);
-	}
-
-	/* Wait for the above writebacks to finish. */
-	for (int i = 0; i < data_stripes; i++) {
-		stripe = &sctx->raid56_data_stripes[i];
-
-		wait_scrub_stripe_io(stripe);
-	}
-
 	/*
 	 * Now all data stripes are properly verified. Check if we have any
 	 * unrepaired, if so abort immediately or we could further corrupt the
-- 
2.41.0



* Re: [PATCH 0/5] btrfs: scrub: improve the scrub performance
  2023-07-28 11:14 [PATCH 0/5] btrfs: scrub: improve the scrub performance Qu Wenruo
                   ` (4 preceding siblings ...)
  2023-07-28 11:14 ` [PATCH 5/5] btrfs: scrub: move write back of repaired sectors into scrub_stripe_read_repair_worker() Qu Wenruo
@ 2023-07-28 12:38 ` Martin Steigerwald
  2023-07-28 16:50   ` David Sterba
  2023-08-01 20:14 ` Jani Partanen
  6 siblings, 1 reply; 16+ messages in thread
From: Martin Steigerwald @ 2023-07-28 12:38 UTC (permalink / raw)
  To: linux-btrfs, Qu Wenruo

Qu Wenruo - 28.07.23, 13:14:03 CEST:
> The first 3 patches would greately improve the scrub read performance,
> but unfortunately it's still not as fast as the pre-6.4 kernels.
> (2.2GiB/s vs 3.0GiB/s), but still much better than 6.4 kernels
> (2.2GiB vs 1.0GiB/s).

Thanks for the patch set.

What is the reason for not going back to the performance of the pre-6.4
kernel? Isn't it possible with the new scrubbing method? In that case
what improvements does the new scrubbing code have that warrant the
lower performance?

I'd just like to understand the background of this a bit more. I do not
mind slightly lower performance too much, especially if it is outweighed
by other benefits.

-- 
Martin




* Re: [PATCH 0/5] btrfs: scrub: improve the scrub performance
  2023-07-28 12:38 ` [PATCH 0/5] btrfs: scrub: improve the scrub performance Martin Steigerwald
@ 2023-07-28 16:50   ` David Sterba
  2023-07-28 21:14     ` Martin Steigerwald
  0 siblings, 1 reply; 16+ messages in thread
From: David Sterba @ 2023-07-28 16:50 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Qu Wenruo

On Fri, Jul 28, 2023 at 02:38:35PM +0200, Martin Steigerwald wrote:
> Qu Wenruo - 28.07.23, 13:14:03 CEST:
> > The first 3 patches would greately improve the scrub read performance,
> > but unfortunately it's still not as fast as the pre-6.4 kernels.
> > (2.2GiB/s vs 3.0GiB/s), but still much better than 6.4 kernels
> > (2.2GiB vs 1.0GiB/s).
> 
> Thanks for the patch set.
> 
> What is the reason for not going back to the performance of the pre-6.4 
> kernel? Isn't it possible with the new scrubbing method? In that case 
> what improvements does the new scrubbing code have that warrant to have 
> a lower performance?

Lower performance was not expected and needs to be brought back. A minor
decrease would be tolerable but that's something around 5%, not 60%.

> Just like to understand the background of this a bit more. I do not mind 
> a bit lower performance too much, especially in case it is outweighed by 
> other benefits.

The code in scrub dates from the 3.0 times, and since then new features
have been implemented; extending the code became hard over time, so a
bigger update was done, restructuring how the IO is done.


* Re: [PATCH 0/5] btrfs: scrub: improve the scrub performance
  2023-07-28 16:50   ` David Sterba
@ 2023-07-28 21:14     ` Martin Steigerwald
  0 siblings, 0 replies; 16+ messages in thread
From: Martin Steigerwald @ 2023-07-28 21:14 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs, Qu Wenruo

David Sterba - 28.07.23, 18:50:37 CEST:
> On Fri, Jul 28, 2023 at 02:38:35PM +0200, Martin Steigerwald wrote:
> > Qu Wenruo - 28.07.23, 13:14:03 CEST:
> > > The first 3 patches would greately improve the scrub read
> > > performance, but unfortunately it's still not as fast as the
> > > pre-6.4 kernels. (2.2GiB/s vs 3.0GiB/s), but still much better
> > > than 6.4 kernels (2.2GiB vs 1.0GiB/s).
> > 
> > Thanks for the patch set.
> > 
> > What is the reason for not going back to the performance of the
> > pre-6.4 kernel? Isn't it possible with the new scrubbing method? In
> > that case what improvements does the new scrubbing code have that
> > warrant to have a lower performance?
> 
> Lower performance was not expected and needs to be brought back. A
> minor decrease would be tolerable but that's something around 5%, not
> 60%.

Okay. Best of success with improving performance again.

> > Just like to understand the background of this a bit more. I do not
> > mind a bit lower performance too much, especially in case it is
> > outweighed by other benefits.
> 
> The code in scrub was from 3.0 times and since then new features have
> been implemented, extending the code became hard over time so a bigger
> update was done restructuring how the IO is done.

Okay, thanks for explaining.

-- 
Martin




* Re: [PATCH 0/5] btrfs: scrub: improve the scrub performance
  2023-07-28 11:14 [PATCH 0/5] btrfs: scrub: improve the scrub performance Qu Wenruo
                   ` (5 preceding siblings ...)
  2023-07-28 12:38 ` [PATCH 0/5] btrfs: scrub: improve the scrub performance Martin Steigerwald
@ 2023-08-01 20:14 ` Jani Partanen
  2023-08-01 22:06   ` Qu Wenruo
  6 siblings, 1 reply; 16+ messages in thread
From: Jani Partanen @ 2023-08-01 20:14 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hello, I did some testing with 4 x 320GB HDDs: metadata raid1c4 and data
raid5.

Kernel  6.3.12

btrfs scrub start -B /dev/sdb

scrub done for 6691cb53-271b-4abd-b2ab-143c41027924
Scrub started:    Tue Aug  1 04:00:39 2023
Status:           finished
Duration:         2:37:35
Total to scrub:   149.58GiB
Rate:             16.20MiB/s
Error summary:    no errors found


Kernel 6.5.0-rc3

btrfs scrub start -B /dev/sdb

scrub done for 6691cb53-271b-4abd-b2ab-143c41027924
Scrub started:    Tue Aug  1 08:41:12 2023
Status:           finished
Duration:         1:31:03
Total to scrub:   299.16GiB
Rate:             56.08MiB/s
Error summary:    no errors found


So much better speed but Total to scrub reporting seems strange.

df -h /dev/sdb
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        1,2T  599G  292G  68% /mnt


Looks like the old kernel did about 1/4 of the total data, which seems
right because I have 4 drives.

The new one did about 1/2 of the total data, which seems wrong.

And if I do a scrub against the mount point:

btrfs scrub start -B /mnt/
scrub done for 6691cb53-271b-4abd-b2ab-143c41027924
Scrub started:    Tue Aug  1 11:03:56 2023
Status:           finished
Duration:         10:02:44
Total to scrub:   1.17TiB
Rate:             33.89MiB/s
Error summary:    no errors found


Then performance goes down the toilet, and now the Total to scrub
reporting is about 2:1.

btrfs version
btrfs-progs v6.3.3

Is it a btrfs-progs issue with reporting? And what about the raid5 scrub
performance, why is it so bad?


About the disks: they are old WD Blue drives that can do about 100MB/s
read/write on average.


On 28/07/2023 14.14, Qu Wenruo wrote:
> [REPO]
> https://github.com/adam900710/linux/tree/scrub_testing
>
> [CHANGELOG]
> v1:
> - Rebased to latest misc-next
>
> - Rework the read IO grouping patch
>    David has found some crashes mostly related to scrub performance
>    fixes, meanwhile the original grouping patch has one extra flag,
>    SCRUB_FLAG_READ_SUBMITTED, to avoid double submitting.
>
>    But this flag can be avoided as we can easily avoid double submitting
>    just by properly checking the sctx->nr_stripe variable.
>
>    This reworked grouping read IO patch should be safer compared to the
>    initial version, with better code structure.
>
>    Unfortunately, the final performance is worse than the initial version
>    (2.2GiB/s vs 2.5GiB/s), but it should be less racy thus safer.
>
> - Re-order the patches
>    The first 3 patches are the main fixes, and I put safer patches first,
>    so even if David still found crash at certain patch, the remaining can
>    be dropped if needed.
>
> There is a huge scrub performance drop introduced by v6.4 kernel, that
> the scrub performance is only around 1/3 for large data extents.
>
> There are several causes:
>
> - Missing blk plug
>    This means read requests won't be merged by block layer, this can
>    hugely reduce the read performance.
>
> - Extra time spent on extent/csum tree search
>    This including extra path allocation/freeing and tree searchs.
>    This is especially obvious for large data extents, as previously we
>    only do one csum search per 512K, but now we do one csum search per
>    64K, an 8 times increase in csum tree search.
>
> - Less concurrency
>    Mostly due to the fact that we're doing submit-and-wait, thus much
>    lower queue depth, affecting things like NVME which benefits a lot
>    from high concurrency.
>
> The first 3 patches would greately improve the scrub read performance,
> but unfortunately it's still not as fast as the pre-6.4 kernels.
> (2.2GiB/s vs 3.0GiB/s), but still much better than 6.4 kernels (2.2GiB
> vs 1.0GiB/s).
>
> Qu Wenruo (5):
>    btrfs: scrub: avoid unnecessary extent tree search preparing stripes
>    btrfs: scrub: avoid unnecessary csum tree search preparing stripes
>    btrfs: scrub: fix grouping of read IO
>    btrfs: scrub: don't go ordered workqueue for dev-replace
>    btrfs: scrub: move write back of repaired sectors into
>      scrub_stripe_read_repair_worker()
>
>   fs/btrfs/file-item.c |  33 +++---
>   fs/btrfs/file-item.h |   6 +-
>   fs/btrfs/raid56.c    |   4 +-
>   fs/btrfs/scrub.c     | 234 ++++++++++++++++++++++++++-----------------
>   4 files changed, 169 insertions(+), 108 deletions(-)
>


* Re: [PATCH 0/5] btrfs: scrub: improve the scrub performance
  2023-08-01 20:14 ` Jani Partanen
@ 2023-08-01 22:06   ` Qu Wenruo
  2023-08-01 23:48     ` Jani Partanen
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2023-08-01 22:06 UTC (permalink / raw)
  To: Jani Partanen, Qu Wenruo, linux-btrfs



On 2023/8/2 04:14, Jani Partanen wrote:
> Hello, I did some testing with 4 x 320GB HDD's. Meta raid1c4 and data
> raid 5.

RAID5 has other problems related to scrub performance unfortunately.

>
> Kernel  6.3.12
>
> btrfs scrub start -B /dev/sdb
>
> scrub done for 6691cb53-271b-4abd-b2ab-143c41027924
> Scrub started:    Tue Aug  1 04:00:39 2023
> Status:           finished
> Duration:         2:37:35
> Total to scrub:   149.58GiB
> Rate:             16.20MiB/s
> Error summary:    no errors found
>
>
> Kernel 6.5.0-rc3
>
> btrfs scrub start -B /dev/sdb
>
> scrub done for 6691cb53-271b-4abd-b2ab-143c41027924
> Scrub started:    Tue Aug  1 08:41:12 2023
> Status:           finished
> Duration:         1:31:03
> Total to scrub:   299.16GiB
> Rate:             56.08MiB/s
> Error summary:    no errors found
>
>
> So much better speed but Total to scrub reporting seems strange.
>
> df -h /dev/sdb
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb        1,2T  599G  292G  68% /mnt
>
>
> Looks like old did like 1/4 of total data what seems like right because
> I have 4 drives.
>
> New did  about 1/2 of total data what seems wrong.

I checked the kernel part of the progress reporting: for a single-device
scrub on RAID56, a data stripe contributes to the scrubbed bytes, but a
P/Q stripe should not contribute to the value.

Thus 1/4 should be the correct value.

However there is another factor in btrfs-progs, which determines how to
report the numbers.

There is a fix for it already merged in v6.3.2, but it seems to have
other problems involved.

>
> And if I do scrub against mount point:
>
> btrfs scrub start -B /mnt/
> scrub done for 6691cb53-271b-4abd-b2ab-143c41027924
> Scrub started:    Tue Aug  1 11:03:56 2023
> Status:           finished
> Duration:         10:02:44
> Total to scrub:   1.17TiB
> Rate:             33.89MiB/s
> Error summary:    no errors found
>
>
> Then performance goes down to toilet and now Total to scrub reporting is
> like 2/1
>
> btrfs version
> btrfs-progs v6.3.3
>
> Is it btrfs-progs issue with reporting?

Can you try with -BdR option?

It shows the raw numbers, which is the easiest way to determine if it's
a bug in btrfs-progs or kernel.

> What about raid 5 scrub
> performance, why it is so bad?

It's explained in this cover letter:
https://lore.kernel.org/linux-btrfs/cover.1688368617.git.wqu@suse.com/

In short, RAID56 full fs scrub is causing too many duplicated reads, and
the root cause is, the per-device scrub is never a good idea for RAID56.

That's why I'm trying to introduce the new scrub flag for that.

Thanks,
Qu

>
>
> About disks, they are old WD Blue drives what can do about 100MB/s
> read/write on average.
>
>
> On 28/07/2023 14.14, Qu Wenruo wrote:
>> [REPO]
>> https://github.com/adam900710/linux/tree/scrub_testing
>>
>> [CHANGELOG]
>> v1:
>> - Rebased to latest misc-next
>>
>> - Rework the read IO grouping patch
>>    David has found some crashes mostly related to scrub performance
>>    fixes, meanwhile the original grouping patch has one extra flag,
>>    SCRUB_FLAG_READ_SUBMITTED, to avoid double submitting.
>>
>>    But this flag can be avoided as we can easily avoid double submitting
>>    just by properly checking the sctx->nr_stripe variable.
>>
>>    This reworked grouping read IO patch should be safer compared to the
>>    initial version, with better code structure.
>>
>>    Unfortunately, the final performance is worse than the initial version
>>    (2.2GiB/s vs 2.5GiB/s), but it should be less racy thus safer.
>>
>> - Re-order the patches
>>    The first 3 patches are the main fixes, and I put safer patches first,
>>    so even if David still found crash at certain patch, the remaining can
>>    be dropped if needed.
>>
>> There is a huge scrub performance drop introduced by v6.4 kernel, that
>> the scrub performance is only around 1/3 for large data extents.
>>
>> There are several causes:
>>
>> - Missing blk plug
>>    This means read requests won't be merged by block layer, this can
>>    hugely reduce the read performance.
>>
>> - Extra time spent on extent/csum tree search
>>    This including extra path allocation/freeing and tree searchs.
>>    This is especially obvious for large data extents, as previously we
>>    only do one csum search per 512K, but now we do one csum search per
>>    64K, an 8 times increase in csum tree search.
>>
>> - Less concurrency
>>    Mostly due to the fact that we're doing submit-and-wait, thus much
>>    lower queue depth, affecting things like NVME which benefits a lot
>>    from high concurrency.
>>
>> The first 3 patches would greately improve the scrub read performance,
>> but unfortunately it's still not as fast as the pre-6.4 kernels.
>> (2.2GiB/s vs 3.0GiB/s), but still much better than 6.4 kernels (2.2GiB
>> vs 1.0GiB/s).
>>
>> Qu Wenruo (5):
>>    btrfs: scrub: avoid unnecessary extent tree search preparing stripes
>>    btrfs: scrub: avoid unnecessary csum tree search preparing stripes
>>    btrfs: scrub: fix grouping of read IO
>>    btrfs: scrub: don't go ordered workqueue for dev-replace
>>    btrfs: scrub: move write back of repaired sectors into
>>      scrub_stripe_read_repair_worker()
>>
>>   fs/btrfs/file-item.c |  33 +++---
>>   fs/btrfs/file-item.h |   6 +-
>>   fs/btrfs/raid56.c    |   4 +-
>>   fs/btrfs/scrub.c     | 234 ++++++++++++++++++++++++++-----------------
>>   4 files changed, 169 insertions(+), 108 deletions(-)
>>


* Re: [PATCH 0/5] btrfs: scrub: improve the scrub performance
  2023-08-01 22:06   ` Qu Wenruo
@ 2023-08-01 23:48     ` Jani Partanen
  2023-08-02  1:56       ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Jani Partanen @ 2023-08-01 23:48 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs


On 02/08/2023 1.06, Qu Wenruo wrote:
>
> Can you try with -BdR option?
>
> It shows the raw numbers, which is the easiest way to determine if it's
> a bug in btrfs-progs or kernel.
>

Here is single device result:

btrfs scrub start -BdR /dev/sdb

Scrub device /dev/sdb (id 1) done
Scrub started:    Wed Aug  2 01:33:21 2023
Status:           finished
Duration:         0:44:29
         data_extents_scrubbed: 4902956
         tree_extents_scrubbed: 60494
         data_bytes_scrubbed: 321301020672
         tree_bytes_scrubbed: 991133696
         read_errors: 0
         csum_errors: 0
         verify_errors: 0
         no_csum: 22015840
         csum_discards: 0
         super_errors: 0
         malloc_errors: 0
         uncorrectable_errors: 0
         unverified_errors: 0
         corrected_errors: 0
         last_physical: 256679870464


I'll run it against the mountpoint when I go to sleep because it's gonna
take long.

>> What about raid 5 scrub
>> performance, why it is so bad?
>
> It's explained in this cover letter:
> https://lore.kernel.org/linux-btrfs/cover.1688368617.git.wqu@suse.com/
>
> In short, RAID56 full fs scrub is causing too many duplicated reads, and
> the root cause is, the per-device scrub is never a good idea for RAID56.
>
> That's why I'm trying to introduce the new scrub flag for that.
>
Ah, so there is a different patchset for raid5 scrub, good to know. I'm
gonna build that branch and test it. Also let me know if I could help
somehow with that stress testing. These drives are dedicated to testing.
I am running a VM under Hyper-V and the disks are passed through
directly to the VM.



* Re: [PATCH 0/5] btrfs: scrub: improve the scrub performance
  2023-08-01 23:48     ` Jani Partanen
@ 2023-08-02  1:56       ` Qu Wenruo
  2023-08-02  2:15         ` Jani Partanen
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2023-08-02  1:56 UTC (permalink / raw)
  To: Jani Partanen, Qu Wenruo, linux-btrfs



On 2023/8/2 07:48, Jani Partanen wrote:
>
> On 02/08/2023 1.06, Qu Wenruo wrote:
>>
>> Can you try with -BdR option?
>>
>> It shows the raw numbers, which is the easiest way to determine if it's
>> a bug in btrfs-progs or kernel.
>>
>
> Here is single device result:
>
> btrfs scrub start -BdR /dev/sdb
>
> Scrub device /dev/sdb (id 1) done
> Scrub started:    Wed Aug  2 01:33:21 2023
> Status:           finished
> Duration:         0:44:29
>          data_extents_scrubbed: 4902956
>          tree_extents_scrubbed: 60494
>          data_bytes_scrubbed: 321301020672

So the btrfs scrub report is doing the correct report using the values
from kernel.

And considering the used space is around 600G, divided by 4 disks (aka,
3 data stripes + 1 parity stripe), it's not that weird, as we would get
around 200G per device (parity doesn't contribute to the scrubbed bytes).

Especially considering your metadata is RAID1C4, it means we should if
anything have more than 200G.
Instead, it's the old report of less than 200G that doesn't seem correct.

Mind to provide the output of "btrfs fi usage <mnt>" to verify my
assumption?

>          tree_bytes_scrubbed: 991133696
>          read_errors: 0
>          csum_errors: 0
>          verify_errors: 0
>          no_csum: 22015840
>          csum_discards: 0
>          super_errors: 0
>          malloc_errors: 0
>          uncorrectable_errors: 0
>          unverified_errors: 0
>          corrected_errors: 0
>          last_physical: 256679870464
>
>
> I'll do against mountpoint when I go to sleep because it gonna take long.
>
>>> What about raid 5 scrub
>>> performance, why it is so bad?
>>
>> It's explained in this cover letter:
>> https://lore.kernel.org/linux-btrfs/cover.1688368617.git.wqu@suse.com/
>>
>> In short, RAID56 full fs scrub causes too many duplicated reads, and
>> the root cause is that per-device scrub is never a good idea for RAID56.
>>
>> That's why I'm trying to introduce the new scrub flag for that.
>>
> Ah, so there is a different patchset for raid5 scrub, good to know.
> I'm going to build that branch and test it.

Although I wouldn't recommend testing it for now, as we're still working
on the performance drop and the patchset may not apply cleanly.

> Also, let me know if I could help
> somehow with that stress testing. These drives are dedicated for
> testing. I am running a VM under Hyper-V and the disks are passed
> through directly to the VM.
>

Sure, I'll CC you when refreshing the patchset, extra tests are always
appreciated.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/5] btrfs: scrub: improve the scrub performance
  2023-08-02  1:56       ` Qu Wenruo
@ 2023-08-02  2:15         ` Jani Partanen
  2023-08-02  2:20           ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Jani Partanen @ 2023-08-02  2:15 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs


On 02/08/2023 4.56, Qu Wenruo wrote:
>
> So the btrfs scrub report is correct; it's just showing the values
> from the kernel.
>
> And considering the used space is around 600G, divided across 4 disks
> (aka, 3 data stripes + 1 parity stripe), it's not that weird, as we
> would get around 200G per device (parity doesn't contribute to the
> scrubbed bytes).
>
> Especially considering your metadata is RAID1C4, we should expect a
> little more than 200G.
> Instead, it's the old report of less than 200G that doesn't seem correct.
>
> Would you mind providing the output of "btrfs fi usage <mnt>" to verify
> my assumption?
>
btrfs fi usage /mnt/
Overall:
     Device size:                   1.16TiB
     Device allocated:            844.25GiB
     Device unallocated:          348.11GiB
     Device missing:                  0.00B
     Device slack:                    0.00B
     Used:                        799.86GiB
     Free (estimated):            289.58GiB      (min: 115.52GiB)
     Free (statfs, df):           289.55GiB
     Data ratio:                       1.33
     Metadata ratio:                   4.00
     Global reserve:              471.80MiB      (used: 0.00B)
     Multiple profiles:                  no

Data,RAID5: Size:627.00GiB, Used:598.51GiB (95.46%)
    /dev/sdb      209.00GiB
    /dev/sdc      209.00GiB
    /dev/sdd      209.00GiB
    /dev/sde      209.00GiB

Metadata,RAID1C4: Size:2.00GiB, Used:472.56MiB (23.07%)
    /dev/sdb        2.00GiB
    /dev/sdc        2.00GiB
    /dev/sdd        2.00GiB
    /dev/sde        2.00GiB

System,RAID1C4: Size:64.00MiB, Used:64.00KiB (0.10%)
    /dev/sdb       64.00MiB
    /dev/sdc       64.00MiB
    /dev/sdd       64.00MiB
    /dev/sde       64.00MiB

Unallocated:
    /dev/sdb       87.03GiB
    /dev/sdc       87.03GiB
    /dev/sdd       87.03GiB
    /dev/sde       87.03GiB


There is 1 extra 2GB file now, which is why it shows a bit more usage now.


> Sure, I'll CC you when refreshing the patchset, extra tests are always
> appreciated.
>
Sounds good, thanks!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/5] btrfs: scrub: improve the scrub performance
  2023-08-02  2:15         ` Jani Partanen
@ 2023-08-02  2:20           ` Qu Wenruo
  2023-08-03  6:30             ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2023-08-02  2:20 UTC (permalink / raw)
  To: Jani Partanen, Qu Wenruo, linux-btrfs



On 2023/8/2 10:15, Jani Partanen wrote:
> 
> On 02/08/2023 4.56, Qu Wenruo wrote:
>>
>> So the btrfs scrub report is correct; it's just showing the values
>> from the kernel.
>>
>> And considering the used space is around 600G, divided across 4 disks
>> (aka, 3 data stripes + 1 parity stripe), it's not that weird, as we
>> would get around 200G per device (parity doesn't contribute to the
>> scrubbed bytes).
>>
>> Especially considering your metadata is RAID1C4, we should expect a
>> little more than 200G.
>> Instead, it's the old report of less than 200G that doesn't seem correct.
>>
>> Would you mind providing the output of "btrfs fi usage <mnt>" to verify
>> my assumption?
>>
> btrfs fi usage /mnt/
> Overall:
>      Device size:                   1.16TiB
>      Device allocated:            844.25GiB
>      Device unallocated:          348.11GiB
>      Device missing:                  0.00B
>      Device slack:                    0.00B
>      Used:                        799.86GiB
>      Free (estimated):            289.58GiB      (min: 115.52GiB)
>      Free (statfs, df):           289.55GiB
>      Data ratio:                       1.33
>      Metadata ratio:                   4.00
>      Global reserve:              471.80MiB      (used: 0.00B)
>      Multiple profiles:                  no
> 
> Data,RAID5: Size:627.00GiB, Used:598.51GiB (95.46%)
>     /dev/sdb      209.00GiB
>     /dev/sdc      209.00GiB
>     /dev/sdd      209.00GiB
>     /dev/sde      209.00GiB

OK, my previous calculation was incorrect...

For each device there should be 209GiB used by RAID5 chunks, and only
3/4 of that contributes to the scrubbed data bytes.

Thus there seems to be some double accounting.

This definitely needs some extra digging.
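
FWIW, a quick sanity check with the numbers above (a rough python
sketch; it assumes the data chunk usage is spread evenly across the
four devices):

per_dev_raid5_gib = 209            # RAID5 chunks per device, from fi usage
data_fraction = 3.0 / 4.0          # rotating parity: 1 stripe in 4 is parity
chunk_usage = 598.51 / 627.0       # Used / Size of the data chunks
expected = per_dev_raid5_gib * data_fraction * chunk_usage
reported = 321301020672 / 2**30    # data_bytes_scrubbed from the -BdR run
print(round(expected), round(reported))   # ~150 vs ~299 GiB, roughly 2x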

Thanks,
Qu

> 
> Metadata,RAID1C4: Size:2.00GiB, Used:472.56MiB (23.07%)
>     /dev/sdb        2.00GiB
>     /dev/sdc        2.00GiB
>     /dev/sdd        2.00GiB
>     /dev/sde        2.00GiB
> 
> System,RAID1C4: Size:64.00MiB, Used:64.00KiB (0.10%)
>     /dev/sdb       64.00MiB
>     /dev/sdc       64.00MiB
>     /dev/sdd       64.00MiB
>     /dev/sde       64.00MiB
> 
> Unallocated:
>     /dev/sdb       87.03GiB
>     /dev/sdc       87.03GiB
>     /dev/sdd       87.03GiB
>     /dev/sde       87.03GiB
> 
> 
> There is 1 extra 2GB file now, which is why it shows a bit more usage now.
> 
> 
>> Sure, I'll CC you when refreshing the patchset, extra tests are always
>> appreciated.
>>
> Sounds good, thanks!
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/5] btrfs: scrub: improve the scrub performance
  2023-08-02  2:20           ` Qu Wenruo
@ 2023-08-03  6:30             ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-08-03  6:30 UTC (permalink / raw)
  To: Jani Partanen, Qu Wenruo, linux-btrfs



On 2023/8/2 10:20, Qu Wenruo wrote:
> 
> 
> On 2023/8/2 10:15, Jani Partanen wrote:
>>
>> On 02/08/2023 4.56, Qu Wenruo wrote:
>>>
>>> So the btrfs scrub report is correct; it's just showing the values
>>> from the kernel.
>>>
>>> And considering the used space is around 600G, divided across 4 disks
>>> (aka, 3 data stripes + 1 parity stripe), it's not that weird, as we
>>> would get around 200G per device (parity doesn't contribute to the
>>> scrubbed bytes).
>>>
>>> Especially considering your metadata is RAID1C4, we should expect a
>>> little more than 200G.
>>> Instead, it's the old report of less than 200G that doesn't seem correct.
>>>
>>> Would you mind providing the output of "btrfs fi usage <mnt>" to verify
>>> my assumption?
>>>
>> btrfs fi usage /mnt/
>> Overall:
>>      Device size:                   1.16TiB
>>      Device allocated:            844.25GiB
>>      Device unallocated:          348.11GiB
>>      Device missing:                  0.00B
>>      Device slack:                    0.00B
>>      Used:                        799.86GiB
>>      Free (estimated):            289.58GiB      (min: 115.52GiB)
>>      Free (statfs, df):           289.55GiB
>>      Data ratio:                       1.33
>>      Metadata ratio:                   4.00
>>      Global reserve:              471.80MiB      (used: 0.00B)
>>      Multiple profiles:                  no
>>
>> Data,RAID5: Size:627.00GiB, Used:598.51GiB (95.46%)
>>     /dev/sdb      209.00GiB
>>     /dev/sdc      209.00GiB
>>     /dev/sdd      209.00GiB
>>     /dev/sde      209.00GiB
> 
> OK, my previous calculation was incorrect...
>
> For each device there should be 209GiB used by RAID5 chunks, and only
> 3/4 of that contributes to the scrubbed data bytes.
>
> Thus there seems to be some double accounting.
>
> This definitely needs some extra digging.

Well, this turns out to be something related to the patchset.

If you don't apply the patchset, the reporting is correct.

The problem is in the last patch, which calls
scrub_stripe_report_errors() twice, resulting in double accounting of
the values.
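
To illustrate the effect with a toy python model (made-up names, not
the kernel code): if the per-stripe report runs twice for the same
stripe, every counter is doubled, which matches the roughly 2x
data_bytes_scrubbed seen above.

class Counters:                  # stands in for the scrub progress counters
    def __init__(self):
        self.data_bytes_scrubbed = 0

def report_stripe(ctr, nbytes):  # stands in for the per-stripe report call
    ctr.data_bytes_scrubbed += nbytes

ctr = Counters()
report_stripe(ctr, 150 << 30)    # simplified: ~150 GiB reported once
report_stripe(ctr, 150 << 30)    # the unintended second call for the same data
print(ctr.data_bytes_scrubbed >> 30)   # ~300 instead of ~150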

I'll fix it soon.

Thanks for spotting this one!
Qu

> 
> Thanks,
> Qu
> 
>>
>> Metadata,RAID1C4: Size:2.00GiB, Used:472.56MiB (23.07%)
>>     /dev/sdb        2.00GiB
>>     /dev/sdc        2.00GiB
>>     /dev/sdd        2.00GiB
>>     /dev/sde        2.00GiB
>>
>> System,RAID1C4: Size:64.00MiB, Used:64.00KiB (0.10%)
>>     /dev/sdb       64.00MiB
>>     /dev/sdc       64.00MiB
>>     /dev/sdd       64.00MiB
>>     /dev/sde       64.00MiB
>>
>> Unallocated:
>>     /dev/sdb       87.03GiB
>>     /dev/sdc       87.03GiB
>>     /dev/sdd       87.03GiB
>>     /dev/sde       87.03GiB
>>
>>
>> There is 1 extra 2GB file now, which is why it shows a bit more usage now.
>>
>>
>>> Sure, I'll CC you when refreshing the patchset, extra tests are always
>>> appreciated.
>>>
>> Sounds good, thanks!
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-08-03  6:30 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-28 11:14 [PATCH 0/5] btrfs: scrub: improve the scrub performance Qu Wenruo
2023-07-28 11:14 ` [PATCH 1/5] btrfs: scrub: avoid unnecessary extent tree search preparing stripes Qu Wenruo
2023-07-28 11:14 ` [PATCH 2/5] btrfs: scrub: avoid unnecessary csum " Qu Wenruo
2023-07-28 11:14 ` [PATCH 3/5] btrfs: scrub: fix grouping of read IO Qu Wenruo
2023-07-28 11:14 ` [PATCH 4/5] btrfs: scrub: don't go ordered workqueue for dev-replace Qu Wenruo
2023-07-28 11:14 ` [PATCH 5/5] btrfs: scrub: move write back of repaired sectors into scrub_stripe_read_repair_worker() Qu Wenruo
2023-07-28 12:38 ` [PATCH 0/5] btrfs: scrub: improve the scrub performance Martin Steigerwald
2023-07-28 16:50   ` David Sterba
2023-07-28 21:14     ` Martin Steigerwald
2023-08-01 20:14 ` Jani Partanen
2023-08-01 22:06   ` Qu Wenruo
2023-08-01 23:48     ` Jani Partanen
2023-08-02  1:56       ` Qu Wenruo
2023-08-02  2:15         ` Jani Partanen
2023-08-02  2:20           ` Qu Wenruo
2023-08-03  6:30             ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.