* [PATCH RFC 0/4] btrfs: refactor scrub entrances for each profile
@ 2021-12-21  2:33 Qu Wenruo
  2021-12-21  2:33 ` [PATCH RFC 1/4] btrfs: introduce a helper to locate an extent item Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Qu Wenruo @ 2021-12-21  2:33 UTC (permalink / raw)
  To: linux-btrfs

Merry Christmas and happy new year!

The branch is based on several recently submitted small cleanups, thus
it's better to fetch the branch from github:

https://github.com/adam900710/linux/tree/refactor_scrub

[CRAP-BUT-IT-WORKS(TM)]
Scrub is one of the areas we seldom touch because:

- It's a mess
  Just check the scrub_stripe() function.
  It's a function scrubbing a stripe for *all* profiles.

  It's nearly 400 lines for a single complex function, with a double while()
  loop and several different jumps inside the loop.

  Not to mention the lack of comments for various structures.

  Such a function should never, and will never, be accepted under our
  current code standards.

- It just works
  I have hit more than 10 bugs during development, and several times I
  just wanted to give up on the refactor: even though the existing code
  is crap, it works and passes the existing scrub/replace test group,
  while no matter how small a change I made, my code kept failing the
  same tests.

[REFACTOR-IDEA]

The core idea here is to get rid of the one-size-fits-all solution in
scrub_stripe().

Instead, we explicitly separate the scrub into 3 groups (the idea is
from my btrfs-fuse project):

- Simple-mirror based profiles
  This includes SINGLE/DUP/RAID1/RAID1C* profiles.
  They have no stripes, and their repair is purely mirror based.

- Simple-stripe based profiles
  This includes RAID0/RAID10 profiles.
  They are just simple stripes (no P/Q and no rotation), with extra
  mirrors to support their repair.

- RAID56
  The most complex profiles; they have extra P/Q stripes and rotation.

[REFACTOR-IMPLEMENTATION]

So we have 3 entrances for all those supported profiles:

- scrub_simple_mirror()
  For SINGLE/DUP/RAID1/RAID1C* profiles.
  Just go through each extent in the range and scrub it.

- scrub_simple_stripe()
  For RAID0/RAID10 profiles.
  Instead we go through each data stripe first, then inside each data
  stripe we call scrub_simple_mirror(), since after the stripe split,
  RAID0 is just SINGLE and RAID10 is just RAID1.

- scrub_stripe() untouched for RAID56
  RAID56 still has all the complex things to do, but they can still be
  split into two types (already done by the original code):

  * data stripes
    They are no different in the verification part; RAID56 is just
    SINGLE if we ignore the repair path.
    It's only in the repair path that the code diverges.

    So we can reuse scrub_simple_mirror() again.

  * P/Q stripes
    They already have a dedicated function handling the case.
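
The resulting dispatch inside scrub_stripe() then looks roughly like the
sketch below (simplified from the patches; the real code also handles the
return values and jumps to the common exit path):

    const u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;

    if (!(profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10 |
                     BTRFS_BLOCK_GROUP_RAID56_MASK))) {
            /* SINGLE/DUP/RAID1/RAID1C*: the whole range is mirror based */
            ret = scrub_simple_mirror(sctx, root, csum_root, bg, map,
                                      bg->start, bg->length, scrub_dev,
                                      map->stripes[stripe_index].physical,
                                      stripe_index + 1);
    } else if (profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10)) {
            /* RAID0/RAID10: iterate data stripes, each is simple mirror */
            ret = scrub_simple_stripe(sctx, root, csum_root, bg, map,
                                      scrub_dev, stripe_index);
    } else {
            /*
             * RAID56: data stripes reuse scrub_simple_mirror(), while P/Q
             * stripes keep the dedicated scrub_raid56_parity() path.
             */
    }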

With all these refactors, although we have several more functions, we
get rid of:

- A double while () loop

- Several jumps inside the double loop

- Complex calculation to try to fit all profiles

And we get:

- Better comments

- More dedicated functions

- A better basis for further refactors

[RFC]
- Whether this refactor is preferred
  Since one entrance, scrub_simple_mirror(), gets reused inside the other
  entrances, I'm not sure whether this counts as layering breakage.

- More testing in the real world
  Although it passes all known scrub/replace tests, I'm not yet confident
  in the RAID56 coverage.

Qu Wenruo (4):
  btrfs: introduce a helper to locate an extent item
  btrfs: introduce dedicated helper to scrub simple-mirror based range
  btrfs: introduce dedicated helper to scrub simple-stripe based range
  btrfs: use scrub_simple_mirror() to handle RAID56 data stripe scrub

 fs/btrfs/scrub.c | 680 ++++++++++++++++++++++++++++-------------------
 1 file changed, 400 insertions(+), 280 deletions(-)

-- 
2.34.1



* [PATCH RFC 1/4] btrfs: introduce a helper to locate an extent item
  2021-12-21  2:33 [PATCH RFC 0/4] btrfs: refactor scrub entrances for each profile Qu Wenruo
@ 2021-12-21  2:33 ` Qu Wenruo
  2021-12-21  2:33 ` [PATCH RFC 2/4] btrfs: introduce dedicated helper to scrub simple-mirror based range Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2021-12-21  2:33 UTC (permalink / raw)
  To: linux-btrfs

The new helper, find_first_extent_item(), will locate an extent item
(either EXTENT_ITEM or METADATA_ITEM) which covers any byte of the
search range.

This helper will later be used to refactor scrub code.
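
A minimal usage sketch (roughly how the later patches in this series
consume the helper; @start/@len stand for the search range,
get_extent_info() is added in the next patch, and error/cancel handling
is omitted):

	struct btrfs_path path = {};
	u64 cur = start;
	int ret;

	path.search_commit_root = 1;
	path.skip_locking = 1;
	while (cur < start + len) {
		u64 extent_start, extent_len, flags, gen;

		ret = find_first_extent_item(extent_root, &path, cur,
					     start + len - cur);
		if (ret)	/* >0: no more extents, <0: error */
			break;
		get_extent_info(&path, &extent_start, &extent_len,
				&flags, &gen);
		/* Handle [max(extent_start, cur), extent_start + extent_len) */
		cur = extent_start + extent_len;
	}
	btrfs_release_path(&path);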

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f20ad60dcc0e..640f3de38e18 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3172,6 +3172,114 @@ static int sync_write_pointer_for_zoned(struct scrub_ctx *sctx, u64 logical,
 	return ret;
 }
 
+/*
+ * Return 0 if the extent item range covers any byte of the range.
+ * Return <0 if the extent item is before @search_start.
+ * Return >0 if the extent item is after @search_start + @search_len.
+ */
+static int compare_extent_item_range(struct btrfs_path *path,
+				     u64 search_start, u64 search_len)
+{
+	struct btrfs_fs_info *fs_info = path->nodes[0]->fs_info;
+	u64 len;
+	struct btrfs_key key;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	ASSERT(key.type == BTRFS_EXTENT_ITEM_KEY ||
+	       key.type == BTRFS_METADATA_ITEM_KEY);
+	if (key.type == BTRFS_METADATA_ITEM_KEY)
+		len = fs_info->nodesize;
+	else
+		len = key.offset;
+
+	if (key.objectid + len <= search_start)
+		return -1;
+	if (key.objectid >= search_start + search_len)
+		return 1;
+	return 0;
+}
+
+/*
+ * Helper to locate one extent item which covers any byte in range
+ * [@search_start, @search_start + @search_len)
+ *
+ * If the path is not initialized, we will initialize the search by doing
+ * a btrfs_search_slot().
+ * If the path is already initialized, we will use the path as the initial
+ * slot, to avoid duplicated btrfs_search_slot() calls.
+ *
+ * NOTE: If an extent item starts before @search_start, we will still return
+ * the extent item. This is for data extents crossing stripe boundaries.
+ *
+ * Return 0 if we found such extent item, and @path will point to the
+ * extent item.
+ * Return >0 if no such extent item can be found, and @path will be released.
+ * Return <0 if we hit a fatal error, and @path will be released.
+ */
+static int find_first_extent_item(struct btrfs_root *extent_root,
+				  struct btrfs_path *path,
+				  u64 search_start, u64 search_len)
+{
+	struct btrfs_fs_info *fs_info = extent_root->fs_info;
+	struct btrfs_key key;
+	int ret;
+
+	/* Continue using the existing path */
+	if (path->nodes[0])
+		goto search_forward;
+
+	if (btrfs_fs_incompat(fs_info, SKINNY_METADATA))
+		key.type = BTRFS_METADATA_ITEM_KEY;
+	else
+		key.type = BTRFS_EXTENT_ITEM_KEY;
+	key.objectid = search_start;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+	if (ret < 0)
+		return ret;
+
+	ASSERT(ret > 0);
+	/*
+	 * Here we intentionally pass 0 as @min_objectid, as there could be
+	 * an extent item starting before @search_start.
+	 */
+	ret = btrfs_previous_extent_item(extent_root, path, 0);
+	if (ret < 0)
+		return ret;
+	/*
+	 * No matter whether we have found an extent item, the next loop will
+	 * properly do every check on the key.
+	 */
+search_forward:
+	while (true) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (key.objectid >= search_start + search_len)
+			break;
+		if (key.type != BTRFS_METADATA_ITEM_KEY &&
+		    key.type != BTRFS_EXTENT_ITEM_KEY)
+			goto next;
+
+		ret = compare_extent_item_range(path, search_start, search_len);
+		if (ret == 0)
+			return ret;
+		if (ret > 0)
+			break;
+next:
+		path->slots[0]++;
+		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+			ret = btrfs_next_leaf(extent_root, path);
+			if (ret) {
+				/* Either no more item or fatal error */
+				btrfs_release_path(path);
+				return ret;
+			}
+		}
+	}
+	btrfs_release_path(path);
+	return 1;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct btrfs_block_group *bg,
 					   struct map_lookup *map,
-- 
2.34.1



* [PATCH RFC 2/4] btrfs: introduce dedicated helper to scrub simple-mirror based range
  2021-12-21  2:33 [PATCH RFC 0/4] btrfs: refactor scrub entrances for each profile Qu Wenruo
  2021-12-21  2:33 ` [PATCH RFC 1/4] btrfs: introduce a helper to locate an extent item Qu Wenruo
@ 2021-12-21  2:33 ` Qu Wenruo
  2021-12-21  2:33 ` [PATCH RFC 3/4] btrfs: introduce dedicated helper to scrub simple-stripe " Qu Wenruo
  2021-12-21  2:33 ` [PATCH RFC 4/4] btrfs: use scrub_simple_mirror() to handle RAID56 data stripe scrub Qu Wenruo
  3 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2021-12-21  2:33 UTC (permalink / raw)
  To: linux-btrfs

The new helper, scrub_simple_mirror(), will scrub all extents inside a range
which only has simple mirror based duplication.

This covers every range in SINGLE/DUP/RAID1/RAID1C*, and each data
stripe of RAID0/RAID10.

Currently we will use this function to scrub SINGLE/DUP/RAID1/RAID1C*
profiles.
As one can see, the new entrance for those simple-mirror based profiles
can be fairly small (with comments it only reaches around 100 lines).

This function will be the basis for the upcoming scrub refactor.
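
In outline, ignoring the cancel/pause and removed-block-group handling,
the loop boils down to this sketch (the full version is in the diff
below):

	while (cur_logical < logical_end) {
		ret = find_first_extent_item(extent_root, &path, cur_logical,
					     logical_end - cur_logical);
		if (ret)	/* No more extents, or error */
			break;
		get_extent_info(&path, &extent_start, &extent_len,
				&extent_flags, &extent_gen);
		/* Skip any hole before the extent */
		cur_logical = max(extent_start, cur_logical);
		/* Limit by extent end, range end and the 64K max scrub size */
		scrub_len = min(min(extent_start + extent_len, logical_end),
				cur_logical + max_length) - cur_logical;
		/* Csum lookup for data extents, then the actual scrub */
		ret = scrub_extent(...);
		cur_logical += scrub_len;
	}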

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 190 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 190 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 640f3de38e18..ddd069bd2375 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3280,6 +3280,172 @@ static int find_first_extent_item(struct btrfs_root *extent_root,
 	return 1;
 }
 
+static void get_extent_info(struct btrfs_path *path, u64 *extent_start_ret,
+			    u64 *size_ret, u64 *flags_ret, u64 *generation_ret)
+{
+	struct btrfs_key key;
+	struct btrfs_extent_item *ei;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	ASSERT(key.type == BTRFS_METADATA_ITEM_KEY ||
+	       key.type == BTRFS_EXTENT_ITEM_KEY);
+	*extent_start_ret = key.objectid;
+	if (key.type == BTRFS_METADATA_ITEM_KEY)
+		*size_ret = path->nodes[0]->fs_info->nodesize;
+	else
+		*size_ret = key.offset;
+	ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
+			    struct btrfs_extent_item);
+	*flags_ret = btrfs_extent_flags(path->nodes[0], ei);
+	*generation_ret = btrfs_extent_generation(path->nodes[0], ei);
+}
+
+static bool does_range_cross_boundary(u64 extent_start, u64 extent_len,
+				      u64 boundary_start, u64 boundary_len)
+{
+	return (extent_start < boundary_start &&
+		extent_start + extent_len > boundary_start) ||
+	       (extent_start < boundary_start + boundary_len &&
+		extent_start + extent_len > boundary_start + boundary_len);
+}
+
+/*
+ * Scrub one range which can only have a simple mirror based profile.
+ * (Including all ranges in SINGLE/DUP/RAID1/RAID1C*, and each stripe in
+ *  RAID0/RAID10).
+ *
+ * Since we may need to handle a subset of a block group, we need the
+ * @logical_start and @logical_length parameters.
+ */
+static int scrub_simple_mirror(struct scrub_ctx *sctx,
+				struct btrfs_root *extent_root,
+				struct btrfs_root *csum_root,
+				struct btrfs_block_group *bg,
+				struct map_lookup *map,
+				u64 logical_start, u64 logical_length,
+				struct btrfs_device *device,
+				u64 physical, int mirror_num)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	const u64 logical_end = logical_start + logical_length;
+	/* An artificial limit, inherited from the old scrub behavior */
+	const u32 max_length = SZ_64K;
+	struct btrfs_path path = {};
+	u64 cur_logical = logical_start;
+	int ret;
+
+	/* The range must be inside the bg */
+	ASSERT(logical_start >= bg->start &&
+	       logical_end <= bg->start + bg->length);
+
+	path.search_commit_root = 1;
+	path.skip_locking = 1;
+	/* Go through each extent in the range */
+	while (cur_logical < logical_end) {
+		int cur_mirror = mirror_num;
+		struct btrfs_device *target_dev = device;
+		u64 extent_start;
+		u64 extent_len;
+		u64 extent_flags;
+		u64 extent_gen;
+		u64 scrub_len;
+		u64 cur_physical;
+
+		/* Canceled ? */
+		if (atomic_read(&fs_info->scrub_cancel_req) ||
+		    atomic_read(&sctx->cancel_req)) {
+			ret = -ECANCELED;
+			break;
+		}
+		/* Paused ? */
+		if (atomic_read(&fs_info->scrub_pause_req)) {
+			/* Push queued extents */
+			sctx->flush_all_writes = true;
+			scrub_submit(sctx);
+			mutex_lock(&sctx->wr_lock);
+			scrub_wr_submit(sctx);
+			mutex_unlock(&sctx->wr_lock);
+			wait_event(sctx->list_wait,
+				   atomic_read(&sctx->bios_in_flight) == 0);
+			sctx->flush_all_writes = false;
+			scrub_blocked_if_needed(fs_info);
+		}
+		/* Block group removed? */
+		spin_lock(&bg->lock);
+		if (bg->removed) {
+			spin_unlock(&bg->lock);
+			ret = 0;
+			break;
+		}
+		spin_unlock(&bg->lock);
+
+		ret = find_first_extent_item(extent_root, &path, cur_logical,
+					     logical_end - cur_logical);
+		if (ret > 0) {
+			/* No more extents, just update the accounting */
+			sctx->stat.last_physical = physical + logical_length;
+			ret = 0;
+			break;
+		}
+		if (ret < 0)
+			break;
+		get_extent_info(&path, &extent_start, &extent_len,
+				&extent_flags, &extent_gen);
+		/* Skip hole range which doesn't have any extent */
+		cur_logical = max(extent_start, cur_logical);
+
+		/*
+		 * Scrub len has three limits:
+		 * - Extent size limit
+		 * - Scrub range limit
+		 *   This is especially important for RAID0/RAID10 to reuse
+		 *   this function
+		 * - Max scrub size limit
+		 */
+		scrub_len = min(min(extent_start + extent_len,
+				    logical_end), cur_logical + max_length) -
+			    cur_logical;
+		cur_physical = cur_logical - logical_start + physical;
+
+		if (sctx->is_dev_replace)
+			scrub_remap_extent(fs_info, cur_logical, scrub_len,
+					   &cur_physical, &target_dev, &cur_mirror);
+		if (extent_flags & BTRFS_EXTENT_FLAG_DATA) {
+			ret = btrfs_lookup_csums_range(csum_root, cur_logical,
+					cur_logical + scrub_len - 1,
+					&sctx->csum_list, 1);
+			if (ret)
+				break;
+		}
+		if ((extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) &&
+		    does_range_cross_boundary(extent_start, extent_len,
+					      logical_start, logical_length)) {
+			btrfs_err(fs_info,
+"scrub: tree block %llu spanning boundaries, ignored. boundary=[%llu, %llu)",
+				  extent_start, logical_start, logical_end);
+			spin_lock(&sctx->stat_lock);
+			sctx->stat.uncorrectable_errors++;
+			spin_unlock(&sctx->stat_lock);
+			cur_logical += scrub_len;
+			continue;
+		}
+		ret = scrub_extent(sctx, map, cur_logical, scrub_len, cur_physical,
+				   target_dev, extent_flags, extent_gen,
+				   cur_mirror, cur_logical - logical_start +
+				   physical);
+		scrub_free_csums(sctx);
+		if (ret)
+			break;
+		if (sctx->is_dev_replace)
+			sync_replace_for_zoned(sctx);
+		cur_logical += scrub_len;
+		/* Don't hold the CPU for too long */
+		cond_resched();
+	}
+	btrfs_release_path(&path);
+	return ret;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct btrfs_block_group *bg,
 					   struct map_lookup *map,
@@ -3292,6 +3458,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	struct btrfs_root *csum_root;
 	struct btrfs_extent_item *extent;
 	struct blk_plug plug;
+	const u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
 	const u64 chunk_logical = bg->start;
 	u64 flags;
 	int ret;
@@ -3386,6 +3553,29 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		sctx->flush_all_writes = true;
 	}
 
+	/*
+	 * There used to be a big double loop to handle all profiles using the
+	 * same routine, which grew larger and more gross over time.
+	 *
+	 * So here we handle each profile differently, so that simpler profiles
+	 * have a simpler scrubbing function.
+	 */
+	if (!(profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10 |
+			 BTRFS_BLOCK_GROUP_RAID56_MASK))) {
+		/*
+		 * The above check rules out all complex profiles, the remaining
+		 * profiles are SINGLE|DUP|RAID1|RAID1C*, which are simple
+		 * mirrored duplication without stripes.
+		 *
+		 * Only @physical and @mirror_num need to be calculated using
+		 * @stripe_index.
+		 */
+		ret = scrub_simple_mirror(sctx, root, csum_root, bg, map,
+				bg->start, bg->length, scrub_dev,
+				map->stripes[stripe_index].physical,
+				stripe_index + 1);
+		goto out;
+	}
 	/*
 	 * now find all extents for each stripe and scrub them
 	 */
-- 
2.34.1



* [PATCH RFC 3/4] btrfs: introduce dedicated helper to scrub simple-stripe based range
  2021-12-21  2:33 [PATCH RFC 0/4] btrfs: refactor scrub entrances for each profile Qu Wenruo
  2021-12-21  2:33 ` [PATCH RFC 1/4] btrfs: introduce a helper to locate an extent item Qu Wenruo
  2021-12-21  2:33 ` [PATCH RFC 2/4] btrfs: introduce dedicated helper to scrub simple-mirror based range Qu Wenruo
@ 2021-12-21  2:33 ` Qu Wenruo
  2022-01-03 16:33   ` Josef Bacik
  2021-12-21  2:33 ` [PATCH RFC 4/4] btrfs: use scrub_simple_mirror() to handle RAID56 data stripe scrub Qu Wenruo
  3 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2021-12-21  2:33 UTC (permalink / raw)
  To: linux-btrfs

The new entrance will iterate through each data stripe which belongs to
the target device.

And since inside each data stripe, RAID0 is just SINGLE, while RAID10 is
just RAID1, we can reuse scrub_simple_mirror() to do the scrub properly.
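
As a worked example (a hypothetical RAID10 chunk with num_stripes=4,
sub_stripes=2 and the usual 64K stripe_len), scrubbing stripe_index=3
gives:

  logical_increment = (4 / 2) * 64K             = 128K
  orig_logical      = (3 / 2) * 64K + bg->start = bg->start + 64K
  mirror_num        = 3 % 2 + 1                 = 2

So we scrub 64K at bg->start + 64K as mirror 2, then advance the logical
bytenr by 128K (and the physical bytenr by 64K) and repeat until we reach
the end of the block group.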

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 60 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 57 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ddd069bd2375..aff9db6fbc7e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3446,6 +3446,55 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
 	return ret;
 }
 
+static int scrub_simple_stripe(struct scrub_ctx *sctx,
+				struct btrfs_root *extent_root,
+				struct btrfs_root *csum_root,
+				struct btrfs_block_group *bg,
+				struct map_lookup *map,
+				struct btrfs_device *device,
+				int stripe_index)
+{
+	/* The increment of logical bytenr */
+	const u64 logical_increment = (map->num_stripes / map->sub_stripes) *
+				      map->stripe_len;
+	/*
+	 * Our starting logical bytenr needs to be calculated based on
+	 * stripe_index.
+	 *
+	 * stripe_index / sub_stripes gives how many data stripes we need to
+	 * skip.
+	 */
+	const u64 orig_logical = (stripe_index / map->sub_stripes) *
+				 map->stripe_len + bg->start;
+	const u64 orig_physical = map->stripes[stripe_index].physical;
+	/*
+	 * For RAID0, it's fixed to 1.
+	 * For RAID10, the mirror_num is always 0,1,0,1,...
+	 * For RAID10, the mirror_num is always 1,2,1,2,...
+	const int mirror_num = stripe_index % map->sub_stripes + 1;
+	u64 cur_logical = orig_logical;
+	u64 cur_physical = orig_physical;
+	int ret = 0;
+
+	while (cur_logical < bg->start + bg->length) {
+		/*
+		 * Inside each stripe, RAID0 is just SINGLE, and RAID10 is
+		 * just RAID1, so we can reuse scrub_simple_mirror() to scrub
+		 * this stripe.
+		 */
+		ret = scrub_simple_mirror(sctx, extent_root, csum_root, bg, map,
+					  cur_logical, map->stripe_len, device,
+					  cur_physical, mirror_num);
+		if (ret)
+			return ret;
+		/* Skip to next stripe which belongs to the target device */
+		cur_logical += logical_increment;
+		/* For physical offset, we just go to next stripe */
+		cur_physical += map->stripe_len;
+	}
+	return ret;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct btrfs_block_group *bg,
 					   struct map_lookup *map,
@@ -3576,9 +3625,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 				stripe_index + 1);
 		goto out;
 	}
-	/*
-	 * now find all extents for each stripe and scrub them
-	 */
+	if (profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10)) {
+		ret = scrub_simple_stripe(sctx, root, csum_root, bg, map,
+					  scrub_dev, stripe_index);
+		goto out;
+	}
+
+	/* Only RAID56 goes through the old code */
+	ASSERT(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK);
 	ret = 0;
 	while (physical < physical_end) {
 		/*
-- 
2.34.1



* [PATCH RFC 4/4] btrfs: use scrub_simple_mirror() to handle RAID56 data stripe scrub
  2021-12-21  2:33 [PATCH RFC 0/4] btrfs: refactor scrub entrances for each profile Qu Wenruo
                   ` (2 preceding siblings ...)
  2021-12-21  2:33 ` [PATCH RFC 3/4] btrfs: introduce dedicated helper to scrub simple-stripe " Qu Wenruo
@ 2021-12-21  2:33 ` Qu Wenruo
  3 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2021-12-21  2:33 UTC (permalink / raw)
  To: linux-btrfs

Although RAID56 has a complex repair mechanism, which involves reading
the whole full stripe, the current data stripe scrub is in fact no
different from SINGLE/RAID1.

The point here is, for data stripes we just check the csum for each
extent we hit.
Only in the csum mismatch case does our repair path diverge.

So we can still reuse scrub_simple_mirror() for RAID56 data stripes.
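
After this change, the RAID56 branch of scrub_stripe() reduces to roughly
the following sketch (simplified from the diff below, error handling
omitted):

	while (physical < physical_end) {
		ret = get_raid56_logic_offset(physical, stripe_index, map,
					      &logical, &stripe_logical);
		logical += chunk_logical;
		if (ret) {
			/* P/Q stripe, keep the dedicated parity path */
			stripe_logical += chunk_logical;
			scrub_raid56_parity(sctx, map, scrub_dev, stripe_logical,
					    stripe_logical + increment);
		} else {
			/* Data stripe, just SINGLE plus mirror based repair */
			scrub_simple_mirror(sctx, root, csum_root, bg, map,
					    logical, map->stripe_len,
					    scrub_dev, physical, 1);
		}
		logical += increment;
		physical += map->stripe_len;
	}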

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 330 +++++++----------------------------------------
 1 file changed, 49 insertions(+), 281 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index aff9db6fbc7e..c0c2e9bfd606 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3505,66 +3505,27 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	struct btrfs_root *root;
 	struct btrfs_root *csum_root;
-	struct btrfs_extent_item *extent;
 	struct blk_plug plug;
 	const u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
 	const u64 chunk_logical = bg->start;
-	u64 flags;
 	int ret;
-	int slot;
 	u64 nstripes;
-	struct extent_buffer *l;
 	u64 physical;
 	u64 logical;
 	u64 logic_end;
 	u64 physical_end;
-	u64 generation;
-	int mirror_num;
-	struct btrfs_key key;
-	u64 increment = map->stripe_len;
-	u64 offset;
-	u64 extent_logical;
-	u64 extent_physical;
-	/*
-	 * Unlike chunk length, extent length should never go beyond
-	 * BTRFS_MAX_EXTENT_SIZE, thus u32 is enough here.
-	 */
-	u32 extent_len;
+	u64 increment;	/* The logical increment after finishing one stripe */
+	u64 offset;	/* Offset inside the chunk */
 	u64 stripe_logical;
 	u64 stripe_end;
-	struct btrfs_device *extent_dev;
-	int extent_mirror_num;
 	int stop_loop = 0;
 
-	physical = map->stripes[stripe_index].physical;
-	offset = 0;
-	nstripes = div64_u64(dev_ext_len, map->stripe_len);
-	mirror_num = 1;
-	increment = map->stripe_len;
-	if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
-		offset = map->stripe_len * stripe_index;
-		increment = map->stripe_len * map->num_stripes;
-	} else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
-		int factor = map->num_stripes / map->sub_stripes;
-		offset = map->stripe_len * (stripe_index / map->sub_stripes);
-		increment = map->stripe_len * factor;
-		mirror_num = stripe_index % map->sub_stripes + 1;
-	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1_MASK) {
-		mirror_num = stripe_index % map->num_stripes + 1;
-	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
-		mirror_num = stripe_index % map->num_stripes + 1;
-	} else if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-		get_raid56_logic_offset(physical, stripe_index, map, &offset,
-					NULL);
-		increment = map->stripe_len * nr_data_stripes(map);
-	}
-
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
 
 	/*
-	 * work on commit root. The related disk blocks are static as
+	 * Work on commit root. The related disk blocks are static as
 	 * long as COW is applied. This means, it is save to rewrite
 	 * them to repair disk errors without any race conditions
 	 */
@@ -3572,32 +3533,24 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	path->skip_locking = 1;
 	path->reada = READA_FORWARD;
 
-	logical = chunk_logical + offset;
-	physical_end = physical + nstripes * map->stripe_len;
-	if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-		get_raid56_logic_offset(physical_end, stripe_index,
-					map, &logic_end, NULL);
-		logic_end += chunk_logical;
-	} else {
-		logic_end = logical + increment * nstripes;
-	}
 	wait_event(sctx->list_wait,
 		   atomic_read(&sctx->bios_in_flight) == 0);
 	scrub_blocked_if_needed(fs_info);
 
-	root = btrfs_extent_root(fs_info, logical);
-	csum_root = btrfs_csum_root(fs_info, logical);
+	root = btrfs_extent_root(fs_info, bg->start);
+	csum_root = btrfs_csum_root(fs_info, bg->start);
 
 	/*
-	 * collect all data csums for the stripe to avoid seeking during
+	 * Collect all data csums for the stripe to avoid seeking during
 	 * the scrub. This might currently (crc32) end up to be about 1MB
 	 */
 	blk_start_plug(&plug);
 
 	if (sctx->is_dev_replace &&
-	    btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) {
+	    btrfs_dev_is_sequential(sctx->wr_tgtdev,
+				    map->stripes[stripe_index].physical)) {
 		mutex_lock(&sctx->wr_lock);
-		sctx->write_pointer = physical;
+		sctx->write_pointer = map->stripes[stripe_index].physical;
 		mutex_unlock(&sctx->wr_lock);
 		sctx->flush_all_writes = true;
 	}
@@ -3633,239 +3586,54 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 
 	/* Only RAID56 goes through the old code */
 	ASSERT(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK);
+
+	physical = map->stripes[stripe_index].physical;
+	offset = 0;
+	nstripes = div64_u64(dev_ext_len, map->stripe_len);
+	get_raid56_logic_offset(physical, stripe_index, map, &offset, NULL);
+	increment = map->stripe_len * nr_data_stripes(map);
+
+	logical = chunk_logical + offset;
+	physical_end = physical + nstripes * map->stripe_len;
+	get_raid56_logic_offset(physical_end, stripe_index, map, &logic_end,
+				NULL);
+	logic_end += chunk_logical;
+
 	ret = 0;
+	/*
+	 * Due to the rotation, for RAID56 it's better to iterate each stripe
+	 * using its physical offset.
+	 */
 	while (physical < physical_end) {
-		/*
-		 * canceled?
-		 */
-		if (atomic_read(&fs_info->scrub_cancel_req) ||
-		    atomic_read(&sctx->cancel_req)) {
-			ret = -ECANCELED;
-			goto out;
+		ret = get_raid56_logic_offset(physical, stripe_index, map,
+					      &logical, &stripe_logical);
+		logical += chunk_logical;
+		if (ret) {
+			/* It is a parity stripe */
+			stripe_logical += chunk_logical;
+			stripe_end = stripe_logical + increment;
+			ret = scrub_raid56_parity(sctx, map, scrub_dev,
+						  stripe_logical,
+						  stripe_end);
+			if (ret)
+				goto out;
+			goto next;
 		}
+
 		/*
-		 * check to see if we have to pause
+		 * Now we're at data stripes, scrub each extent in the range.
+		 *
+		 * At this stage, if we ignore the repair part, each data stripe
+		 * is no different from the SINGLE profile.
+		 * We can reuse scrub_simple_mirror() here, as the repair part
+		 * is still based on @mirror_num.
 		 */
-		if (atomic_read(&fs_info->scrub_pause_req)) {
-			/* push queued extents */
-			sctx->flush_all_writes = true;
-			scrub_submit(sctx);
-			mutex_lock(&sctx->wr_lock);
-			scrub_wr_submit(sctx);
-			mutex_unlock(&sctx->wr_lock);
-			wait_event(sctx->list_wait,
-				   atomic_read(&sctx->bios_in_flight) == 0);
-			sctx->flush_all_writes = false;
-			scrub_blocked_if_needed(fs_info);
-		}
-
-		if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-			ret = get_raid56_logic_offset(physical, stripe_index,
-						      map, &logical,
-						      &stripe_logical);
-			logical += chunk_logical;
-			if (ret) {
-				/* it is parity strip */
-				stripe_logical += chunk_logical;
-				stripe_end = stripe_logical + increment;
-				ret = scrub_raid56_parity(sctx, map, scrub_dev,
-							  stripe_logical,
-							  stripe_end);
-				if (ret)
-					goto out;
-				goto skip;
-			}
-		}
-
-		if (btrfs_fs_incompat(fs_info, SKINNY_METADATA))
-			key.type = BTRFS_METADATA_ITEM_KEY;
-		else
-			key.type = BTRFS_EXTENT_ITEM_KEY;
-		key.objectid = logical;
-		key.offset = (u64)-1;
-
-		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		ret = scrub_simple_mirror(sctx, root, csum_root, bg, map,
+					  logical, map->stripe_len,
+					  scrub_dev, physical, 1);
 		if (ret < 0)
 			goto out;
-
-		if (ret > 0) {
-			ret = btrfs_previous_extent_item(root, path, 0);
-			if (ret < 0)
-				goto out;
-			if (ret > 0) {
-				/* there's no smaller item, so stick with the
-				 * larger one */
-				btrfs_release_path(path);
-				ret = btrfs_search_slot(NULL, root, &key,
-							path, 0, 0);
-				if (ret < 0)
-					goto out;
-			}
-		}
-
-		stop_loop = 0;
-		while (1) {
-			u64 bytes;
-
-			l = path->nodes[0];
-			slot = path->slots[0];
-			if (slot >= btrfs_header_nritems(l)) {
-				ret = btrfs_next_leaf(root, path);
-				if (ret == 0)
-					continue;
-				if (ret < 0)
-					goto out;
-
-				stop_loop = 1;
-				break;
-			}
-			btrfs_item_key_to_cpu(l, &key, slot);
-
-			if (key.type != BTRFS_EXTENT_ITEM_KEY &&
-			    key.type != BTRFS_METADATA_ITEM_KEY)
-				goto next;
-
-			if (key.type == BTRFS_METADATA_ITEM_KEY)
-				bytes = fs_info->nodesize;
-			else
-				bytes = key.offset;
-
-			if (key.objectid + bytes <= logical)
-				goto next;
-
-			if (key.objectid >= logical + map->stripe_len) {
-				/* out of this device extent */
-				if (key.objectid >= logic_end)
-					stop_loop = 1;
-				break;
-			}
-
-			/*
-			 * If our block group was removed in the meanwhile, just
-			 * stop scrubbing since there is no point in continuing.
-			 * Continuing would prevent reusing its device extents
-			 * for new block groups for a long time.
-			 */
-			spin_lock(&bg->lock);
-			if (bg->removed) {
-				spin_unlock(&bg->lock);
-				ret = 0;
-				goto out;
-			}
-			spin_unlock(&bg->lock);
-
-			extent = btrfs_item_ptr(l, slot,
-						struct btrfs_extent_item);
-			flags = btrfs_extent_flags(l, extent);
-			generation = btrfs_extent_generation(l, extent);
-
-			if ((flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) &&
-			    (key.objectid < logical ||
-			     key.objectid + bytes >
-			     logical + map->stripe_len)) {
-				btrfs_err(fs_info,
-					   "scrub: tree block %llu spanning stripes, ignored. logical=%llu",
-				       key.objectid, logical);
-				spin_lock(&sctx->stat_lock);
-				sctx->stat.uncorrectable_errors++;
-				spin_unlock(&sctx->stat_lock);
-				goto next;
-			}
-
-again:
-			extent_logical = key.objectid;
-			ASSERT(bytes <= U32_MAX);
-			extent_len = bytes;
-
-			/*
-			 * trim extent to this stripe
-			 */
-			if (extent_logical < logical) {
-				extent_len -= logical - extent_logical;
-				extent_logical = logical;
-			}
-			if (extent_logical + extent_len >
-			    logical + map->stripe_len) {
-				extent_len = logical + map->stripe_len -
-					     extent_logical;
-			}
-
-			extent_physical = extent_logical - logical + physical;
-			extent_dev = scrub_dev;
-			extent_mirror_num = mirror_num;
-			if (sctx->is_dev_replace)
-				scrub_remap_extent(fs_info, extent_logical,
-						   extent_len, &extent_physical,
-						   &extent_dev,
-						   &extent_mirror_num);
-
-			if (flags & BTRFS_EXTENT_FLAG_DATA) {
-				ret = btrfs_lookup_csums_range(csum_root,
-						extent_logical,
-						extent_logical + extent_len - 1,
-						&sctx->csum_list, 1);
-				if (ret)
-					goto out;
-			}
-
-			ret = scrub_extent(sctx, map, extent_logical, extent_len,
-					   extent_physical, extent_dev, flags,
-					   generation, extent_mirror_num,
-					   extent_logical - logical + physical);
-
-			scrub_free_csums(sctx);
-
-			if (ret)
-				goto out;
-
-			if (sctx->is_dev_replace)
-				sync_replace_for_zoned(sctx);
-
-			if (extent_logical + extent_len <
-			    key.objectid + bytes) {
-				if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-					/*
-					 * loop until we find next data stripe
-					 * or we have finished all stripes.
-					 */
-loop:
-					physical += map->stripe_len;
-					ret = get_raid56_logic_offset(physical,
-							stripe_index, map,
-							&logical, &stripe_logical);
-					logical += chunk_logical;
-
-					if (ret && physical < physical_end) {
-						stripe_logical += chunk_logical;
-						stripe_end = stripe_logical +
-								increment;
-						ret = scrub_raid56_parity(sctx,
-							map, scrub_dev,
-							stripe_logical,
-							stripe_end);
-						if (ret)
-							goto out;
-						goto loop;
-					}
-				} else {
-					physical += map->stripe_len;
-					logical += increment;
-				}
-				if (logical < key.objectid + bytes) {
-					cond_resched();
-					goto again;
-				}
-
-				if (physical >= physical_end) {
-					stop_loop = 1;
-					break;
-				}
-			}
 next:
-			path->slots[0]++;
-		}
-		btrfs_release_path(path);
-skip:
 		logical += increment;
 		physical += map->stripe_len;
 		spin_lock(&sctx->stat_lock);
-- 
2.34.1



* Re: [PATCH RFC 3/4] btrfs: introduce dedicated helper to scrub simple-stripe based range
  2021-12-21  2:33 ` [PATCH RFC 3/4] btrfs: introduce dedicated helper to scrub simple-stripe " Qu Wenruo
@ 2022-01-03 16:33   ` Josef Bacik
  2022-01-03 23:44     ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2022-01-03 16:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Dec 21, 2021 at 10:33:48AM +0800, Qu Wenruo wrote:
> The new entrance will iterate through each data stripe which belongs to
> the target device.
> 
> And since inside each data stripe, RAID0 is just SINGLE, while RAID10 is
> just RAID1, we can reuse scrub_simple_mirror() to do the scrub properly.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/scrub.c | 60 +++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 57 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ddd069bd2375..aff9db6fbc7e 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3446,6 +3446,55 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
>  	return ret;
>  }
>  
> +static int scrub_simple_stripe(struct scrub_ctx *sctx,
> +				struct btrfs_root *extent_root,
> +				struct btrfs_root *csum_root,
> +				struct btrfs_block_group *bg,
> +				struct map_lookup *map,
> +				struct btrfs_device *device,
> +				int stripe_index)
> +{
> +	/* The increment of logical bytenr */
> +	const u64 logical_increment = (map->num_stripes / map->sub_stripes) *
> +				      map->stripe_len;
> +	/*
> +	 * Our starting logical bytenr needs to be calculated based on
> +	 * stripe_index.
> +	 *
> +	 * stripe_index / sub_stripes gives how many data stripes we need to
> +	 * skip.
> +	 */
> +	const u64 orig_logical = (stripe_index / map->sub_stripes) *
> +				 map->stripe_len + bg->start;
> +	const u64 orig_physical = map->stripes[stripe_index].physical;
> +	/*
> +	 * For RAID0, it's fixed to 1.
> +	 * For RAID10, the mirror_num is always 1,2,1,2,...
> +	 */
> +	const int mirror_num = stripe_index % map->sub_stripes + 1;

For me I don't like having big comment blocks in the variable declaration part,
it makes it hard to read.

Also we have this sort of map math in a variety of different places.  If it's
complex enough that it needs a comment then I think it would be good to have
static inline helpers with the math required to get the information we want,
with the comments living there.  Probably not needed for every little piece of
math, but for the common ones we do all the time it would be good.
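
For example, something along these lines (purely illustrative, the names
are made up):

/* One full stripe in logical bytes covers all the data stripes */
static inline u64 simple_stripe_full_stripe_len(const struct map_lookup *map)
{
	return (map->num_stripes / map->sub_stripes) * map->stripe_len;
}

/* Logical bytenr of the first stripe covered by @stripe_index */
static inline u64 simple_stripe_get_logical(const struct map_lookup *map,
					    const struct btrfs_block_group *bg,
					    int stripe_index)
{
	return (stripe_index / map->sub_stripes) * map->stripe_len + bg->start;
}

/* 1 for RAID0, alternating 1/2 for RAID10 */
static inline int simple_stripe_mirror_num(const struct map_lookup *map,
					   int stripe_index)
{
	return stripe_index % map->sub_stripes + 1;
}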

If it doesn't make sense for any of the above things I would prefer to see
something like

/*
 * logical_increment - the size to increment based on the stripe size.
 * orig_logical - the actual logical bytenr based on the stripe we're scrubbing.
 * <etc>
 */
static int scrub_simple_stripe()

I'm not 100% married to this, it's purely an aesthetic opinion.  If I'm the
minority then that's ok, but the above doesn't look great to me.  Thanks,

Josef


* Re: [PATCH RFC 3/4] btrfs: introduce dedicated helper to scrub simple-stripe based range
  2022-01-03 16:33   ` Josef Bacik
@ 2022-01-03 23:44     ` Qu Wenruo
  0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2022-01-03 23:44 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo; +Cc: linux-btrfs



On 2022/1/4 00:33, Josef Bacik wrote:
> On Tue, Dec 21, 2021 at 10:33:48AM +0800, Qu Wenruo wrote:
>> The new entrance will iterate through each data stripe which belongs to
>> the target device.
>>
>> And since inside each data stripe, RAID0 is just SINGLE, while RAID10 is
>> just RAID1, we can reuse scrub_simple_mirror() to do the scrub properly.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/scrub.c | 60 +++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 57 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index ddd069bd2375..aff9db6fbc7e 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -3446,6 +3446,55 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
>>   	return ret;
>>   }
>>
>> +static int scrub_simple_stripe(struct scrub_ctx *sctx,
>> +				struct btrfs_root *extent_root,
>> +				struct btrfs_root *csum_root,
>> +				struct btrfs_block_group *bg,
>> +				struct map_lookup *map,
>> +				struct btrfs_device *device,
>> +				int stripe_index)
>> +{
>> +	/* The increment of logical bytenr */
>> +	const u64 logical_increment = (map->num_stripes / map->sub_stripes) *
>> +				      map->stripe_len;
>> +	/*
>> +	 * Our starting logical bytenr needs to be calculated based on
>> +	 * stripe_index.
>> +	 *
>> +	 * stripe_index / sub_stripes gives how many data stripes we need to
>> +	 * skip.
>> +	 */
>> +	const u64 orig_logical = (stripe_index / map->sub_stripes) *
>> +				 map->stripe_len + bg->start;
>> +	const u64 orig_physical = map->stripes[stripe_index].physical;
>> +	/*
>> +	 * For RAID0, it's fixed to 1.
>> +	 * For RAID10, the mirror_num is always 1,2,1,2,...
>> +	 */
>> +	const int mirror_num = stripe_index % map->sub_stripes + 1;
>
> For me I don't like having big comment blocks in the variable declaration part,
> it makes it hard to read.
>
> Also we have this sort of map math in a variety of different places.  If it's
> complex enough that it needs a comment then I think it would be good to have
> static inline helpers with the math required to get the information we want,
> with the comments living there.  Probably not needed for every little piece of
> math, but for the common ones we do all the time it would be good.

That makes sense.

Some helpers would definitely make more sense.

Thanks,
Qu
>
> If it doesn't make sense for any of the above things I would prefer to see
> something like
>
> /*
>   * logical_increment - the size to increment based on the stripe size.
>   * orig_logical - the actual logical bytenr based on the stripe we're scrubbing.
>   * <etc>
>   */
> static int scrub_simple_stripe()
>
> I'm not 100% married to this, it's purely an aesthetic opinion.  If I'm the
> minority then that's ok, but the above doesn't look great to me.  Thanks,
>
> Josef

