* [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag
@ 2023-07-03  7:32 Qu Wenruo
  2023-07-03  7:32 ` [PATCH 01/14] btrfs: raid56: remove unnecessary parameter for raid56_parity_alloc_scrub_rbio() Qu Wenruo
                   ` (15 more replies)
  0 siblings, 16 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
RFC->v1:
- Add RAID56 support
  This is the biggest advantage of the new scrub mode.

- More basic tests around various repair situations

[REPO]
Please fetch from github repo:
https://github.com/adam900710/linux/tree/scrub_logical

This is based on David's misc-next, but still has one extra regression
fix which is not yet merged. ("btrfs: raid56: always verify the P/Q
contents for scrub")

[BACKGROUND]
Scrub originally works on a per-device basis, which means that if we
want to scrub the full fs, we have to queue a scrub for each writeable
device.

This is normally fine, but some behavior is not ideal, like the
following:
		X	X+16K	X+32K
 Mirror 1	|XXXXXXX|       |
 Mirror 2	|       |XXXXXXX|

When scrubbing mirror 1, we find that [X, X+16K) is corrupted, then we
try to read from mirror 2 and repair it using the correct data.

The same applies to range [X+16K, X+32K) of mirror 2, so each good copy
ends up being read twice, which may or may not be a big deal.

But the problem easily gets much worse for RAID56, as both its repair
and its P/Q verification require the full stripe to be read out, e.g.:

		X	X+16K	X+32K
 Data stripe 1	|XXXXXXX|       |	|	|
 Data stripe 2	|       |XXXXXXX|	|	|
 Parity stripe	|       |	|XXXXXXX|	|

In the above case, if we're scrubbing all mirrors, we read the same
contents again and again:

Scrub on mirror 1:
- Read data stripe 1 for the initial read.
- Read data stripes 1 + 2 + parity for the rebuild.

Scrub on mirror 2:
- Read data stripe 2 for the initial read.
- Read data stripes 1 + 2 + parity for the rebuild.

Scrub on parity:
- Read data stripes 1 + 2 for the data stripe verification
- Read data stripes 1 + 2 + parity for the data rebuild
  This step is already improved by the recent optimization to use
  cached stripes.
- Read the parity stripe for the final verification

So data stripes are read *five* times, and parity stripes *three*
times.

[ENHANCEMENT]
Instead of the existing per-device scrub, this patchset introduces a
new flag, SCRUB_LOGICAL, to do logical address space based scrub.

Unlike per-device scrub, this new flag would make scrub a per-fs
operation.

This allows us to scrub the different mirrors in one go, and avoids
unnecessary reads when repairing corruption.

This means that, no matter the profile, the same data is read just
once.

This also makes the user space handling much simpler: just one ioctl to
scrub the full fs, without the current multi-threaded scrub code.
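
Just as a sketch of what the user space side could then look like (this
is not btrfs-progs code; the helper name and error handling are made up,
only struct btrfs_ioctl_scrub_args, BTRFS_IOC_SCRUB and the new
BTRFS_SCRUB_LOGICAL flag come from the uapi changes of this series):

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Scrub the whole fs mounted at @path using the new logical mode. */
static int scrub_logical_whole_fs(const char *path)
{
        struct btrfs_ioctl_scrub_args sa;
        int fd;
        int ret;

        fd = open(path, O_RDONLY);
        if (fd < 0)
                return -errno;

        memset(&sa, 0, sizeof(sa));
        sa.start = 0;
        sa.end = (__u64)-1;             /* Cover the full logical address space. */
        sa.flags = BTRFS_SCRUB_LOGICAL; /* @devid is ignored for logical scrub. */

        ret = ioctl(fd, BTRFS_IOC_SCRUB, &sa);
        ret = (ret < 0) ? -errno : 0;
        close(fd);
        return ret;
}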

[TODO]
- More testing
  Currently only functional tests have been done, no stress tests yet.

- Better progs integration
  In theory we can silently try SCRUB_LOGICAL first, and if the kernel
  doesn't support it, fall back to the old multi-device scrub.

  But for the current testing, a dedicated option is still assigned to
  the scrub subcommand.

  And currently it only supports the full report, with no summary nor
  scrub-file support yet.
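
  A possible shape for that fallback, as a sketch only
  (scrub_logical_whole_fs() is from the snippet above, scrub_per_device()
  is a hypothetical stand-in for the existing multi-device code, and it
  assumes the kernel reports the unsupported flag as EOPNOTSUPP):

  static int scrub_whole_fs_with_fallback(const char *path)
  {
          int ret;

          /*
           * Try the logical mode first; only fall back to the existing
           * per-device scrub when the kernel rejects the new flag.
           * This assumes unsupported flags are reported as -EOPNOTSUPP.
           */
          ret = scrub_logical_whole_fs(path);
          if (ret == -EOPNOTSUPP)
                  ret = scrub_per_device(path);   /* hypothetical stand-in */
          return ret;
  }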

Qu Wenruo (14):
  btrfs: raid56: remove unnecessary parameter for
    raid56_parity_alloc_scrub_rbio()
  btrfs: raid56: allow scrub operation to update both P and Q stripes
  btrfs: raid56: allow caching P/Q stripes
  btrfs: raid56: add the interfaces to submit recovery rbio
  btrfs: add the ability to read P/Q stripes directly
  btrfs: scrub: make scrub_ctx::stripes dynamically allocated
  btrfs: scrub: introduce the skeleton for logical-scrub
  btrfs: scrub: extract the common preparation before scrubbing a block
    group
  btrfs: scrub: implement the chunk iteration code for scrub_logical
  btrfs: scrub: implement the basic extent iteration code
  btrfs: scrub: implement the repair part of scrub logical
  btrfs: scrub: add RAID56 support for queue_scrub_logical_stripes()
  btrfs: scrub: introduce the RAID56 data recovery path for scrub
    logical
  btrfs: scrub: implement the RAID56 P/Q scrub code

 fs/btrfs/disk-io.c         |    1 +
 fs/btrfs/fs.h              |   12 +
 fs/btrfs/ioctl.c           |    6 +-
 fs/btrfs/raid56.c          |  134 +++-
 fs/btrfs/raid56.h          |   17 +-
 fs/btrfs/scrub.c           | 1198 ++++++++++++++++++++++++++++++------
 fs/btrfs/scrub.h           |    2 +
 fs/btrfs/volumes.c         |   50 +-
 fs/btrfs/volumes.h         |   16 +
 include/uapi/linux/btrfs.h |   11 +-
 10 files changed, 1206 insertions(+), 241 deletions(-)

-- 
2.41.0



* [PATCH 01/14] btrfs: raid56: remove unnecessary parameter for raid56_parity_alloc_scrub_rbio()
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03 12:33   ` Anand Jain
  2023-07-12 16:23   ` Christoph Hellwig
  2023-07-03  7:32 ` [PATCH 02/14] btrfs: raid56: allow scrub operation to update both P and Q stripes Qu Wenruo
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

The parameter @stripe_nsectors is easily calculated using
BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits, and is already cached in
rbio->stripe_nsectors.

Thus we don't need to pass that parameter from the caller, and can
remove it safely.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 4 ++--
 fs/btrfs/raid56.h | 2 +-
 fs/btrfs/scrub.c  | 3 +--
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7d1cb4dd0f2c..cc783da065b4 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2335,7 +2335,7 @@ static void rmw_rbio_work_locked(struct work_struct *work)
 struct btrfs_raid_bio *raid56_parity_alloc_scrub_rbio(struct bio *bio,
 				struct btrfs_io_context *bioc,
 				struct btrfs_device *scrub_dev,
-				unsigned long *dbitmap, int stripe_nsectors)
+				unsigned long *dbitmap)
 {
 	struct btrfs_fs_info *fs_info = bioc->fs_info;
 	struct btrfs_raid_bio *rbio;
@@ -2365,7 +2365,7 @@ struct btrfs_raid_bio *raid56_parity_alloc_scrub_rbio(struct bio *bio,
 	}
 	ASSERT(i < rbio->real_stripes);
 
-	bitmap_copy(&rbio->dbitmap, dbitmap, stripe_nsectors);
+	bitmap_copy(&rbio->dbitmap, dbitmap, rbio->stripe_nsectors);
 	return rbio;
 }
 
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 45e6ff78316f..1871a2a16a1c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -189,7 +189,7 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc);
 struct btrfs_raid_bio *raid56_parity_alloc_scrub_rbio(struct bio *bio,
 				struct btrfs_io_context *bioc,
 				struct btrfs_device *scrub_dev,
-				unsigned long *dbitmap, int stripe_nsectors);
+				unsigned long *dbitmap);
 void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio);
 
 void raid56_parity_cache_data_pages(struct btrfs_raid_bio *rbio,
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 38c103f13fd5..5a0626a3c23c 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1914,8 +1914,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 		btrfs_bio_counter_dec(fs_info);
 		goto out;
 	}
-	rbio = raid56_parity_alloc_scrub_rbio(bio, bioc, scrub_dev, &extent_bitmap,
-				BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits);
+	rbio = raid56_parity_alloc_scrub_rbio(bio, bioc, scrub_dev, &extent_bitmap);
 	btrfs_put_bioc(bioc);
 	if (!rbio) {
 		ret = -ENOMEM;
-- 
2.41.0



* [PATCH 02/14] btrfs: raid56: allow scrub operation to update both P and Q stripes
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
  2023-07-03  7:32 ` [PATCH 01/14] btrfs: raid56: remove unnecessary parameter for raid56_parity_alloc_scrub_rbio() Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-12 16:24   ` Christoph Hellwig
  2023-07-03  7:32 ` [PATCH 03/14] btrfs: raid56: allow caching P/Q stripes Qu Wenruo
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

Currently the scrub_rbio() work (set up by
raid56_parity_alloc_scrub_rbio() and then submitted by
raid56_parity_submit_scrub_rbio()) can only handle a single P/Q stripe.

But inside finish_parity_scrub() we always do the P/Q generation for
both stripes anyway, even though only the @scrubp one gets verified.

For the incoming scrub_logical feature, we want to verify both P and Q
stripes in one go, so here we introduce the ability to scrub both.

To do that the following things are modified:

- raid56_parity_alloc_scrub_rbio() now allows an empty @scrub_dev
  Thankfully we don't have any extra sanity checks rejecting such an
  empty parameter.

- Verify both P and Q stripes if @rbio->scrubp is 0
  Now we do the verification in a loop, and skip the stripe if
  rbio->scrubp is not zero and does not match the stripe number.

- Add bad sectors of both P/Q stripes to writeback
  Unfortunately we're using the same dbitmap, which means that if only
  one of the P/Q stripes is corrupted, we write back both.
  However we can accept the slight extra cost.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 56 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 42 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index cc783da065b4..d96bfc3a16fc 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2363,7 +2363,14 @@ struct btrfs_raid_bio *raid56_parity_alloc_scrub_rbio(struct bio *bio,
 			break;
 		}
 	}
-	ASSERT(i < rbio->real_stripes);
+	/*
+	 * If @scrub_dev is specified, @scrubp must be the stripe number of
+	 * P or Q stripe.
+	 */
+	if (scrub_dev) {
+		ASSERT(rbio->scrubp >= rbio->nr_data);
+		ASSERT(rbio->scrubp < rbio->real_stripes);
+	}
 
 	bitmap_copy(&rbio->dbitmap, dbitmap, rbio->stripe_nsectors);
 	return rbio;
@@ -2426,7 +2433,8 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio)
 	 * Replace is running and our P/Q stripe is being replaced, then we
 	 * need to duplicate the final write to replace target.
 	 */
-	if (bioc->replace_nr_stripes && bioc->replace_stripe_src == rbio->scrubp) {
+	if (rbio->scrubp && bioc->replace_nr_stripes &&
+	    bioc->replace_stripe_src == rbio->scrubp) {
 		is_replace = 1;
 		bitmap_copy(pbitmap, &rbio->dbitmap, rbio->stripe_nsectors);
 	}
@@ -2464,6 +2472,7 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio)
 
 	for_each_set_bit(sectornr, &rbio->dbitmap, rbio->stripe_nsectors) {
 		struct sector_ptr *sector;
+		bool found_error = false;
 		void *parity;
 
 		/* first collect one page from each data stripe */
@@ -2484,14 +2493,22 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio)
 		}
 
 		/* Check scrubbing parity and repair it */
-		sector = rbio_stripe_sector(rbio, rbio->scrubp, sectornr);
-		parity = kmap_local_page(sector->page) + sector->pgoff;
-		if (memcmp(parity, pointers[rbio->scrubp], sectorsize) != 0)
-			memcpy(parity, pointers[rbio->scrubp], sectorsize);
-		else
+		for (int i = rbio->nr_data; i < rbio->real_stripes; i++) {
+			/* Skip this stripe if it's not our scrub target. */
+			if (rbio->scrubp && i != rbio->scrubp)
+				continue;
+
+			sector = rbio_stripe_sector(rbio, i, sectornr);
+			parity = kmap_local_page(sector->page) + sector->pgoff;
+			if (memcmp(parity, pointers[i], sectorsize) != 0) {
+				memcpy(parity, pointers[i], sectorsize);
+				found_error = true;
+			}
+			kunmap_local(parity);
+		}
+		if (!found_error)
 			/* Parity is right, needn't writeback */
 			bitmap_clear(&rbio->dbitmap, sectornr, 1);
-		kunmap_local(parity);
 
 		for (stripe = nr_data - 1; stripe >= 0; stripe--)
 			kunmap_local(pointers[stripe]);
@@ -2514,11 +2531,16 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio)
 	for_each_set_bit(sectornr, &rbio->dbitmap, rbio->stripe_nsectors) {
 		struct sector_ptr *sector;
 
-		sector = rbio_stripe_sector(rbio, rbio->scrubp, sectornr);
-		ret = rbio_add_io_sector(rbio, &bio_list, sector, rbio->scrubp,
-					 sectornr, REQ_OP_WRITE);
-		if (ret)
-			goto cleanup;
+		for (int i = rbio->nr_data; i < rbio->real_stripes; i++) {
+			/* Skip this stripe if it's not our scrub target. */
+			if (rbio->scrubp && i != rbio->scrubp)
+				continue;
+			sector = rbio_stripe_sector(rbio, i, sectornr);
+			ret = rbio_add_io_sector(rbio, &bio_list, sector, i,
+						 sectornr, REQ_OP_WRITE);
+			if (ret)
+				goto cleanup;
+		}
 	}
 
 	if (!is_replace)
@@ -2529,6 +2551,12 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio)
 	 * the target device.  Check we have a valid source stripe number.
 	 */
 	ASSERT(rbio->bioc->replace_stripe_src >= 0);
+	/*
+	 * For dev-replace case, scrubp must be provided by the caller at
+	 * raid56_parity_alloc_scrub_rbio().
+	 */
+	ASSERT(rbio->scrubp >= rbio->nr_data);
+	ASSERT(rbio->scrubp < rbio->real_stripes);
 	for_each_set_bit(sectornr, pbitmap, rbio->stripe_nsectors) {
 		struct sector_ptr *sector;
 
@@ -2625,7 +2653,7 @@ static int recover_scrub_rbio(struct btrfs_raid_bio *rbio)
 		 * scrubbing parity, luckily, use the other one to repair the
 		 * data, or we can not repair the data stripe.
 		 */
-		if (failp != rbio->scrubp) {
+		if (rbio->scrubp && failp != rbio->scrubp) {
 			ret = -EIO;
 			goto out;
 		}
-- 
2.41.0



* [PATCH 03/14] btrfs: raid56: allow caching P/Q stripes
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
  2023-07-03  7:32 ` [PATCH 01/14] btrfs: raid56: remove unnecessary parameter for raid56_parity_alloc_scrub_rbio() Qu Wenruo
  2023-07-03  7:32 ` [PATCH 02/14] btrfs: raid56: allow scrub operation to update both P and Q stripes Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-12 16:27   ` Christoph Hellwig
  2023-07-03  7:32 ` [PATCH 04/14] btrfs: raid56: add the interfaces to submit recovery rbio Qu Wenruo
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

Currently raid56_parity_cache_data_pages() only allows caching data
stripes, as the only scrub call site is only going to scrub P/Q stripes.

But later we want to use the cached pages to do recovery, thus we also
need to be able to cache P/Q stripes.

This patch makes the following changes to allow that:

- Rename the function to raid56_parity_cache_pages()

- Use @stripe_num to indicate where the cached stripe is

- Change the ASSERT() to allow P/Q stripes to be cached

- Introduce a new member, btrfs_raid_bio::cached, to allow the recovery
  path to use cached pages

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 43 ++++++++++++++++++++++++++++++-------------
 fs/btrfs/raid56.h | 12 +++++++++---
 fs/btrfs/scrub.c  |  3 +--
 3 files changed, 40 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d96bfc3a16fc..b2eb7e60a137 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1932,11 +1932,10 @@ static void recover_rbio(struct btrfs_raid_bio *rbio)
 
 	/*
 	 * Read everything that hasn't failed. However this time we will
-	 * not trust any cached sector.
+	 * not trust any cached sector, unless it's explicitly required.
+	 *
 	 * As we may read out some stale data but higher layer is not reading
 	 * that stale part.
-	 *
-	 * So here we always re-read everything in recovery path.
 	 */
 	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
 	     total_sector_nr++) {
@@ -1960,6 +1959,14 @@ static void recover_rbio(struct btrfs_raid_bio *rbio)
 		}
 
 		sector = rbio_stripe_sector(rbio, stripe, sectornr);
+
+		/*
+		 * If we're forced to use cache and the sector is already cached,
+		 * then we can skip this sector.
+		 * Otherwise we should still read from disk.
+		 */
+		if (rbio->cached && sector->uptodate)
+			continue;
 		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
 					 sectornr, REQ_OP_READ);
 		if (ret < 0) {
@@ -2765,22 +2772,32 @@ void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio)
 }
 
 /*
- * This is for scrub call sites where we already have correct data contents.
- * This allows us to avoid reading data stripes again.
+ * This is for scrub call sites where we already have correct stripe contents.
+ * This allows us to avoid reading on-disk stripes again.
  *
  * Unfortunately here we have to do page copy, other than reusing the pages.
  * This is due to the fact rbio has its own page management for its cache.
+ * But this is also a good thing for recovery tries, this prevents the recovery
+ * path to modify the stripes until we've verified the recovered data.
+ *
+ * @rbio:	The allocated rbio by raid56_parity_alloc_*_rbio()
+ * @pages:	The pages which contains the stripe contents.
+ * @stripe_num:	The stripe number, 0 means the first data stripe, and
+ *		@rbio->nr_data means the P stripe.
  */
-void raid56_parity_cache_data_pages(struct btrfs_raid_bio *rbio,
-				    struct page **data_pages, u64 data_logical)
+void raid56_parity_cache_pages(struct btrfs_raid_bio *rbio, struct page **pages,
+			       int stripe_num)
 {
-	const u64 offset_in_full_stripe = data_logical -
-					  rbio->bioc->full_stripe_logical;
+	const u64 offset_in_full_stripe = btrfs_stripe_nr_to_offset(stripe_num);
 	const int page_index = offset_in_full_stripe >> PAGE_SHIFT;
 	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
 	const u32 sectors_per_page = PAGE_SIZE / sectorsize;
 	int ret;
 
+	if (stripe_num >= rbio->nr_data)
+		ret = alloc_rbio_parity_pages(rbio);
+	else
+		ret = alloc_rbio_data_pages(rbio);
 	/*
 	 * If we hit ENOMEM temporarily, but later at
 	 * raid56_parity_submit_scrub_rbio() time it succeeded, we just do
@@ -2789,17 +2806,16 @@ void raid56_parity_cache_data_pages(struct btrfs_raid_bio *rbio,
 	 * If we hit ENOMEM later at raid56_parity_submit_scrub_rbio() time,
 	 * the bio would got proper error number set.
 	 */
-	ret = alloc_rbio_data_pages(rbio);
 	if (ret < 0)
 		return;
 
-	/* data_logical must be at stripe boundary and inside the full stripe. */
+	/* The stripe must be at stripe boundary and inside the full stripe. */
 	ASSERT(IS_ALIGNED(offset_in_full_stripe, BTRFS_STRIPE_LEN));
-	ASSERT(offset_in_full_stripe < btrfs_stripe_nr_to_offset(rbio->nr_data));
+	ASSERT(offset_in_full_stripe < btrfs_stripe_nr_to_offset(rbio->real_stripes));
 
 	for (int page_nr = 0; page_nr < (BTRFS_STRIPE_LEN >> PAGE_SHIFT); page_nr++) {
 		struct page *dst = rbio->stripe_pages[page_nr + page_index];
-		struct page *src = data_pages[page_nr];
+		struct page *src = pages[page_nr];
 
 		memcpy_page(dst, 0, src, 0, PAGE_SIZE);
 		for (int sector_nr = sectors_per_page * page_index;
@@ -2807,4 +2823,5 @@ void raid56_parity_cache_data_pages(struct btrfs_raid_bio *rbio,
 		     sector_nr++)
 			rbio->stripe_sectors[sector_nr].uptodate = true;
 	}
+	rbio->cached = true;
 }
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 1871a2a16a1c..41354a9f158a 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -82,6 +82,13 @@ struct btrfs_raid_bio {
 	 */
 	int bio_list_bytes;
 
+	/*
+	 * If this rbio is forced to use cached stripes provided by the caller.
+	 *
+	 * Used by scrub path to reduce IO.
+	 */
+	bool cached;
+
 	refcount_t refs;
 
 	atomic_t stripes_pending;
@@ -191,9 +198,8 @@ struct btrfs_raid_bio *raid56_parity_alloc_scrub_rbio(struct bio *bio,
 				struct btrfs_device *scrub_dev,
 				unsigned long *dbitmap);
 void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio);
-
-void raid56_parity_cache_data_pages(struct btrfs_raid_bio *rbio,
-				    struct page **data_pages, u64 data_logical);
+void raid56_parity_cache_pages(struct btrfs_raid_bio *rbio, struct page **pages,
+			       int stripe_num);
 
 int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info);
 void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 5a0626a3c23c..1864856abb13 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1925,8 +1925,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 	for (int i = 0; i < data_stripes; i++) {
 		stripe = &sctx->raid56_data_stripes[i];
 
-		raid56_parity_cache_data_pages(rbio, stripe->pages,
-				full_stripe_start + btrfs_stripe_nr_to_offset(i));
+		raid56_parity_cache_pages(rbio, stripe->pages, i);
 	}
 	raid56_parity_submit_scrub_rbio(rbio);
 	wait_for_completion_io(&io_done);
-- 
2.41.0



* [PATCH 04/14] btrfs: raid56: add the interfaces to submit recovery rbio
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (2 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 03/14] btrfs: raid56: allow caching P/Q stripes Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03  7:32 ` [PATCH 05/14] btrfs: add the ability to read P/Q stripes directly Qu Wenruo
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

The incoming scrub_logical feature needs to recover raid56 data sectors
using cached pages.

This means we cannot go through the regular btrfs_submit_bio() path,
but need a path similar to raid56_parity_alloc_scrub_rbio().

So this patch adds the following new functions to allow a recovery rbio
to be allocated and submitted outside of the btrfs_submit_bio() path:

- raid56_parity_alloc_recover_rbio()
- raid56_parity_submit_recover_rbio()

This means we can now do a fully cached recovery without reading any
pages from disk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 31 +++++++++++++++++++++++++++++++
 fs/btrfs/raid56.h |  3 +++
 2 files changed, 34 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index b2eb7e60a137..0340220463c6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2383,6 +2383,31 @@ struct btrfs_raid_bio *raid56_parity_alloc_scrub_rbio(struct bio *bio,
 	return rbio;
 }
 
+/*
+ * Alloc a recovery rbio out of the regular btrfs_submit_bio() path.
+ *
+ * This allows scrub caller to use cached pages to reduce IO.
+ */
+struct btrfs_raid_bio *raid56_parity_alloc_recover_rbio(struct bio *bio,
+				struct btrfs_io_context *bioc, int mirror_num)
+{
+	struct btrfs_fs_info *fs_info = bioc->fs_info;
+	struct btrfs_raid_bio *rbio;
+
+	rbio = alloc_rbio(fs_info, bioc);
+	if (IS_ERR(rbio))
+		return NULL;
+	/* We should have some sectors that really need to be recovered. */
+	ASSERT(bio->bi_iter.bi_size);
+	bio_list_add(&rbio->bio_list, bio);
+	set_rbio_range_error(rbio, bio);
+	rbio->operation = BTRFS_RBIO_READ_REBUILD;
+	if (mirror_num > 2)
+		set_rbio_raid6_extra_error(rbio, mirror_num);
+
+	return rbio;
+}
+
 /*
  * We just scrub the parity that we have correct data on the same horizontal,
  * so we needn't allocate all pages for all the stripes.
@@ -2771,6 +2796,12 @@ void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio)
 		start_async_work(rbio, scrub_rbio_work_locked);
 }
 
+void raid56_parity_submit_recover_rbio(struct btrfs_raid_bio *rbio)
+{
+	if (!lock_stripe_add(rbio))
+		start_async_work(rbio, recover_rbio_work_locked);
+}
+
 /*
  * This is for scrub call sites where we already have correct stripe contents.
  * This allows us to avoid reading on-disk stripes again.
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 41354a9f158a..6f146eb86832 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -197,7 +197,10 @@ struct btrfs_raid_bio *raid56_parity_alloc_scrub_rbio(struct bio *bio,
 				struct btrfs_io_context *bioc,
 				struct btrfs_device *scrub_dev,
 				unsigned long *dbitmap);
+struct btrfs_raid_bio *raid56_parity_alloc_recover_rbio(struct bio *bio,
+				struct btrfs_io_context *bioc, int mirror_num);
 void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio);
+void raid56_parity_submit_recover_rbio(struct btrfs_raid_bio *rbio);
 void raid56_parity_cache_pages(struct btrfs_raid_bio *rbio, struct page **pages,
 			       int stripe_num);
 
-- 
2.41.0



* [PATCH 05/14] btrfs: add the ability to read P/Q stripes directly
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (3 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 04/14] btrfs: raid56: add the interfaces to submit recovery rbio Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03  9:46   ` Qu Wenruo
  2023-07-03  7:32 ` [PATCH 06/14] btrfs: scrub: make scrub_ctx::stripes dynamically allocated Qu Wenruo
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

Currently there is no way to read P/Q stripes of a RAID56 full stripe
directly.

Normally a caller would call btrfs_map_block() and fetch the physical
location of the P/Q stripes directly.

But after the recent scrub rework this is no longer that elegant, as
the scrub code is fully based on mirror_num, thus this patch adds two
special mirror_num values for this usage:

- Mirror -1 for P stripes
- Mirror -2 for Q stripes

Both mirrors only support read for now, and the caller should make sure
the range start is stripe aligned.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c   |  2 +-
 fs/btrfs/volumes.c | 30 +++++++++++++++++++++++++++++-
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 1864856abb13..41c514db0793 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1601,7 +1601,7 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
 	int mirror = stripe->mirror_num;
 
 	ASSERT(stripe->bg);
-	ASSERT(stripe->mirror_num > 0);
+	ASSERT(stripe->mirror_num);
 	ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
 
 	bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index cd632b3f579f..c22007bd830b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6251,10 +6251,12 @@ static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *
  *
  *			For RAID56 profile, mirror 1 means rebuild from P and
  *			the remaining data stripes.
+ *			And mirror -1 means read P stripes directly, -2 means
+ *			read Q stripes directly.
  *
  *			For RAID6 profile, mirror > 2 means mark another
  *			data/P stripe error and rebuild from the remaining
- *			stripes..
+ *			stripes.
  *
  * @need_raid_map:	(Used only for integrity checker) whether the map wants
  *                      a full stripe map (including all data and P/Q stripes)
@@ -6297,6 +6299,12 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		goto out_free_em;
 	}
 
+	/* Only allow mirror_num < 0 for RAID56. */
+	if (mirror_num < 0 && !(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)) {
+		free_extent_map(em);
+		return -EINVAL;
+	}
+
 	data_stripes = nr_data_stripes(map);
 
 	map_offset = logical - em->start;
@@ -6382,6 +6390,26 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 				  logical;
 			stripe_index = 0;
 			stripe_offset = 0;
+		} else if (mirror_num < 0) {
+			/* Only allow READ for direct P/Q operations. */
+			ASSERT(op == BTRFS_MAP_READ);
+			/*
+			 * Caller should make sure the range start is stripe
+			 * aligned.
+			 */
+			ASSERT(stripe_offset == 0);
+
+			/*
+			 * Stripes of @bioc is already sorted, stripes[0] is the
+			 * first data stripe and stripes[@data_stripes] is the
+			 * P stripe.
+			 * So we only need to update the @stripe_index to the
+			 * specified stripe, and set @stripe_nr/@stripe_offset
+			 * to 0, so we can return the beginning of the P/Q stripe.
+			 */
+			stripe_offset = 0;
+			stripe_nr = 0;
+			stripe_index = data_stripes + (-mirror_num - 1);
 		} else {
 			/*
 			 * Mirror #0 or #1 means the original data block.
-- 
2.41.0



* [PATCH 06/14] btrfs: scrub: make scrub_ctx::stripes dynamically allocated
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (4 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 05/14] btrfs: add the ability to read P/Q stripes directly Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03  7:32 ` [PATCH 07/14] btrfs: scrub: introduce the skeleton for logical-scrub Qu Wenruo
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Currently scrub_ctx::stripes is a fixed-size array. This is fine for
most use cases, but later we may want to allocate a larger array for
logical bytenr based scrub.

So here we change the member to a dynamically allocated array.

This also affects the lifespan of the member.
Now it only needs to be allocated and initialized at the beginning of
the scrub_stripe() function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 67 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 51 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 41c514db0793..1e49bb066619 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -172,7 +172,7 @@ struct scrub_stripe {
 };
 
 struct scrub_ctx {
-	struct scrub_stripe	stripes[SCRUB_STRIPES_PER_SCTX];
+	struct scrub_stripe	*stripes;
 	struct scrub_stripe	*raid56_data_stripes;
 	struct btrfs_fs_info	*fs_info;
 	int			first_free;
@@ -181,6 +181,9 @@ struct scrub_ctx {
 	int			readonly;
 	int			sectors_per_bio;
 
+	/* Number of stripes we have in @stripes. */
+	unsigned int		nr_stripes;
+
 	/* State of IO submission throttling affecting the associated device */
 	ktime_t			throttle_deadline;
 	u64			throttle_sent;
@@ -308,16 +311,24 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
 	scrub_pause_off(fs_info);
 }
 
+static void free_scrub_stripes(struct scrub_ctx *sctx)
+{
+	if (!sctx->stripes)
+		return;
+
+	for (int i = 0; i < sctx->nr_stripes; i++)
+		release_scrub_stripe(&sctx->stripes[i]);
+	kfree(sctx->stripes);
+	sctx->nr_stripes = 0;
+	sctx->stripes = NULL;
+}
+
 static noinline_for_stack void scrub_free_ctx(struct scrub_ctx *sctx)
 {
-	int i;
-
 	if (!sctx)
 		return;
 
-	for (i = 0; i < SCRUB_STRIPES_PER_SCTX; i++)
-		release_scrub_stripe(&sctx->stripes[i]);
-
+	free_scrub_stripes(sctx);
 	kfree(sctx);
 }
 
@@ -331,7 +342,6 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
 		struct btrfs_fs_info *fs_info, int is_dev_replace)
 {
 	struct scrub_ctx *sctx;
-	int		i;
 
 	sctx = kzalloc(sizeof(*sctx), GFP_KERNEL);
 	if (!sctx)
@@ -339,14 +349,6 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
 	refcount_set(&sctx->refs, 1);
 	sctx->is_dev_replace = is_dev_replace;
 	sctx->fs_info = fs_info;
-	for (i = 0; i < SCRUB_STRIPES_PER_SCTX; i++) {
-		int ret;
-
-		ret = init_scrub_stripe(fs_info, &sctx->stripes[i]);
-		if (ret < 0)
-			goto nomem;
-		sctx->stripes[i].sctx = sctx;
-	}
 	sctx->first_free = 0;
 	atomic_set(&sctx->cancel_req, 0);
 
@@ -1659,6 +1661,7 @@ static int flush_scrub_stripes(struct scrub_ctx *sctx)
 	const int nr_stripes = sctx->cur_stripe;
 	int ret = 0;
 
+	ASSERT(nr_stripes <= sctx->nr_stripes);
 	if (!nr_stripes)
 		return 0;
 
@@ -1753,8 +1756,11 @@ static int queue_scrub_stripe(struct scrub_ctx *sctx, struct btrfs_block_group *
 	struct scrub_stripe *stripe;
 	int ret;
 
+	ASSERT(sctx->stripes);
+	ASSERT(sctx->nr_stripes);
+
 	/* No available slot, submit all stripes and wait for them. */
-	if (sctx->cur_stripe >= SCRUB_STRIPES_PER_SCTX) {
+	if (sctx->cur_stripe >= sctx->nr_stripes) {
 		ret = flush_scrub_stripes(sctx);
 		if (ret < 0)
 			return ret;
@@ -2076,6 +2082,30 @@ static int scrub_simple_stripe(struct scrub_ctx *sctx,
 	return ret;
 }
 
+static int alloc_scrub_stripes(struct scrub_ctx *sctx, int nr_stripes)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	int ret;
+
+	ASSERT(!sctx->stripes);
+	ASSERT(!sctx->nr_stripes);
+	sctx->stripes = kcalloc(nr_stripes, sizeof(struct scrub_stripe),
+				GFP_KERNEL);
+	if (!sctx->stripes)
+		return -ENOMEM;
+	sctx->nr_stripes = nr_stripes;
+	for (int i = 0; i < sctx->nr_stripes; i++) {
+		ret = init_scrub_stripe(fs_info, &sctx->stripes[i]);
+		if (ret < 0)
+			goto cleanup;
+		sctx->stripes[i].sctx = sctx;
+	}
+	return 0;
+cleanup:
+	free_scrub_stripes(sctx);
+	return -ENOMEM;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct btrfs_block_group *bg,
 					   struct extent_map *em,
@@ -2102,6 +2132,10 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 
 	scrub_blocked_if_needed(fs_info);
 
+	ret = alloc_scrub_stripes(sctx, SCRUB_STRIPES_PER_SCTX);
+	if (ret < 0)
+		return ret;
+
 	if (sctx->is_dev_replace &&
 	    btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) {
 		mutex_lock(&sctx->wr_lock);
@@ -2224,6 +2258,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		kfree(sctx->raid56_data_stripes);
 		sctx->raid56_data_stripes = NULL;
 	}
+	free_scrub_stripes(sctx);
 
 	if (sctx->is_dev_replace && ret >= 0) {
 		int ret2;
-- 
2.41.0



* [PATCH 07/14] btrfs: scrub: introduce the skeleton for logical-scrub
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (5 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 06/14] btrfs: scrub: make scrub_ctx::stripes dynamically allocated Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03  7:32 ` [PATCH 08/14] btrfs: scrub: extract the common preparation before scrubbing a block group Qu Wenruo
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Currently btrfs scrub is a per-device operation. This is fine for most
non-RAID56 profiles, but not a good thing for RAID56.

The main challenge for RAID56 is that we read out data stripes more
than once (once for the data stripe scrub itself, and more for the
parities).

To address this, and maybe even improve the non-RAID56 scrub, here we
introduce a new scrub flag, SCRUB_LOGICAL.

This would be a per-fs operation, and conflicts with any
dev-scrub/dev-replace.

This patch only implements the basic exclusion checks.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c         |  1 +
 fs/btrfs/fs.h              | 12 +++++++
 fs/btrfs/ioctl.c           |  6 +++-
 fs/btrfs/scrub.c           | 72 +++++++++++++++++++++++++++++++++++++-
 fs/btrfs/scrub.h           |  2 ++
 include/uapi/linux/btrfs.h | 11 ++++--
 6 files changed, 100 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7513388b0567..c11f28becaf2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1850,6 +1850,7 @@ static void btrfs_init_scrub(struct btrfs_fs_info *fs_info)
 {
 	mutex_init(&fs_info->scrub_lock);
 	atomic_set(&fs_info->scrubs_running, 0);
+	atomic_set(&fs_info->scrubs_logical_running, 0);
 	atomic_set(&fs_info->scrub_pause_req, 0);
 	atomic_set(&fs_info->scrubs_paused, 0);
 	atomic_set(&fs_info->scrub_cancel_req, 0);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 203d2a267828..f3d50f97ab27 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -632,7 +632,19 @@ struct btrfs_fs_info {
 
 	/* Private scrub information */
 	struct mutex scrub_lock;
+
+	/*
+	 * Number of running scrubbing, including both dev-scrub (at most
+	 * one dev-scrub on each device) and logical-scrub (at most
+	 * one logical-scrub for each fs).
+	 */
 	atomic_t scrubs_running;
+
+	/*
+	 * Number of running logical scrubbing, there is at most one running
+	 * logical scrub for each fs.
+	 */
+	atomic_t scrubs_logical_running;
 	atomic_t scrub_pause_req;
 	atomic_t scrubs_paused;
 	atomic_t scrub_cancel_req;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index edbbd5cf23fc..1a1afcce73e0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3179,7 +3179,11 @@ static long btrfs_ioctl_scrub(struct file *file, void __user *arg)
 			goto out;
 	}
 
-	ret = btrfs_scrub_dev(fs_info, sa->devid, sa->start, sa->end,
+	if (sa->flags & BTRFS_SCRUB_LOGICAL)
+		ret = btrfs_scrub_logical(fs_info, sa->start, sa->end,
+				&sa->progress, sa->flags & BTRFS_SCRUB_READONLY);
+	else
+		ret = btrfs_scrub_dev(fs_info, sa->devid, sa->start, sa->end,
 			      &sa->progress, sa->flags & BTRFS_SCRUB_READONLY,
 			      0);
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 1e49bb066619..ff637f83aa0e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -178,7 +178,8 @@ struct scrub_ctx {
 	int			first_free;
 	int			cur_stripe;
 	atomic_t		cancel_req;
-	int			readonly;
+	bool			readonly;
+	bool			scrub_logical;
 	int			sectors_per_bio;
 
 	/* Number of stripes we have in @stripes. */
@@ -2835,6 +2836,14 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		goto out;
 	}
 
+	/* Dev-scrub conflicts with logical-scrub. */
+	if (atomic_read(&fs_info->scrubs_logical_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		ret = -EINPROGRESS;
+		goto out;
+	}
+
 	down_read(&fs_info->dev_replace.rwsem);
 	if (dev->scrub_ctx ||
 	    (!is_dev_replace &&
@@ -2945,6 +2954,67 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	return ret;
 }
 
+int btrfs_scrub_logical(struct btrfs_fs_info *fs_info, u64 start, u64 end,
+			struct btrfs_scrub_progress *progress, bool readonly)
+{
+	struct scrub_ctx *sctx;
+	int ret;
+
+	if (btrfs_fs_closing(fs_info))
+		return -EAGAIN;
+
+	/* At mount time we have ensured nodesize is in the range of [4K, 64K]. */
+	ASSERT(fs_info->nodesize <= BTRFS_STRIPE_LEN);
+
+	sctx = scrub_setup_ctx(fs_info, false);
+	if (IS_ERR(sctx))
+		return PTR_ERR(sctx);
+	sctx->scrub_logical = true;
+	sctx->readonly = readonly;
+
+	ret = scrub_workers_get(fs_info, false);
+	if (ret)
+		goto out_free_ctx;
+
+	/* Make sure we're the only running scrub. */
+	mutex_lock(&fs_info->scrub_lock);
+	if (atomic_read(&fs_info->scrubs_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		ret = -EINPROGRESS;
+		goto out;
+	}
+	down_read(&fs_info->dev_replace.rwsem);
+	if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
+		up_read(&fs_info->dev_replace.rwsem);
+		mutex_unlock(&fs_info->scrub_lock);
+		ret = -EINPROGRESS;
+		goto out;
+	}
+	up_read(&fs_info->dev_replace.rwsem);
+	/*
+	 * Checking @scrub_pause_req here, we can avoid
+	 * race between committing transaction and scrubbing.
+	 */
+	__scrub_blocked_if_needed(fs_info);
+	atomic_inc(&fs_info->scrubs_running);
+	atomic_inc(&fs_info->scrubs_logical_running);
+	mutex_unlock(&fs_info->scrub_lock);
+
+	/* The main work would be implemented. */
+	ret = -EOPNOTSUPP;
+
+	atomic_dec(&fs_info->scrubs_running);
+	atomic_dec(&fs_info->scrubs_logical_running);
+	wake_up(&fs_info->scrub_pause_wait);
+	if (progress)
+		memcpy(progress, &sctx->stat, sizeof(*progress));
+out:
+	scrub_workers_put(fs_info);
+out_free_ctx:
+	scrub_free_ctx(sctx);
+	return ret;
+}
+
 void btrfs_scrub_pause(struct btrfs_fs_info *fs_info)
 {
 	mutex_lock(&fs_info->scrub_lock);
diff --git a/fs/btrfs/scrub.h b/fs/btrfs/scrub.h
index 7639103ebf9d..22db0f71083e 100644
--- a/fs/btrfs/scrub.h
+++ b/fs/btrfs/scrub.h
@@ -6,6 +6,8 @@
 int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		    u64 end, struct btrfs_scrub_progress *progress,
 		    int readonly, int is_dev_replace);
+int btrfs_scrub_logical(struct btrfs_fs_info *fs_info, u64 start, u64 end,
+			struct btrfs_scrub_progress *progress, bool readonly);
 void btrfs_scrub_pause(struct btrfs_fs_info *fs_info);
 void btrfs_scrub_continue(struct btrfs_fs_info *fs_info);
 int btrfs_scrub_cancel(struct btrfs_fs_info *info);
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dbb8b96da50d..fa5cb3b3611b 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -186,8 +186,15 @@ struct btrfs_scrub_progress {
 					 * Intermittent error. */
 };
 
-#define BTRFS_SCRUB_READONLY	1
-#define BTRFS_SCRUB_SUPPORTED_FLAGS	(BTRFS_SCRUB_READONLY)
+#define BTRFS_SCRUB_READONLY	(1ULL << 0)
+
+/*
+ * Regular scrub is based on device, while with this flag, scrub is
+ * based on logical, and @devid would be ignored.
+ */
+#define BTRFS_SCRUB_LOGICAL	(1ULL << 1)
+#define BTRFS_SCRUB_SUPPORTED_FLAGS	(BTRFS_SCRUB_READONLY |\
+					 BTRFS_SCRUB_LOGICAL)
 struct btrfs_ioctl_scrub_args {
 	__u64 devid;				/* in */
 	__u64 start;				/* in */
-- 
2.41.0



* [PATCH 08/14] btrfs: scrub: extract the common preparation before scrubbing a block group
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (6 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 07/14] btrfs: scrub: introduce the skeleton for logical-scrub Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03  7:32 ` [PATCH 09/14] btrfs: scrub: implement the chunk iteration code for scrub_logical Qu Wenruo
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new helper, prepare_scrub_block_group(), to handle the
checks before scrubbing a block group.

For now the helper is only called by scrub_enumerate_chunks(), but it
will be reused by the later scrub_logical feature.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 264 ++++++++++++++++++++++++++---------------------
 1 file changed, 147 insertions(+), 117 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ff637f83aa0e..806c4683a7ef 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2343,6 +2343,143 @@ static int finish_extent_writes_for_zoned(struct btrfs_root *root,
 	return btrfs_commit_transaction(trans);
 }
 
+/*
+ * Do the preparation before we scrub the block group.
+ *
+ * Return >0 if we don't need to scrub the block group.
+ * Return 0 if we have done the preparation and set @ro_set_ret.
+ * Return <0 for error and can not scrub the bg.
+ */
+static int prepare_scrub_block_group(struct scrub_ctx *sctx,
+				     struct btrfs_block_group *bg,
+				     bool *ro_set_ret)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	struct btrfs_root *root = fs_info->dev_root;
+	bool ro_set = false;
+	int ret;
+
+	if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
+		if (!test_bit(BLOCK_GROUP_FLAG_TO_COPY, &bg->runtime_flags))
+			return 1;
+	}
+
+	/*
+	 * Make sure that while we are scrubbing the corresponding block
+	 * group doesn't get its logical address and its device extents
+	 * reused for another block group, which can possibly be of a
+	 * different type and different profile. We do this to prevent
+	 * false error detections and crashes due to bogus attempts to
+	 * repair extents.
+	 */
+	spin_lock(&bg->lock);
+	if (test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags)) {
+		spin_unlock(&bg->lock);
+		return 1;
+	}
+	btrfs_freeze_block_group(bg);
+	spin_unlock(&bg->lock);
+
+	/*
+	 * we need call btrfs_inc_block_group_ro() with scrubs_paused,
+	 * to avoid deadlock caused by:
+	 * btrfs_inc_block_group_ro()
+	 * -> btrfs_wait_for_commit()
+	 * -> btrfs_commit_transaction()
+	 * -> btrfs_scrub_pause()
+	 */
+	scrub_pause_on(fs_info);
+
+	/*
+	 * Don't do chunk preallocation for scrub.
+	 *
+	 * This is especially important for SYSTEM bgs, or we can hit
+	 * -EFBIG from btrfs_finish_chunk_alloc() like:
+	 * 1. The only SYSTEM bg is marked RO.
+	 *    Since SYSTEM bg is small, that's pretty common.
+	 * 2. New SYSTEM bg will be allocated
+	 *    Due to regular version will allocate new chunk.
+	 * 3. New SYSTEM bg is empty and will get cleaned up
+	 *    Before cleanup really happens, it's marked RO again.
+	 * 4. Empty SYSTEM bg get scrubbed
+	 *    We go back to 2.
+	 *
+	 * This can easily boost the amount of SYSTEM chunks if cleaner
+	 * thread can't be triggered fast enough, and use up all space
+	 * of btrfs_super_block::sys_chunk_array
+	 *
+	 * While for dev replace, we need to try our best to mark block
+	 * group RO, to prevent race between:
+	 * - Write duplication
+	 *   Contains latest data
+	 * - Scrub copy
+	 *   Contains data from commit tree
+	 *
+	 * If target block group is not marked RO, nocow writes can
+	 * be overwritten by scrub copy, causing data corruption.
+	 * So for dev-replace, it's not allowed to continue if a block
+	 * group is not RO.
+	 */
+	ret = btrfs_inc_block_group_ro(bg, sctx->is_dev_replace);
+	if (!ret && sctx->is_dev_replace) {
+		ret = finish_extent_writes_for_zoned(root, bg);
+		if (ret) {
+			btrfs_dec_block_group_ro(bg);
+			scrub_pause_off(fs_info);
+			return ret;
+		}
+	}
+
+	if (ret == 0)
+		ro_set = 1;
+	if (ret == -ENOSPC && !sctx->is_dev_replace &&
+		   !(bg->flags & BTRFS_BLOCK_GROUP_RAID56_MASK))
+		/*
+		 * btrfs_inc_block_group_ro return -ENOSPC when it
+		 * failed in creating new chunk for metadata.
+		 * It is not a problem for scrub, because
+		 * metadata are always cowed, and our scrub paused
+		 * commit_transactions.
+		 *
+		 * For RAID56 chunks, we have to mark them read-only
+		 * for scrub, as later we would use our own cache
+		 * out of RAID56 realm.
+		 * Thus we want the RAID56 bg to be marked RO to
+		 * prevent RMW from screwing up out cache.
+		 */
+		ro_set = 0;
+
+	if (ret == -ETXTBSY) {
+		btrfs_warn(fs_info,
+	   "skipping scrub of block group %llu due to active swapfile",
+			   bg->start);
+		btrfs_unfreeze_block_group(bg);
+		scrub_pause_off(fs_info);
+		return ret;
+	}
+	if (ret < 0) {
+		btrfs_warn(fs_info,
+			   "failed setting block group ro: %d", ret);
+		btrfs_unfreeze_block_group(bg);
+		scrub_pause_off(fs_info);
+		return ret;
+	}
+
+	/*
+	 * Now the target block is marked RO, wait for nocow writes to
+	 * finish before dev-replace.
+	 * COW is fine, as COW never overwrites extents in commit tree.
+	 */
+	if (sctx->is_dev_replace) {
+		btrfs_wait_nocow_writers(bg);
+		btrfs_wait_ordered_roots(fs_info, U64_MAX, bg->start,
+					 bg->length);
+	}
+	scrub_pause_off(fs_info);
+	*ro_set_ret = ro_set;
+	return 0;
+}
+
 static noinline_for_stack
 int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			   struct btrfs_device *scrub_dev, u64 start, u64 end)
@@ -2353,7 +2490,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 	struct btrfs_root *root = fs_info->dev_root;
 	u64 chunk_offset;
 	int ret = 0;
-	int ro_set;
 	int slot;
 	struct extent_buffer *l;
 	struct btrfs_key key;
@@ -2374,6 +2510,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 	key.type = BTRFS_DEV_EXTENT_KEY;
 
 	while (1) {
+		bool ro_set = false;
 		u64 dev_extent_len;
 
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
@@ -2455,127 +2592,20 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			goto skip;
 		}
 
-		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
-			if (!test_bit(BLOCK_GROUP_FLAG_TO_COPY, &cache->runtime_flags)) {
-				btrfs_put_block_group(cache);
-				goto skip;
-			}
+		ret = prepare_scrub_block_group(sctx, cache, &ro_set);
+		if (ret == -ETXTBSY) {
+			ret = 0;
+			goto skip_unfreeze;
 		}
-
-		/*
-		 * Make sure that while we are scrubbing the corresponding block
-		 * group doesn't get its logical address and its device extents
-		 * reused for another block group, which can possibly be of a
-		 * different type and different profile. We do this to prevent
-		 * false error detections and crashes due to bogus attempts to
-		 * repair extents.
-		 */
-		spin_lock(&cache->lock);
-		if (test_bit(BLOCK_GROUP_FLAG_REMOVED, &cache->runtime_flags)) {
-			spin_unlock(&cache->lock);
+		if (ret < 0) {
+			btrfs_put_block_group(cache);
+			break;
+		}
+		if (ret > 0) {
 			btrfs_put_block_group(cache);
 			goto skip;
 		}
-		btrfs_freeze_block_group(cache);
-		spin_unlock(&cache->lock);
 
-		/*
-		 * we need call btrfs_inc_block_group_ro() with scrubs_paused,
-		 * to avoid deadlock caused by:
-		 * btrfs_inc_block_group_ro()
-		 * -> btrfs_wait_for_commit()
-		 * -> btrfs_commit_transaction()
-		 * -> btrfs_scrub_pause()
-		 */
-		scrub_pause_on(fs_info);
-
-		/*
-		 * Don't do chunk preallocation for scrub.
-		 *
-		 * This is especially important for SYSTEM bgs, or we can hit
-		 * -EFBIG from btrfs_finish_chunk_alloc() like:
-		 * 1. The only SYSTEM bg is marked RO.
-		 *    Since SYSTEM bg is small, that's pretty common.
-		 * 2. New SYSTEM bg will be allocated
-		 *    Due to regular version will allocate new chunk.
-		 * 3. New SYSTEM bg is empty and will get cleaned up
-		 *    Before cleanup really happens, it's marked RO again.
-		 * 4. Empty SYSTEM bg get scrubbed
-		 *    We go back to 2.
-		 *
-		 * This can easily boost the amount of SYSTEM chunks if cleaner
-		 * thread can't be triggered fast enough, and use up all space
-		 * of btrfs_super_block::sys_chunk_array
-		 *
-		 * While for dev replace, we need to try our best to mark block
-		 * group RO, to prevent race between:
-		 * - Write duplication
-		 *   Contains latest data
-		 * - Scrub copy
-		 *   Contains data from commit tree
-		 *
-		 * If target block group is not marked RO, nocow writes can
-		 * be overwritten by scrub copy, causing data corruption.
-		 * So for dev-replace, it's not allowed to continue if a block
-		 * group is not RO.
-		 */
-		ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
-		if (!ret && sctx->is_dev_replace) {
-			ret = finish_extent_writes_for_zoned(root, cache);
-			if (ret) {
-				btrfs_dec_block_group_ro(cache);
-				scrub_pause_off(fs_info);
-				btrfs_put_block_group(cache);
-				break;
-			}
-		}
-
-		if (ret == 0) {
-			ro_set = 1;
-		} else if (ret == -ENOSPC && !sctx->is_dev_replace &&
-			   !(cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK)) {
-			/*
-			 * btrfs_inc_block_group_ro return -ENOSPC when it
-			 * failed in creating new chunk for metadata.
-			 * It is not a problem for scrub, because
-			 * metadata are always cowed, and our scrub paused
-			 * commit_transactions.
-			 *
-			 * For RAID56 chunks, we have to mark them read-only
-			 * for scrub, as later we would use our own cache
-			 * out of RAID56 realm.
-			 * Thus we want the RAID56 bg to be marked RO to
-			 * prevent RMW from screwing up out cache.
-			 */
-			ro_set = 0;
-		} else if (ret == -ETXTBSY) {
-			btrfs_warn(fs_info,
-		   "skipping scrub of block group %llu due to active swapfile",
-				   cache->start);
-			scrub_pause_off(fs_info);
-			ret = 0;
-			goto skip_unfreeze;
-		} else {
-			btrfs_warn(fs_info,
-				   "failed setting block group ro: %d", ret);
-			btrfs_unfreeze_block_group(cache);
-			btrfs_put_block_group(cache);
-			scrub_pause_off(fs_info);
-			break;
-		}
-
-		/*
-		 * Now the target block is marked RO, wait for nocow writes to
-		 * finish before dev-replace.
-		 * COW is fine, as COW never overwrites extents in commit tree.
-		 */
-		if (sctx->is_dev_replace) {
-			btrfs_wait_nocow_writers(cache);
-			btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start,
-					cache->length);
-		}
-
-		scrub_pause_off(fs_info);
 		down_write(&dev_replace->rwsem);
 		dev_replace->cursor_right = found_key.offset + dev_extent_len;
 		dev_replace->cursor_left = found_key.offset;
-- 
2.41.0



* [PATCH 09/14] btrfs: scrub: implement the chunk iteration code for scrub_logical
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (7 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 08/14] btrfs: scrub: extract the common preparation before scrubbing a block group Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03  7:32 ` [PATCH 10/14] btrfs: scrub: implement the basic extent iteration code Qu Wenruo
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

This patch implements the code to iterate chunks for the scrub_logical
feature.

The differences between scrub_logical and scrub_dev are:

- No need to look up the dev-extent tree
  scrub_logical can directly grab the block group and check it against
  the logical range, thus no dev-extent lookup is needed.

- No dev-replace related work
  scrub_logical is only a feature to enhance full fs scrub, thus there
  is no need to do dev-replace related checks.

For now only the block group iteration code is implemented; the real
scrub part is not yet done.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 64 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 806c4683a7ef..79e47eee9d1b 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2984,6 +2984,69 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	return ret;
 }
 
+static int scrub_logical_chunks(struct scrub_ctx *sctx, u64 start, u64 end)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	u64 cur = start;
+	int ret = 0;
+
+	while (cur < end) {
+		struct btrfs_block_group *bg;
+		bool ro_set = false;
+
+		bg = btrfs_lookup_first_block_group(fs_info, cur);
+
+		/* No more block groups. */
+		if (!bg)
+			break;
+		if (bg->start > end)
+			break;
+
+		ret = prepare_scrub_block_group(sctx, bg, &ro_set);
+		if (ret == -ETXTBSY) {
+			ret = 0;
+			goto next;
+		}
+		if (ret > 0)
+			goto next;
+		if (ret < 0) {
+			btrfs_put_block_group(bg);
+			break;
+		}
+
+		/* The real work starts here. */
+		ret = -EOPNOTSUPP;
+
+		if (ro_set)
+			btrfs_dec_block_group_ro(bg);
+		/*
+		 * We might have prevented the cleaner kthread from deleting
+		 * this block group if it was already unused because we raced
+		 * and set it to RO mode first. So add it back to the unused
+		 * list, otherwise it might not ever be deleted unless a manual
+		 * balance is triggered or it becomes used and unused again.
+		 */
+		spin_lock(&bg->lock);
+		if (!test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags) &&
+		    !bg->ro && bg->reserved == 0 && bg->used == 0) {
+			spin_unlock(&bg->lock);
+			if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+				btrfs_discard_queue_work(&fs_info->discard_ctl, bg);
+			else
+				btrfs_mark_bg_unused(bg);
+		} else {
+			spin_unlock(&bg->lock);
+		}
+		btrfs_unfreeze_block_group(bg);
+next:
+		cur = bg->start + bg->length;
+		btrfs_put_block_group(bg);
+		if (ret < 0)
+			break;
+	}
+	return ret;
+}
+
 int btrfs_scrub_logical(struct btrfs_fs_info *fs_info, u64 start, u64 end,
 			struct btrfs_scrub_progress *progress, bool readonly)
 {
@@ -3030,8 +3093,7 @@ int btrfs_scrub_logical(struct btrfs_fs_info *fs_info, u64 start, u64 end,
 	atomic_inc(&fs_info->scrubs_logical_running);
 	mutex_unlock(&fs_info->scrub_lock);
 
-	/* The main work would be implemented. */
-	ret = -EOPNOTSUPP;
+	ret = scrub_logical_chunks(sctx, start, end);
 
 	atomic_dec(&fs_info->scrubs_running);
 	atomic_dec(&fs_info->scrubs_logical_running);
-- 
2.41.0



* [PATCH 10/14] btrfs: scrub: implement the basic extent iteration code
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (8 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 09/14] btrfs: scrub: implement the chunk iteration code for scrub_logical Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03  7:32 ` [PATCH 11/14] btrfs: scrub: implement the repair part of scrub logical Qu Wenruo
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

Unlike the per-device scrub code, the logical bytenr based scrub code
has a much easier time iterating the extents.

The difference is mainly in the stripe queueing part, as we need to
queue @nr_copies stripes at once.

So here we introduce a helper, scrub_copy_stripe(): we fill the first
stripe, then use that helper to fill the remaining copies.

For now, we still use the same flush_scrub_stripes(), but the repair
part is less efficient.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 163 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 160 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 79e47eee9d1b..02b212fe4712 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1471,7 +1471,7 @@ static void scrub_stripe_reset_bitmaps(struct scrub_stripe *stripe)
 static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
 					struct btrfs_device *dev, u64 physical,
 					int mirror_num, u64 logical_start,
-					u32 logical_len,
+					u64 logical_len,
 					struct scrub_stripe *stripe)
 {
 	struct btrfs_fs_info *fs_info = bg->fs_info;
@@ -2984,6 +2984,164 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	return ret;
 }
 
+static void scrub_copy_stripe(const struct scrub_stripe *src,
+			      struct scrub_stripe *dst)
+{
+	struct scrub_ctx *sctx = src->sctx;
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+
+	/* This function should only be called for logical scrub. */
+	ASSERT(sctx->scrub_logical);
+	scrub_reset_stripe(dst);
+
+	dst->bg = src->bg;
+	dst->dev = NULL;
+	dst->physical = 0;
+	dst->logical = src->logical;
+	dst->mirror_num = src->mirror_num;
+
+	bitmap_copy(&dst->extent_sector_bitmap, &src->extent_sector_bitmap,
+		    src->nr_sectors);
+	memcpy(dst->csums, src->csums,
+	       (BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits) *
+		fs_info->csum_size);
+
+	for (int i = 0; i < src->nr_sectors; i++) {
+		dst->sectors[i].is_metadata = src->sectors[i].is_metadata;
+		if (src->sectors[i].is_metadata)
+			dst->sectors[i].generation = src->sectors[i].generation;
+		else if (src->sectors[i].csum)
+			dst->sectors[i].csum = dst->csums + i * fs_info->csum_size;
+	}
+	dst->nr_data_extents = src->nr_data_extents;
+	dst->nr_meta_extents = src->nr_meta_extents;
+	set_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &dst->state);
+}
+
+/*
+ * Unlike queue_scrub_stripe() which only queue one stripe, this queue all
+ * mirrors for non-RAID56 profiles, or the full stripe for RAID56.
+ */
+static int queue_scrub_logical_stripes(struct scrub_ctx *sctx,
+			struct btrfs_block_group *bg, u64 logical)
+{
+	const int raid_index = btrfs_bg_flags_to_raid_index(bg->flags);
+	const int nr_copies = btrfs_raid_array[raid_index].ncopies;
+	const int old_cur = sctx->cur_stripe;
+	struct scrub_stripe *stripe;
+	int ret;
+
+	ASSERT(sctx->stripes);
+	ASSERT(sctx->nr_stripes);
+
+	if (sctx->cur_stripe + nr_copies > sctx->nr_stripes) {
+		ret = flush_scrub_stripes(sctx);
+		if (ret < 0)
+			return ret;
+	}
+
+	stripe = &sctx->stripes[old_cur];
+
+	scrub_reset_stripe(stripe);
+	ret = scrub_find_fill_first_stripe(bg, NULL, 0, 1,
+			logical, bg->start + bg->length - logical, stripe);
+	/* Either >0 as no more extents or <0 for error. */
+	if (ret)
+		return ret;
+
+	/* For the remaining slots, just copy the above mirror. */
+	for (int i = 1; i < nr_copies; i++) {
+		struct scrub_stripe *dst = &sctx->stripes[old_cur + i];
+
+		scrub_copy_stripe(stripe, dst);
+		dst->mirror_num = i + 1;
+	}
+	sctx->cur_stripe += nr_copies;
+	return 0;
+}
+
+static int scrub_logical_one_chunk(struct scrub_ctx *sctx,
+				   struct btrfs_block_group *bg)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	struct extent_map_tree *map_tree = &fs_info->mapping_tree;
+	struct extent_map *em;
+	u64 cur = bg->start;
+	const int raid_index = btrfs_bg_flags_to_raid_index(bg->flags);
+	const int nr_copies = btrfs_raid_array[raid_index].ncopies;
+	int nr_stripes;
+	int ret = 0;
+	int flush_ret;
+
+	read_lock(&map_tree->lock);
+	em = lookup_extent_mapping(map_tree, bg->start, bg->length);
+	read_unlock(&map_tree->lock);
+	if (!em) {
+		/*
+		 * Might have been an unused block group deleted by the cleaner
+		 * kthread or relocation.
+		 */
+		spin_lock(&bg->lock);
+		if (!test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags))
+			ret = -EINVAL;
+		spin_unlock(&bg->lock);
+		return ret;
+	}
+
+	scrub_blocked_if_needed(fs_info);
+
+	/* RAID56 not yet supported. */
+	if (bg->flags & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	nr_stripes = SCRUB_STRIPES_PER_SCTX * nr_copies;
+	ret = alloc_scrub_stripes(sctx, nr_stripes);
+	if (ret < 0)
+		goto out;
+
+	while (cur < bg->start + bg->length) {
+		/* Canceled? */
+		if (atomic_read(&fs_info->scrub_cancel_req) ||
+		    atomic_read(&sctx->cancel_req)) {
+			ret = -ECANCELED;
+			break;
+		}
+		/* Paused? */
+		if (atomic_read(&fs_info->scrub_pause_req)) {
+			/* Push queued extents */
+			scrub_blocked_if_needed(fs_info);
+		}
+		/* Block group removed? */
+		spin_lock(&bg->lock);
+		if (test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags)) {
+			spin_unlock(&bg->lock);
+			ret = 0;
+			break;
+		}
+		spin_unlock(&bg->lock);
+
+		ret = queue_scrub_logical_stripes(sctx, bg, cur);
+		if (ret > 0) {
+			ret = 0;
+			break;
+		}
+		if (ret < 0)
+			break;
+		ASSERT(sctx->cur_stripe >= 1);
+		cur = sctx->stripes[sctx->cur_stripe - 1].logical + BTRFS_STRIPE_LEN;
+	}
+out:
+	flush_ret = flush_scrub_stripes(sctx);
+	if (!ret)
+		ret = flush_ret;
+	free_scrub_stripes(sctx);
+	free_extent_map(em);
+	return ret;
+
+}
+
 static int scrub_logical_chunks(struct scrub_ctx *sctx, u64 start, u64 end)
 {
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
@@ -3014,8 +3172,7 @@ static int scrub_logical_chunks(struct scrub_ctx *sctx, u64 start, u64 end)
 			break;
 		}
 
-		/* The real work starts here. */
-		ret = -EOPNOTSUPP;
+		ret = scrub_logical_one_chunk(sctx, bg);
 
 		if (ro_set)
 			btrfs_dec_block_group_ro(bg);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 11/14] btrfs: scrub: implement the repair part of scrub logical
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (9 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 10/14] btrfs: scrub: implement the basic extent iteration code Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03  7:32 ` [PATCH 12/14] btrfs: scrub: add RAID56 support for queue_scrub_logical_stripes() Qu Wenruo
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

The repair part of scrub logical is done differently compared to the
per-device counterpart:

- We read out all mirrors in one go
  Since it's no longer per-device, we just read out all mirrors.

- Find a good mirror for each sectornr among all the mirrors

- Copy the good content to any corrupted sector

This has several advantages:

- Less IO wait
  Since all IOs are submitted at the very beginning, we avoid the
  read-then-wait cycle of the per-device scrub.

  This applies to both the read and write parts.

This needs some changes to the per-device scrub code though:

- Call scrub_verify_one_stripe() inside scrub_read_endio()
  This improves performance, as we can do csum verification per mirror.

- Do not queue the scrub_stripe_read_repair_worker() workload for
  scrub_logical
  Since we don't need to go through the per-device repair path.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 148 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 138 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 02b212fe4712..21b3fa55d0f0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -995,7 +995,6 @@ static void scrub_stripe_read_repair_worker(struct work_struct *work)
 	ASSERT(stripe->mirror_num > 0);
 
 	wait_scrub_stripe_io(stripe);
-	scrub_verify_one_stripe(stripe, stripe->extent_sector_bitmap);
 	/* Save the initial failed bitmap for later repair and report usage. */
 	stripe->init_error_bitmap = stripe->error_bitmap;
 	stripe->init_nr_io_errors = bitmap_weight(&stripe->io_error_bitmap,
@@ -1068,9 +1067,12 @@ static void scrub_read_endio(struct btrfs_bio *bbio)
 	}
 	bio_put(&bbio->bio);
 	if (atomic_dec_and_test(&stripe->pending_io)) {
+		scrub_verify_one_stripe(stripe, stripe->extent_sector_bitmap);
 		wake_up(&stripe->io_wait);
-		INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
-		queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
+		if (!stripe->sctx->scrub_logical) {
+			INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
+			queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
+		}
 	}
 }
 
@@ -1655,7 +1657,131 @@ static bool stripe_has_metadata_error(struct scrub_stripe *stripe)
 	return false;
 }
 
-static int flush_scrub_stripes(struct scrub_ctx *sctx)
+/*
+ * Unlike the per-device repair, we have all mirrors read out already.
+ *
+ * Thus we only need to find out the good mirror, and copy the content to
+ * any bad sectors.
+ */
+static void repair_one_mirror_group(struct scrub_ctx *sctx, int start_stripe,
+				    int ncopies)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	struct scrub_stripe *first_stripe = &sctx->stripes[start_stripe];
+	struct scrub_stripe *cur_stripe;
+	const u32 sectorsize = fs_info->sectorsize;
+	int sectornr;
+
+	ASSERT(start_stripe + ncopies <= sctx->cur_stripe);
+
+	for_each_set_bit(sectornr, &first_stripe->extent_sector_bitmap,
+			 first_stripe->nr_sectors) {
+		struct scrub_stripe *good_stripe;
+		int good_mirror = -1;
+
+		for (int cur_mirror = start_stripe;
+		     cur_mirror < start_stripe + ncopies; cur_mirror++) {
+			cur_stripe = &sctx->stripes[cur_mirror];
+
+			if (!test_bit(sectornr, &cur_stripe->error_bitmap)) {
+				good_mirror = cur_mirror;
+				break;
+			}
+		}
+		/* No good mirror found, this vertical stripe can not be repaired. */
+		if (good_mirror < 0)
+			continue;
+
+		good_stripe = &sctx->stripes[good_mirror];
+
+		for (int cur_mirror = start_stripe;
+		     cur_mirror < start_stripe + ncopies; cur_mirror++) {
+			cur_stripe = &sctx->stripes[cur_mirror];
+
+			if (!test_bit(sectornr, &cur_stripe->error_bitmap))
+				continue;
+			/* Repair from the good mirror. */
+			memcpy_page(scrub_stripe_get_page(cur_stripe, sectornr),
+				    scrub_stripe_get_page_offset(cur_stripe, sectornr),
+				    scrub_stripe_get_page(good_stripe, sectornr),
+				    scrub_stripe_get_page_offset(good_stripe, sectornr),
+				    sectorsize);
+			clear_bit(sectornr, &cur_stripe->error_bitmap);
+			clear_bit(sectornr, &cur_stripe->io_error_bitmap);
+			if (cur_stripe->sectors[sectornr].is_metadata)
+				clear_bit(sectornr, &cur_stripe->meta_error_bitmap);
+			else
+				clear_bit(sectornr, &cur_stripe->csum_error_bitmap);
+		}
+	}
+	for (int cur_mirror = start_stripe; cur_mirror < start_stripe + ncopies;
+	     cur_mirror++) {
+		cur_stripe = &sctx->stripes[cur_mirror];
+		set_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &cur_stripe->state);
+		scrub_stripe_report_errors(sctx, cur_stripe);
+		wake_up(&cur_stripe->repair_wait);
+
+		if (btrfs_is_zoned(fs_info)) {
+			if (!bitmap_empty(&cur_stripe->init_error_bitmap,
+					  cur_stripe->nr_sectors)) {
+				btrfs_repair_one_zone(fs_info, cur_stripe->logical);
+				break;
+			}
+		}
+		if (!sctx->readonly) {
+			unsigned long repaired;
+
+			bitmap_andnot(&repaired, &cur_stripe->init_error_bitmap,
+				      &cur_stripe->error_bitmap,
+				      cur_stripe->nr_sectors);
+			scrub_write_sectors(sctx, cur_stripe, repaired, false);
+		}
+	}
+	/* Wait for above writeback to finish. */
+	for (int cur_mirror = start_stripe; cur_mirror < start_stripe + ncopies;
+	     cur_mirror++) {
+		cur_stripe = &sctx->stripes[cur_mirror];
+
+		wait_scrub_stripe_io(cur_stripe);
+	}
+}
+
+static int handle_logical_stripes(struct scrub_ctx *sctx,
+				  struct btrfs_block_group *bg)
+{
+	const int nr_stripes = sctx->cur_stripe;
+	const int group_stripes = btrfs_bg_type_to_factor(bg->flags);
+	struct scrub_stripe *stripe;
+
+	for (int i = 0; i < nr_stripes; i++) {
+		stripe = &sctx->stripes[i];
+
+		scrub_submit_initial_read(sctx, stripe);
+	}
+	for (int i = 0; i < nr_stripes; i++) {
+		stripe = &sctx->stripes[i];
+
+		wait_scrub_stripe_io(stripe);
+
+		/* Save the initial failed bitmap for later repair and report usage. */
+		stripe->init_error_bitmap = stripe->error_bitmap;
+		stripe->init_nr_io_errors =
+			bitmap_weight(&stripe->io_error_bitmap, stripe->nr_sectors);
+		stripe->init_nr_csum_errors =
+			bitmap_weight(&stripe->csum_error_bitmap, stripe->nr_sectors);
+		stripe->init_nr_meta_errors =
+			bitmap_weight(&stripe->meta_error_bitmap, stripe->nr_sectors);
+	}
+
+	for (int i = 0; i < nr_stripes; i += group_stripes)
+		repair_one_mirror_group(sctx, i, group_stripes);
+	sctx->cur_stripe = 0;
+
+	return 0;
+}
+
+static int flush_scrub_stripes(struct scrub_ctx *sctx,
+			       struct btrfs_block_group *bg)
 {
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	struct scrub_stripe *stripe;
@@ -1666,6 +1792,9 @@ static int flush_scrub_stripes(struct scrub_ctx *sctx)
 	if (!nr_stripes)
 		return 0;
 
+	if (sctx->scrub_logical)
+		return handle_logical_stripes(sctx, bg);
+
 	ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &sctx->stripes[0].state));
 
 	scrub_throttle_dev_io(sctx, sctx->stripes[0].dev,
@@ -1762,7 +1891,7 @@ static int queue_scrub_stripe(struct scrub_ctx *sctx, struct btrfs_block_group *
 
 	/* No available slot, submit all stripes and wait for them. */
 	if (sctx->cur_stripe >= sctx->nr_stripes) {
-		ret = flush_scrub_stripes(sctx);
+		ret = flush_scrub_stripes(sctx, bg);
 		if (ret < 0)
 			return ret;
 	}
@@ -2250,7 +2379,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 			break;
 	}
 out:
-	ret2 = flush_scrub_stripes(sctx);
+	ret2 = flush_scrub_stripes(sctx, bg);
 	if (!ret)
 		ret = ret2;
 	if (sctx->raid56_data_stripes) {
@@ -3025,8 +3154,7 @@ static void scrub_copy_stripe(const struct scrub_stripe *src,
 static int queue_scrub_logical_stripes(struct scrub_ctx *sctx,
 			struct btrfs_block_group *bg, u64 logical)
 {
-	const int raid_index = btrfs_bg_flags_to_raid_index(bg->flags);
-	const int nr_copies = btrfs_raid_array[raid_index].ncopies;
+	const int nr_copies = btrfs_bg_type_to_factor(bg->flags);
 	const int old_cur = sctx->cur_stripe;
 	struct scrub_stripe *stripe;
 	int ret;
@@ -3035,7 +3163,7 @@ static int queue_scrub_logical_stripes(struct scrub_ctx *sctx,
 	ASSERT(sctx->nr_stripes);
 
 	if (sctx->cur_stripe + nr_copies > sctx->nr_stripes) {
-		ret = flush_scrub_stripes(sctx);
+		ret = flush_scrub_stripes(sctx, bg);
 		if (ret < 0)
 			return ret;
 	}
@@ -3133,7 +3261,7 @@ static int scrub_logical_one_chunk(struct scrub_ctx *sctx,
 		cur = sctx->stripes[sctx->cur_stripe - 1].logical + BTRFS_STRIPE_LEN;
 	}
 out:
-	flush_ret = flush_scrub_stripes(sctx);
+	flush_ret = flush_scrub_stripes(sctx, bg);
 	if (!ret)
 		ret = flush_ret;
 	free_scrub_stripes(sctx);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 12/14] btrfs: scrub: add RAID56 support for queue_scrub_logical_stripes()
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (10 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 11/14] btrfs: scrub: implement the repair part of scrub logical Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03  7:32 ` [PATCH 13/14] btrfs: scrub: introduce the RAID56 data recovery path for scrub logical Qu Wenruo
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

There are several factors that need to be considered for a RAID56 full
stripe:

- Empty data stripes
  We still need those empty data stripes, as they may be involved in the
  P/Q recovery.

- P/Q stripes
  We need special handling for P/Q stripes.

So this patch introduces a new helper, queue_raid56_full_stripes(), to
handle those special requirements.
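
For example, on a 4-disk RAID6 chunk (2 data stripes plus P and Q), one
call would roughly queue the following (illustration only):

  stripes[N + 0]   data stripe 1   mirror_num =  1
  stripes[N + 1]   data stripe 2   mirror_num =  1
  stripes[N + 2]   P stripe        mirror_num = -1
  stripes[N + 3]   Q stripe        mirror_num = -2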

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 108 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 105 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 21b3fa55d0f0..f37b9fd30595 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -105,7 +105,7 @@ struct scrub_stripe {
 	u64 logical;
 	u64 physical;
 
-	u16 mirror_num;
+	s16 mirror_num;
 
 	/* Should be BTRFS_STRIPE_LEN / sectorsize. */
 	u16 nr_sectors;
@@ -3147,15 +3147,109 @@ static void scrub_copy_stripe(const struct scrub_stripe *src,
 	set_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &dst->state);
 }
 
+static int queue_raid56_full_stripes(struct scrub_ctx *sctx,
+			struct btrfs_block_group *bg, struct map_lookup *map,
+			u64 logical)
+{
+	const int num_stripes = map->num_stripes;
+	const int old_cur = sctx->cur_stripe;
+	const int data_stripes = nr_data_stripes(map);
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, bg->start);
+	struct btrfs_path path = { 0 };
+	unsigned long extents_bitmap = 0;
+	u64 extent_start;
+	u64 extent_len;
+	u64 extent_flags;
+	u64 extent_gen;
+	u64 full_stripe_start;
+	u32 full_stripe_nr;
+	int ret;
+
+	path.search_commit_root = 1;
+	path.skip_locking = 1;
+
+	ret = find_first_extent_item(extent_root, &path, logical,
+				     bg->start + bg->length - logical);
+	/* Error or no more extent found. */
+	if (ret)
+		return ret;
+	get_extent_info(&path, &extent_start, &extent_len, &extent_flags,
+			&extent_gen);
+	btrfs_release_path(&path);
+
+	full_stripe_nr = rounddown((extent_start - bg->start) >>
+				   BTRFS_STRIPE_LEN_SHIFT, data_stripes);
+	full_stripe_start = btrfs_stripe_nr_to_offset(full_stripe_nr) + bg->start;
+
+	for (int i = 0; i < data_stripes; i++) {
+		struct scrub_stripe *stripe = &sctx->stripes[old_cur + i];
+		u64 cur = full_stripe_start + btrfs_stripe_nr_to_offset(i);
+
+		scrub_reset_stripe(stripe);
+		ret = scrub_find_fill_first_stripe(bg, NULL, 0, 1, cur,
+						   BTRFS_STRIPE_LEN, stripe);
+		if (ret < 0)
+			return ret;
+		/*
+		 * No extent in the data stripe, we need to fill the info
+		 * manually.
+		 */
+		if (ret > 0) {
+			stripe->logical = cur;
+			stripe->physical = 0;
+			stripe->dev = NULL;
+			stripe->bg = bg;
+			stripe->mirror_num = 1;
+			set_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state);
+			ret = 0;
+		}
+		bitmap_or(&extents_bitmap, &extents_bitmap,
+			  &stripe->extent_sector_bitmap, stripe->nr_sectors);
+	}
+
+	/* Manually fill the P/Q stripes */
+	for (int i = 0; i < num_stripes - data_stripes; i++) {
+		struct scrub_stripe *stripe =
+			&sctx->stripes[old_cur + data_stripes + i];
+		scrub_reset_stripe(stripe);
+		/*
+		 * Use the logical of the last data stripes, so the caller
+		 * knows exactly where the next logical should start, no matter
+		 * if it's RAID56 or not.
+		 */
+		stripe->logical = full_stripe_start +
+				btrfs_stripe_nr_to_offset(data_stripes - 1);
+		stripe->dev = NULL;
+		stripe->bg = bg;
+		stripe->mirror_num = -1 - i;
+		bitmap_copy(&stripe->extent_sector_bitmap, &extents_bitmap,
+			    stripe->nr_sectors);
+		/*
+		 * For parity stripes, they don't need any csum checks,
+		 * so mark them as nocsum data.
+		 */
+		for (int sector = 0; sector < stripe->nr_sectors; sector++) {
+			stripe->sectors[sector].is_metadata = false;
+			stripe->sectors[sector].csum = NULL;
+		}
+		set_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state);
+	}
+	sctx->cur_stripe += num_stripes;
+	return 0;
+}
+
 /*
  * Unlike queue_scrub_stripe() which only queue one stripe, this queue all
  * mirrors for non-RAID56 profiles, or the full stripe for RAID56.
  */
 static int queue_scrub_logical_stripes(struct scrub_ctx *sctx,
-			struct btrfs_block_group *bg, u64 logical)
+			struct btrfs_block_group *bg, struct map_lookup *map,
+			u64 logical)
 {
 	const int nr_copies = btrfs_bg_type_to_factor(bg->flags);
 	const int old_cur = sctx->cur_stripe;
+	bool is_raid56 = bg->flags & BTRFS_BLOCK_GROUP_RAID56_MASK;
 	struct scrub_stripe *stripe;
 	int ret;
 
@@ -3168,6 +3262,14 @@ static int queue_scrub_logical_stripes(struct scrub_ctx *sctx,
 			return ret;
 	}
 
+	/*
+	 * For RAID56, we need to queue the full stripe. If there is some data
+	 * stripes containing no sectors, we still need to queue those empty
+	 * stripes for later recovery and P/Q verification.
+	 */
+	if (is_raid56)
+		return queue_raid56_full_stripes(sctx, bg, map, logical);
+
 	stripe = &sctx->stripes[old_cur];
 
 	scrub_reset_stripe(stripe);
@@ -3250,7 +3352,7 @@ static int scrub_logical_one_chunk(struct scrub_ctx *sctx,
 		}
 		spin_unlock(&bg->lock);
 
-		ret = queue_scrub_logical_stripes(sctx, bg, cur);
+		ret = queue_scrub_logical_stripes(sctx, bg, em->map_lookup, cur);
 		if (ret > 0) {
 			ret = 0;
 			break;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 13/14] btrfs: scrub: introduce the RAID56 data recovery path for scrub logical
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (11 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 12/14] btrfs: scrub: add RAID56 support for queue_scrub_logical_stripes() Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03  7:32 ` [PATCH 14/14] btrfs: scrub: implement the RAID56 P/Q scrub code Qu Wenruo
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

The recovery path would use the cached stripe contents to make sure we
won't waste IO re-reading the same contents from disks.

This is done by reusing the raid56_parity_cache_pages() function to
cache the data and P/Q stripes.

And since we need to call raid56_parity_cache_pages(), we can not go
through the same btrfs_submit_bio() path for the repair, thus we
introduce the following two helpers:

- submit_cached_recover_bio()
  This does all the necessary preparation: grab a bioc and an rbio,
  fill the cache, and finally submit the bio and wait for it.

- scrub_submit_cached_raid56_repair_read()
  This does mostly the same work as scrub_stripe_submit_repair_read(),
  but since we can not go through btrfs_submit_bio(), we have to take a
  different path.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c   | 244 +++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.c |  20 +---
 fs/btrfs/volumes.h |  16 +++
 3 files changed, 253 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f37b9fd30595..4ca502fa0b40 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -174,6 +174,7 @@ struct scrub_stripe {
 struct scrub_ctx {
 	struct scrub_stripe	*stripes;
 	struct scrub_stripe	*raid56_data_stripes;
+	struct map_lookup	*map;
 	struct btrfs_fs_info	*fs_info;
 	int			first_free;
 	int			cur_stripe;
@@ -846,6 +847,130 @@ static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe,
 	}
 }
 
+static void raid56_scrub_wait_endio(struct bio *bio)
+{
+	complete(bio->bi_private);
+}
+
+/*
+ * The read repair path of scrub_logical.
+ *
+ * This would use the full stripe cache as cache, so we can afford to do
+ * synchronous repair here, as everything is done inside memory.
+ */
+static void submit_cached_recover_bio(struct scrub_stripe *first_stripe,
+				      int group_stripes,
+				      struct scrub_stripe *stripe,
+				      struct bio *bio,
+				      int mirror)
+{
+	DECLARE_COMPLETION_ONSTACK(io_done);
+	struct btrfs_fs_info *fs_info = first_stripe->bg->fs_info;
+	struct btrfs_io_context *bioc = NULL;
+	struct btrfs_raid_bio *rbio;
+	u64 bio_logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	u32 bio_size = bio->bi_iter.bi_size;
+	u64 length = fs_info->sectorsize;
+	unsigned long bio_bitmap = 0;
+	int sector_start = calc_sector_number(stripe, bio_first_bvec_all(bio));
+	int ret;
+
+	ASSERT(mirror > 1);
+	ASSERT(bio_size);
+	ASSERT(bio_logical >= first_stripe->logical);
+
+	bio->bi_private = &io_done;
+	bio->bi_end_io = raid56_scrub_wait_endio;
+	btrfs_bio_counter_inc_blocked(fs_info);
+	ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, first_stripe->logical,
+			      &length, &bioc, NULL, 0, 1);
+	if (ret < 0)
+		goto out;
+
+	rbio = raid56_parity_alloc_recover_rbio(bio, bioc, mirror);
+	btrfs_put_bioc(bioc);
+	if (!rbio) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (int i = 0; i < group_stripes; i++) {
+		struct scrub_stripe *stripe = first_stripe + i;
+
+		raid56_parity_cache_pages(rbio, stripe->pages, i);
+	}
+	raid56_parity_submit_recover_rbio(rbio);
+	wait_for_completion_io(&io_done);
+	ret = blk_status_to_errno(bio->bi_status);
+out:
+	btrfs_bio_counter_dec(fs_info);
+	if (ret < 0) {
+		bitmap_set(&stripe->io_error_bitmap, sector_start,
+			   bio_size >> fs_info->sectorsize_bits);
+		bitmap_set(&stripe->error_bitmap, sector_start,
+			   bio_size >> fs_info->sectorsize_bits);
+		return;
+	}
+	bitmap_set(&bio_bitmap, sector_start,
+		   bio_size >> fs_info->sectorsize_bits);
+	/* Verify the repaired sector. */
+	scrub_verify_one_stripe(stripe, bio_bitmap);
+}
+
+static void scrub_submit_cached_raid56_repair_read(
+				struct scrub_stripe *first_stripe,
+				int group_stripes,
+				struct scrub_stripe *stripe,
+				int mirror)
+{
+	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
+	struct bio *bio = NULL;
+	int i;
+
+	ASSERT(stripe->mirror_num >= 1);
+	ASSERT(atomic_read(&stripe->pending_io) == 0);
+
+	/*
+	 * Here we will repair every sector even if it's not covered by an
+	 * extent.
+	 * This is to handle missing device, so that we won't re-generate
+	 * P/Q using garbage.
+	 */
+	for_each_set_bit(i, &stripe->error_bitmap, stripe->nr_sectors) {
+		struct page *page;
+		int pgoff;
+		int ret;
+
+		page = scrub_stripe_get_page(stripe, i);
+		pgoff = scrub_stripe_get_page_offset(stripe, i);
+
+		/* The current sector cannot be merged, submit the bio. */
+		if (bio && i > 0 && !test_bit(i - 1, &stripe->error_bitmap)) {
+			ASSERT(bio->bi_iter.bi_size);
+			submit_cached_recover_bio(first_stripe, group_stripes,
+						  stripe, bio, mirror);
+			bio_put(bio);
+			bio = NULL;
+		}
+
+		if (!bio) {
+			bio = bio_alloc(NULL, stripe->nr_sectors, REQ_OP_READ,
+					GFP_NOFS);
+			bio->bi_iter.bi_sector = (stripe->logical +
+				(i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
+		}
+
+		ret = bio_add_page(bio, page, fs_info->sectorsize, pgoff);
+		ASSERT(ret == fs_info->sectorsize);
+	}
+	if (bio) {
+		ASSERT(bio->bi_iter.bi_size);
+		submit_cached_recover_bio(first_stripe, group_stripes, stripe,
+					  bio, mirror);
+		bio_put(bio);
+	}
+}
+
 static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 				       struct scrub_stripe *stripe)
 {
@@ -1657,6 +1782,94 @@ static bool stripe_has_metadata_error(struct scrub_stripe *stripe)
 	return false;
 }
 
+static int repair_raid56_full_stripe(struct scrub_ctx *sctx, int start_stripe,
+				     int group_stripes)
+{
+	const int tolerance = btrfs_map_num_copies(sctx->map);
+	struct scrub_stripe *first_stripe = &sctx->stripes[start_stripe];
+	struct scrub_stripe *p_stripe;
+	unsigned long extents_bitmap;
+	int data_stripes = nr_data_stripes(sctx->map);
+	bool has_error = false;
+	int ret = 0;
+
+	ASSERT(start_stripe + group_stripes <= sctx->cur_stripe);
+	ASSERT(sctx->map);
+
+	/* Zoned fs doesn't support RAID56 yet. */
+	ASSERT(!btrfs_is_zoned(sctx->fs_info));
+
+	p_stripe = first_stripe + data_stripes;
+
+	bitmap_copy(&extents_bitmap, &p_stripe->extent_sector_bitmap,
+		    p_stripe->nr_sectors);
+
+	for (int stripe_nr = 0; stripe_nr < data_stripes; stripe_nr++) {
+		struct scrub_stripe *stripe = first_stripe + stripe_nr;
+		unsigned long final_errors = 0;
+
+		for (int mirror = 2; mirror <= tolerance; mirror++) {
+			scrub_submit_cached_raid56_repair_read(first_stripe,
+					group_stripes, stripe, mirror);
+		}
+		bitmap_and(&final_errors, &stripe->error_bitmap,
+			   &stripe->extent_sector_bitmap, stripe->nr_sectors);
+		if (!bitmap_empty(&final_errors, stripe->nr_sectors)) {
+			btrfs_err_rl(sctx->fs_info,
+"unrepaired sector found, full_stripe %llu stripe_num %u error_sectors %*pbl",
+				     first_stripe->logical, stripe_nr,
+				     stripe->nr_sectors, &final_errors);
+			has_error = true;
+		}
+		/*
+		 * Here we continue repairing the remaining data stripes,
+		 * This is to update the accounting to have a correct view
+		 * of the errors we have.
+		 */
+	}
+	if (has_error) {
+		btrfs_err(sctx->fs_info, "failed to repair full stripe %llu",
+			  first_stripe->logical);
+		ret = -EIO;
+		goto report;
+	}
+
+	/* Submit the write for repaired data sectors. */
+	if (!sctx->readonly) {
+		for (int stripe_nr = 0; stripe_nr < data_stripes; stripe_nr++) {
+			struct scrub_stripe *stripe = first_stripe + stripe_nr;
+			unsigned long repaired;
+
+			bitmap_andnot(&repaired, &stripe->init_error_bitmap,
+				      &stripe->error_bitmap,
+				      stripe->nr_sectors);
+			scrub_write_sectors(sctx, stripe, repaired, false);
+		}
+	}
+	/* Wait for above writeback to finish. */
+	for (int stripe_nr = 0; stripe_nr < data_stripes; stripe_nr++) {
+		struct scrub_stripe *stripe = first_stripe + stripe_nr;
+
+		wait_scrub_stripe_io(stripe);
+	}
+
+	/* Still need to scrub P/Q stripes. */
+	ret = -EOPNOTSUPP;
+report:
+	/*
+	 * Update the accounting for data stripes.
+	 *
+	 * Unfortunately the current scrub structure doesn't have the extra
+	 * members to report errors about P/Q sectors.
+	 */
+	for (int stripe_nr = 0; stripe_nr < data_stripes; stripe_nr++) {
+		struct scrub_stripe *stripe = first_stripe + stripe_nr;
+
+		scrub_stripe_report_errors(sctx, stripe);
+	}
+	return ret;
+}
+
 /*
  * Unlike the per-device repair, we have all mirrors read out already.
  *
@@ -1750,8 +1963,14 @@ static int handle_logical_stripes(struct scrub_ctx *sctx,
 				  struct btrfs_block_group *bg)
 {
 	const int nr_stripes = sctx->cur_stripe;
-	const int group_stripes = btrfs_bg_type_to_factor(bg->flags);
+	const bool is_raid56 = bg->flags & BTRFS_BLOCK_GROUP_RAID56_MASK;
+	const int group_stripes = is_raid56 ? sctx->map->num_stripes :
+					      btrfs_bg_type_to_factor(bg->flags);
 	struct scrub_stripe *stripe;
+	int ret = 0;
+
+	/* For scrub_logical, sctx->map should be set. */
+	ASSERT(sctx->map);
 
 	for (int i = 0; i < nr_stripes; i++) {
 		stripe = &sctx->stripes[i];
@@ -1773,11 +1992,21 @@ static int handle_logical_stripes(struct scrub_ctx *sctx,
 			bitmap_weight(&stripe->meta_error_bitmap, stripe->nr_sectors);
 	}
 
-	for (int i = 0; i < nr_stripes; i += group_stripes)
-		repair_one_mirror_group(sctx, i, group_stripes);
+
+	for (int i = 0; i < nr_stripes; i += group_stripes) {
+		int tmp;
+
+		if (is_raid56) {
+			tmp = repair_raid56_full_stripe(sctx, i, group_stripes);
+			if (!ret)
+				ret = tmp;
+		} else {
+			repair_one_mirror_group(sctx, i, group_stripes);
+		}
+	}
 	sctx->cur_stripe = 0;
 
-	return 0;
+	return ret;
 }
 
 static int flush_scrub_stripes(struct scrub_ctx *sctx,
@@ -1874,11 +2103,6 @@ static int flush_scrub_stripes(struct scrub_ctx *sctx,
 	return ret;
 }
 
-static void raid56_scrub_wait_endio(struct bio *bio)
-{
-	complete(bio->bi_private);
-}
-
 static int queue_scrub_stripe(struct scrub_ctx *sctx, struct btrfs_block_group *bg,
 			      struct btrfs_device *dev, int mirror_num,
 			      u64 logical, u32 length, u64 physical)
@@ -3317,6 +3541,7 @@ static int scrub_logical_one_chunk(struct scrub_ctx *sctx,
 		spin_unlock(&bg->lock);
 		return ret;
 	}
+	sctx->map = em->map_lookup;
 
 	scrub_blocked_if_needed(fs_info);
 
@@ -3368,6 +3593,7 @@ static int scrub_logical_one_chunk(struct scrub_ctx *sctx,
 		ret = flush_ret;
 	free_scrub_stripes(sctx);
 	free_extent_map(em);
+	sctx->map = NULL;
 	return ret;
 
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c22007bd830b..1f79180dab83 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5752,22 +5752,6 @@ void btrfs_mapping_tree_free(struct extent_map_tree *tree)
 	}
 }
 
-static int map_num_copies(const struct map_lookup *map)
-{
-	if (map->type & BTRFS_BLOCK_GROUP_RAID5)
-		return 2;
-	if (map->type & BTRFS_BLOCK_GROUP_RAID6)
-		/*
-		 * There could be two corrupted data stripes, we need
-		 * to loop retry in order to rebuild the correct data.
-		 *
-		 * Fail a stripe at a time on every retry except the
-		 * stripe under reconstruction.
-		 */
-		return map->num_stripes;
-	return btrfs_bg_type_to_factor(map->type);
-}
-
 int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 {
 	struct extent_map *em;
@@ -5783,7 +5767,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 		 */
 		return 1;
 
-	ret = map_num_copies(em->map_lookup);
+	ret = btrfs_map_num_copies(em->map_lookup);
 	free_extent_map(em);
 	return ret;
 }
@@ -6294,7 +6278,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		return PTR_ERR(em);
 	map = em->map_lookup;
 
-	if (mirror_num > map_num_copies(map)) {
+	if (mirror_num > btrfs_map_num_copies(map)) {
 		ret = -EINVAL;
 		goto out_free_em;
 	}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f4f410550306..0753675a2c5b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -752,4 +752,20 @@ bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
 
 bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
 
+static inline int btrfs_map_num_copies(struct map_lookup *map)
+{
+	if (map->type & BTRFS_BLOCK_GROUP_RAID5)
+		return 2;
+	if (map->type & BTRFS_BLOCK_GROUP_RAID6)
+		/*
+		 * There could be two corrupted data stripes, we need
+		 * to loop retry in order to rebuild the correct data.
+		 *
+		 * Fail a stripe at a time on every retry except the
+		 * stripe under reconstruction.
+		 */
+		return map->num_stripes;
+	return btrfs_bg_type_to_factor(map->type);
+}
+
 #endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 14/14] btrfs: scrub: implement the RAID56 P/Q scrub code
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (12 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 13/14] btrfs: scrub: introduce the RAID56 data recovery path for scrub logical Qu Wenruo
@ 2023-07-03  7:32 ` Qu Wenruo
  2023-07-03 12:58 ` [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Graham Cobb
  2023-07-13 12:14 ` David Sterba
  15 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  7:32 UTC (permalink / raw)
  To: linux-btrfs

For RAID56 P/Q stripe verification, we already have an existing path
inside scrub_raid56_parity_stripe().

Although the existing path only verifies one P/Q stripe, with recent
changes to scrub_rbio() we only need to pass a NULL @scrub_dev to
raid56_parity_alloc_scrub_rbio(), and then we can verify both P/Q
stripes in one go.

So here we introduce a helper, submit_raid56_pq_scrub_bio(), to do the
necessary work of P/Q stripe verification and repair.

The new helper is utilized by both scrub_logical and the existing
per-device scrub, and both use the repaired data stripes to reduce the
IO for RAID56 scrub.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 106 +++++++++++++++++++++++++++--------------------
 1 file changed, 62 insertions(+), 44 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 4ca502fa0b40..2a882a07b388 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1782,6 +1782,58 @@ static bool stripe_has_metadata_error(struct scrub_stripe *stripe)
 	return false;
 }
 
+static int submit_raid56_pq_scrub_bio(struct scrub_ctx *sctx,
+				      struct map_lookup *map,
+				      struct scrub_stripe *first_data_stripe,
+				      int group_stripes,
+				      struct btrfs_device *scrub_dev,
+				      unsigned long extent_bitmap)
+{
+	DECLARE_COMPLETION_ONSTACK(io_done);
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	struct btrfs_io_context *bioc = NULL;
+	struct btrfs_raid_bio *rbio;
+	struct bio *bio;
+	const int data_stripes = nr_data_stripes(map);
+	u64 full_stripe_start = first_data_stripe->logical;
+	u64 length = btrfs_stripe_nr_to_offset(data_stripes);
+	int ret;
+
+	bio = bio_alloc(NULL, 1, REQ_OP_READ, GFP_NOFS);
+	bio->bi_iter.bi_sector = full_stripe_start >> SECTOR_SHIFT;
+	bio->bi_private = &io_done;
+	bio->bi_end_io = raid56_scrub_wait_endio;
+
+	btrfs_bio_counter_inc_blocked(fs_info);
+	ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, full_stripe_start,
+			      &length, &bioc, NULL, NULL, 1);
+	if (ret < 0) {
+		btrfs_put_bioc(bioc);
+		btrfs_bio_counter_dec(fs_info);
+		return ret;
+	}
+	rbio = raid56_parity_alloc_scrub_rbio(bio, bioc, scrub_dev, &extent_bitmap);
+	btrfs_put_bioc(bioc);
+	if (!rbio) {
+		ret = -ENOMEM;
+		btrfs_bio_counter_dec(fs_info);
+		return ret;
+	}
+	/* Use the recovered stripes as cache to avoid read them from disk again. */
+	ASSERT(group_stripes <= map->num_stripes);
+	for (int i = 0; i < group_stripes; i++) {
+		struct scrub_stripe *stripe = first_data_stripe + i;
+
+		raid56_parity_cache_pages(rbio, stripe->pages, i);
+	}
+	raid56_parity_submit_scrub_rbio(rbio);
+	wait_for_completion_io(&io_done);
+	ret = blk_status_to_errno(bio->bi_status);
+	bio_put(bio);
+	btrfs_bio_counter_dec(fs_info);
+	return 0;
+}
+
 static int repair_raid56_full_stripe(struct scrub_ctx *sctx, int start_stripe,
 				     int group_stripes)
 {
@@ -1853,8 +1905,14 @@ static int repair_raid56_full_stripe(struct scrub_ctx *sctx, int start_stripe,
 		wait_scrub_stripe_io(stripe);
 	}
 
-	/* Still need to scrub P/Q stripes. */
-	ret = -EOPNOTSUPP;
+	/*
+	 * All data stripes repaired, now can generate and verify P/Q stripes.
+	 *
+	 * Pass NULL as @scrub_dev, so that both P/Q stripes would be verified.
+	 */
+	ret = submit_raid56_pq_scrub_bio(sctx, sctx->map, first_stripe,
+					 sctx->map->num_stripes, NULL,
+					 extents_bitmap);
 report:
 	/*
 	 * Update the accounting for data stripes.
@@ -2141,14 +2199,10 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 {
 	DECLARE_COMPLETION_ONSTACK(io_done);
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
-	struct btrfs_raid_bio *rbio;
-	struct btrfs_io_context *bioc = NULL;
-	struct bio *bio;
 	struct scrub_stripe *stripe;
 	bool all_empty = true;
 	const int data_stripes = nr_data_stripes(map);
 	unsigned long extent_bitmap = 0;
-	u64 length = btrfs_stripe_nr_to_offset(data_stripes);
 	int ret;
 
 	ASSERT(sctx->raid56_data_stripes);
@@ -2261,38 +2315,8 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 	}
 
 	/* Now we can check and regenerate the P/Q stripe. */
-	bio = bio_alloc(NULL, 1, REQ_OP_READ, GFP_NOFS);
-	bio->bi_iter.bi_sector = full_stripe_start >> SECTOR_SHIFT;
-	bio->bi_private = &io_done;
-	bio->bi_end_io = raid56_scrub_wait_endio;
-
-	btrfs_bio_counter_inc_blocked(fs_info);
-	ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, full_stripe_start,
-			      &length, &bioc, NULL, NULL, 1);
-	if (ret < 0) {
-		btrfs_put_bioc(bioc);
-		btrfs_bio_counter_dec(fs_info);
-		goto out;
-	}
-	rbio = raid56_parity_alloc_scrub_rbio(bio, bioc, scrub_dev, &extent_bitmap);
-	btrfs_put_bioc(bioc);
-	if (!rbio) {
-		ret = -ENOMEM;
-		btrfs_bio_counter_dec(fs_info);
-		goto out;
-	}
-	/* Use the recovered stripes as cache to avoid read them from disk again. */
-	for (int i = 0; i < data_stripes; i++) {
-		stripe = &sctx->raid56_data_stripes[i];
-
-		raid56_parity_cache_pages(rbio, stripe->pages, i);
-	}
-	raid56_parity_submit_scrub_rbio(rbio);
-	wait_for_completion_io(&io_done);
-	ret = blk_status_to_errno(bio->bi_status);
-	bio_put(bio);
-	btrfs_bio_counter_dec(fs_info);
-
+	ret = submit_raid56_pq_scrub_bio(sctx, map, sctx->raid56_data_stripes,
+					 data_stripes, scrub_dev, extent_bitmap);
 out:
 	return ret;
 }
@@ -3545,12 +3569,6 @@ static int scrub_logical_one_chunk(struct scrub_ctx *sctx,
 
 	scrub_blocked_if_needed(fs_info);
 
-	/* RAID56 not yet supported. */
-	if (bg->flags & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-		ret = -EOPNOTSUPP;
-		goto out;
-	}
-
 	nr_stripes = SCRUB_STRIPES_PER_SCTX * nr_copies;
 	ret = alloc_scrub_stripes(sctx, nr_stripes);
 	if (ret < 0)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 05/14] btrfs: add the ability to read P/Q stripes directly
  2023-07-03  7:32 ` [PATCH 05/14] btrfs: add the ability to read P/Q stripes directly Qu Wenruo
@ 2023-07-03  9:46   ` Qu Wenruo
  0 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03  9:46 UTC (permalink / raw)
  To: linux-btrfs



On 2023/7/3 15:32, Qu Wenruo wrote:
> Currently there is no way to read P/Q stripes of a RAID56 full stripe
> directly.
> 
> Normally caller should call btrfs_map_block() directly and fetch the
> physical location directly of the P/Q stripes.
> 
> But for the recent scrub rework, it's no longer that elegant as the full
> scrub code is based on mirror_num based solution, thus this patch would
> add two special mirror_num for this usages:
> 
> - Mirror -1 for P stripes
> - Mirror -2 for Q stripes
> 
> Both mirrors only support read for now, and caller should make sure the
> range start is stripe aligned.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/scrub.c   |  2 +-
>   fs/btrfs/volumes.c | 30 +++++++++++++++++++++++++++++-
>   2 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 1864856abb13..41c514db0793 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1601,7 +1601,7 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
>   	int mirror = stripe->mirror_num;
>   
>   	ASSERT(stripe->bg);
> -	ASSERT(stripe->mirror_num > 0);
> +	ASSERT(stripe->mirror_num);
>   	ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
>   
>   	bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index cd632b3f579f..c22007bd830b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6251,10 +6251,12 @@ static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *
>    *
>    *			For RAID56 profile, mirror 1 means rebuild from P and
>    *			the remaining data stripes.
> + *			And mirror -1 means read P stripes directly, -2 means
> + *			read Q stripes directly.
>    *
>    *			For RAID6 profile, mirror > 2 means mark another
>    *			data/P stripe error and rebuild from the remaining
> - *			stripes..
> + *			stripes.
>    *
>    * @need_raid_map:	(Used only for integrity checker) whether the map wants
>    *                      a full stripe map (including all data and P/Q stripes)
> @@ -6297,6 +6299,12 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>   		goto out_free_em;
>   	}
>   
> +	/* Only allow mirror_num < 0 for RAID56. */
> +	if (mirror_num < 0 && !(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)) {
> +		free_extent_map(em);
> +		return -EINVAL;
> +	}
> +
>   	data_stripes = nr_data_stripes(map);
>   
>   	map_offset = logical - em->start;
> @@ -6382,6 +6390,26 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>   				  logical;
>   			stripe_index = 0;
>   			stripe_offset = 0;
> +		} else if (mirror_num < 0) {
> +			/* Only allow READ for direct P/Q operations. */
> +			ASSERT(op == BTRFS_MAP_READ);
> +			/*
> +			 * Caller should make sure the range start is stripe
> +			 * aligned.
> +			 */
> +			ASSERT(stripe_offset == 0);
> +
> +			/*
> +			 * Stripes of @bioc is already sorted, stripes[0] is the
> +			 * first data stripe and stripes[@data_stripes] is the
> +			 * P stripe.
> +			 * So we only need to update the @stripe_index to the
> +			 * specified stripe, and set @stripe_nr/@stripe_offset
> +			 * to 0, so we can return the beginning of the P/Q stripe.
> +			 */
> +			stripe_offset = 0;
> +			stripe_nr = 0;

Oh, the stripe_nr calculation is wrong, it still needs to take the full
stripe number into consideration.

Would update this patch in my branch.
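
E.g. something like this (untested, just to show the idea):

 			stripe_offset = 0;
-			stripe_nr = 0;
+			/* Use the full stripe number inside the chunk. */
+			stripe_nr /= data_stripes;
 			stripe_index = data_stripes + (-mirror_num - 1);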

Thanks,
Qu

> +			stripe_index = data_stripes + (-mirror_num - 1);
>   		} else {
>   			/*
>   			 * Mirror #0 or #1 means the original data block.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01/14] btrfs: raid56: remove unnecessary parameter for raid56_parity_alloc_scrub_rbio()
  2023-07-03  7:32 ` [PATCH 01/14] btrfs: raid56: remove unnecessary parameter for raid56_parity_alloc_scrub_rbio() Qu Wenruo
@ 2023-07-03 12:33   ` Anand Jain
  2023-07-12 16:23   ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Anand Jain @ 2023-07-03 12:33 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 3/7/23 15:32, Qu Wenruo wrote:
> The parameter @stripe_nsectors is easily calculated using
> BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits, and is already cached in
> rbio->stripe_nsectors.
> 
> Thus we don't need to pass that parameter from the caller, and can
> remove it safely.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Anand Jain <anand.jain@oracle.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (13 preceding siblings ...)
  2023-07-03  7:32 ` [PATCH 14/14] btrfs: scrub: implement the RAID56 P/Q scrub code Qu Wenruo
@ 2023-07-03 12:58 ` Graham Cobb
  2023-07-03 22:40   ` Qu Wenruo
  2023-07-13 12:14 ` David Sterba
  15 siblings, 1 reply; 24+ messages in thread
From: Graham Cobb @ 2023-07-03 12:58 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 03/07/2023 08:32, Qu Wenruo wrote:
> [CHANGELOG]
> RFC->v1:
> - Add RAID56 support
>   Which is the biggest advantage of the new scrub mode.
> 
> - More basic tests around various repair situations
> 
> [REPO]
> Please fetch from github repo:
> https://github.com/adam900710/linux/tree/scrub_logical
> 
> This is based on David's misc-next, but still has one extra regression
> fix which is not yet merged. ("btrfs: raid56: always verify the P/Q
> contents for scrub")
> 
> [BACKGROUND]
> Scrub originally works in a per-device basis, that means if we want to
> scrub the full fs, we would queue a scrub for each writeable device.
> 
> This is normally fine, but some behavior is not ideal like the
> following:
> 		X	X+16K	X+32K
>  Mirror 1	|XXXXXXX|       |
>  Mirror 2	|       |XXXXXXX|
> 
> When scrubbing mirror 1, we found [X, X+16K) has corruption, then we
> will try to read from mirror 2 and repair using the correct data.
> 
> This applies to range [X+16K, X+32K) of mirror 2, causing the good copy
> to be read twice, which may or may not be a big deal.
> 
> But the problem can easily go crazy for RAID56, as its repair requires
> the full stripe to be read out, so is its P/Q verification, e.g.:
> 
> 		X	X+16K	X+32K
>  Data stripe 1	|XXXXXXX|       |	|	|
>  Data stripe 2	|       |XXXXXXX|	|	|
>  Parity stripe	|       |	|XXXXXXX|	|
> 
> In above case, if we're scrubbing all mirrors, we would read the same
> contents again and again:
> 
> Scrub on mirror 1:
> - Read data stripe 1 for the initial read.
> - Read data stripes 1 + 2 + parity for the rebuild.
> 
> Scrub on mirror 2:
> - Read data stripe 2 for the initial read.
> - Read data stripes 1 + 2 + parity for the rebuild.
> 
> Scrub on Parity
> - Read data stripes 1 + 2 for the data stripes verification
> - Read data stripes 1 + 2 + parity for the data rebuild
>   This step is already improved by recent optimization to use cached
>   stripes.
> - Read the parity stripe for the final verification
> 
> So for data stripes, they are read *five* times, and *three* times for
> parity stripes.
> 
> [ENHANCEMENT]
> Instead of the existing per-device scrub, this patchset introduce a new
> flag, SCRUB_LOGICAL, to do logical address space based scrub.
> 
> Unlike per-device scrub, this new flag would make scrub a per-fs
> operation.
> 
> This allows us to scrub the different mirrors in one go, and avoid
> unnecessary read to repair the corruption.
> 
> This means, no matter what profile, they always read the same data just
> once.
> 
> This also makes user space handling much simpler, just one ioctl to
> scrub the full fs, without the current multi-thread scrub code.

I have a comment on terminology. If I understand correctly, this flag
changes scrub from an operation performed in parallel on all devices, to
one done sequentially, with less parallelism.

The original code scrubs every device at the same time. In very rough
terms, for a filesystem with more devices than copies, the duration for
a scrub with no errors is the time taken to read every block on the most
occupied device. Other disks will finish earlier.

In the same case, the new code will take the time taken to read one
block from every file (not just those on the most occupied disk). It is
not clear to me how much parallelism will occur in this case.

I am not saying it isn't worth doing, but that it may be best to explain
it in terms of parallel vs. sequential rather than physical vs. logical.
Certainly when making the user documentation and the scrub command line
clear to the user, and possibly even in the naming of the flag (e.g.
SCRUB_SEQUENTIAL instead of SCRUB_LOGICAL).

> 
> [TODO]
> - More testing
>   Currently only done functional tests, no stress tests yet.
> 
> - Better progs integration
>   In theory we can silently try SCRUB_LOGICAL first, if the kernel
>   doesn't support it, then fallback to the old multi-device scrub.

Maybe not if my understanding is correct. On filesystems with more disks
than copies the new code may take noticeably longer?

Or do I misunderstand?

Graham

> 
>   But for current testing, a dedicated option is still assigned for
>   scrub subcommand.
> 
>   And currently it only supports full report, no summary nor scrub file
>   support.
> 
> Qu Wenruo (14):
>   btrfs: raid56: remove unnecessary parameter for
>     raid56_parity_alloc_scrub_rbio()
>   btrfs: raid56: allow scrub operation to update both P and Q stripes
>   btrfs: raid56: allow caching P/Q stripes
>   btrfs: raid56: add the interfaces to submit recovery rbio
>   btrfs: add the ability to read P/Q stripes directly
>   btrfs: scrub: make scrub_ctx::stripes dynamically allocated
>   btrfs: scrub: introduce the skeleton for logical-scrub
>   btrfs: scrub: extract the common preparation before scrubbing a block
>     group
>   btrfs: scrub: implement the chunk iteration code for scrub_logical
>   btrfs: scrub: implement the basic extent iteration code
>   btrfs: scrub: implement the repair part of scrub logical
>   btrfs: scrub: add RAID56 support for queue_scrub_logical_stripes()
>   btrfs: scrub: introduce the RAID56 data recovery path for scrub
>     logical
>   btrfs: scrub: implement the RAID56 P/Q scrub code
> 
>  fs/btrfs/disk-io.c         |    1 +
>  fs/btrfs/fs.h              |   12 +
>  fs/btrfs/ioctl.c           |    6 +-
>  fs/btrfs/raid56.c          |  134 +++-
>  fs/btrfs/raid56.h          |   17 +-
>  fs/btrfs/scrub.c           | 1198 ++++++++++++++++++++++++++++++------
>  fs/btrfs/scrub.h           |    2 +
>  fs/btrfs/volumes.c         |   50 +-
>  fs/btrfs/volumes.h         |   16 +
>  include/uapi/linux/btrfs.h |   11 +-
>  10 files changed, 1206 insertions(+), 241 deletions(-)
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag
  2023-07-03 12:58 ` [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Graham Cobb
@ 2023-07-03 22:40   ` Qu Wenruo
  2023-07-03 23:19     ` Graham Cobb
  0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2023-07-03 22:40 UTC (permalink / raw)
  To: Graham Cobb, Qu Wenruo, linux-btrfs



On 2023/7/3 20:58, Graham Cobb wrote:
> On 03/07/2023 08:32, Qu Wenruo wrote:
>> [CHANGELOG]
>> RFC->v1:
>> - Add RAID56 support
>>    Which is the biggest advantage of the new scrub mode.
>>
>> - More basic tests around various repair situations
>>
>> [REPO]
>> Please fetch from github repo:
>> https://github.com/adam900710/linux/tree/scrub_logical
>>
>> This is based on David's misc-next, but still has one extra regression
>> fix which is not yet merged. ("btrfs: raid56: always verify the P/Q
>> contents for scrub")
>>
>> [BACKGROUND]
>> Scrub originally works in a per-device basis, that means if we want to
>> scrub the full fs, we would queue a scrub for each writeable device.
>>
>> This is normally fine, but some behavior is not ideal like the
>> following:
>> 		X	X+16K	X+32K
>>   Mirror 1	|XXXXXXX|       |
>>   Mirror 2	|       |XXXXXXX|
>>
>> When scrubbing mirror 1, we found [X, X+16K) has corruption, then we
>> will try to read from mirror 2 and repair using the correct data.
>>
>> This applies to range [X+16K, X+32K) of mirror 2, causing the good copy
>> to be read twice, which may or may not be a big deal.
>>
>> But the problem can easily go crazy for RAID56, as its repair requires
>> the full stripe to be read out, so is its P/Q verification, e.g.:
>>
>> 		X	X+16K	X+32K
>>   Data stripe 1	|XXXXXXX|       |	|	|
>>   Data stripe 2	|       |XXXXXXX|	|	|
>>   Parity stripe	|       |	|XXXXXXX|	|
>>
>> In above case, if we're scrubbing all mirrors, we would read the same
>> contents again and again:
>>
>> Scrub on mirror 1:
>> - Read data stripe 1 for the initial read.
>> - Read data stripes 1 + 2 + parity for the rebuild.
>>
>> Scrub on mirror 2:
>> - Read data stripe 2 for the initial read.
>> - Read data stripes 1 + 2 + parity for the rebuild.
>>
>> Scrub on Parity
>> - Read data stripes 1 + 2 for the data stripes verification
>> - Read data stripes 1 + 2 + parity for the data rebuild
>>    This step is already improved by recent optimization to use cached
>>    stripes.
>> - Read the parity stripe for the final verification
>>
>> So for data stripes, they are read *five* times, and *three* times for
>> parity stripes.
>>
>> [ENHANCEMENT]
>> Instead of the existing per-device scrub, this patchset introduce a new
>> flag, SCRUB_LOGICAL, to do logical address space based scrub.
>>
>> Unlike per-device scrub, this new flag would make scrub a per-fs
>> operation.
>>
>> This allows us to scrub the different mirrors in one go, and avoid
>> unnecessary read to repair the corruption.
>>
>> This means, no matter what profile, they always read the same data just
>> once.
>>
>> This also makes user space handling much simpler, just one ioctl to
>> scrub the full fs, without the current multi-thread scrub code.
>
> I have a comment on terminology. If I understand correctly, this flag
> changes scrub from an operation performed in parallel on all devices, to
> one done sequentially, with less parallelism.

It depends.

There are several different factors affecting the concurrency even for
the mentioned cases.

Like different devices racing on the dev-extent and csum tree lookups.

In fact, for nr_devices > mirrors profiles (i.e. RAID1*, DUP, SINGLE),
the different scrub paces can lead to more problems, like too many block
groups being marked RO.

For this situation it's really hard to predict the impact, but I'm
optimistic about the performance, and it will finally be determined by
real world benchmarks.

Please remember that the worst case profiles are mostly only utilized
by metadata, while for data it's way more common to go RAID0/RAID10 or
even RAID5/6 for adventurous users, and the new interface is way better
for those profiles.

>
> The original code scrubs every device at the same time. In very rough
> terms, for a filesystem with more devices than copies, the duration for
> a scrub with no errors is the time taken to read every block on the most
> occupied device. Other disks will finish earlier.
>
> In the same case, the new code will take the time taken to read one
> block from every file (not just those on the most occupied disk). It is
> not clear to me how much parallelism will occur in this case.
>
> I am not saying it isn't worth doing, but that it may be best to explain
> it in terms of parallel vs. sequential rather than physical vs. logical.
> Certainly in making the user documentation, and scrub command line,
> clear to the user and possibly even in the naming of the flag (e.g.
> SCRUB_SEQUENTIAL instead of SCRUB_LOGICAL).
>
>>
>> [TODO]
>> - More testing
>>    Currently only done functional tests, no stress tests yet.
>>
>> - Better progs integration
>>    In theory we can silently try SCRUB_LOGICAL first, if the kernel
>>    doesn't support it, then fallback to the old multi-device scrub.
>
> Maybe not if my understanding is correct. On filesystems with more disks
> than copies the new code may take noticeably longer?

Not really.

I'd like to have some real world benchmark for these cases instead.

Thanks,
Qu

>
> Or do I misunderstand?
>
> Graham
>
>>
>>    But for current testing, a dedicated option is still assigned for
>>    scrub subcommand.
>>
>>    And currently it only supports full report, no summary nor scrub file
>>    support.
>>
>> Qu Wenruo (14):
>>    btrfs: raid56: remove unnecessary parameter for
>>      raid56_parity_alloc_scrub_rbio()
>>    btrfs: raid56: allow scrub operation to update both P and Q stripes
>>    btrfs: raid56: allow caching P/Q stripes
>>    btrfs: raid56: add the interfaces to submit recovery rbio
>>    btrfs: add the ability to read P/Q stripes directly
>>    btrfs: scrub: make scrub_ctx::stripes dynamically allocated
>>    btrfs: scrub: introduce the skeleton for logical-scrub
>>    btrfs: scrub: extract the common preparation before scrubbing a block
>>      group
>>    btrfs: scrub: implement the chunk iteration code for scrub_logical
>>    btrfs: scrub: implement the basic extent iteration code
>>    btrfs: scrub: implement the repair part of scrub logical
>>    btrfs: scrub: add RAID56 support for queue_scrub_logical_stripes()
>>    btrfs: scrub: introduce the RAID56 data recovery path for scrub
>>      logical
>>    btrfs: scrub: implement the RAID56 P/Q scrub code
>>
>>   fs/btrfs/disk-io.c         |    1 +
>>   fs/btrfs/fs.h              |   12 +
>>   fs/btrfs/ioctl.c           |    6 +-
>>   fs/btrfs/raid56.c          |  134 +++-
>>   fs/btrfs/raid56.h          |   17 +-
>>   fs/btrfs/scrub.c           | 1198 ++++++++++++++++++++++++++++++------
>>   fs/btrfs/scrub.h           |    2 +
>>   fs/btrfs/volumes.c         |   50 +-
>>   fs/btrfs/volumes.h         |   16 +
>>   include/uapi/linux/btrfs.h |   11 +-
>>   10 files changed, 1206 insertions(+), 241 deletions(-)
>>
>


* Re: [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag
  2023-07-03 22:40   ` Qu Wenruo
@ 2023-07-03 23:19     ` Graham Cobb
  0 siblings, 0 replies; 24+ messages in thread
From: Graham Cobb @ 2023-07-03 23:19 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 03/07/2023 23:40, Qu Wenruo wrote:
> 
> 
> On 2023/7/3 20:58, Graham Cobb wrote:
>> On 03/07/2023 08:32, Qu Wenruo wrote:
>>> [CHANGELOG]
>>> RFC->v1:
>>> - Add RAID56 support
>>>    Which is the biggest advantage of the new scrub mode.
>>>
>>> - More basic tests around various repair situations
>>>
>>> [REPO]
>>> Please fetch from github repo:
>>> https://github.com/adam900710/linux/tree/scrub_logical
>>>
>>> This is based on David's misc-next, but still has one extra regression
>>> fix which is not yet merged. ("btrfs: raid56: always verify the P/Q
>>> contents for scrub")
>>>
>>> [BACKGROUND]
>>> Scrub originally works in a per-device basis, that means if we want to
>>> scrub the full fs, we would queue a scrub for each writeable device.
>>>
>>> This is normally fine, but some behavior is not ideal like the
>>> following:
>>>         X    X+16K    X+32K
>>>   Mirror 1    |XXXXXXX|       |
>>>   Mirror 2    |       |XXXXXXX|
>>>
>>> When scrubbing mirror 1, we found [X, X+16K) has corruption, then we
>>> will try to read from mirror 2 and repair using the correct data.
>>>
>>> This applies to range [X+16K, X+32K) of mirror 2, causing the good copy
>>> to be read twice, which may or may not be a big deal.
>>>
>>> But the problem can easily go crazy for RAID56, as its repair requires
>>> the full stripe to be read out, so is its P/Q verification, e.g.:
>>>
>>>         X    X+16K    X+32K
>>>   Data stripe 1    |XXXXXXX|       |    |    |
>>>   Data stripe 2    |       |XXXXXXX|    |    |
>>>   Parity stripe    |       |    |XXXXXXX|    |
>>>
>>> In above case, if we're scrubbing all mirrors, we would read the same
>>> contents again and again:
>>>
>>> Scrub on mirror 1:
>>> - Read data stripe 1 for the initial read.
>>> - Read data stripes 1 + 2 + parity for the rebuild.
>>>
>>> Scrub on mirror 2:
>>> - Read data stripe 2 for the initial read.
>>> - Read data stripes 1 + 2 + parity for the rebuild.
>>>
>>> Scrub on Parity
>>> - Read data stripes 1 + 2 for the data stripes verification
>>> - Read data stripes 1 + 2 + parity for the data rebuild
>>>    This step is already improved by recent optimization to use cached
>>>    stripes.
>>> - Read the parity stripe for the final verification
>>>
>>> So for data stripes, they are read *five* times, and *three* times for
>>> parity stripes.
>>>
>>> [ENHANCEMENT]
>>> Instead of the existing per-device scrub, this patchset introduce a new
>>> flag, SCRUB_LOGICAL, to do logical address space based scrub.
>>>
>>> Unlike per-device scrub, this new flag would make scrub a per-fs
>>> operation.
>>>
>>> This allows us to scrub the different mirrors in one go, and avoid
>>> unnecessary read to repair the corruption.
>>>
>>> This means, no matter what profile, they always read the same data just
>>> once.
>>>
>>> This also makes user space handling much simpler, just one ioctl to
>>> scrub the full fs, without the current multi-thread scrub code.
>>
>> I have a comment on terminology. If I understand correctly, this flag
>> changes scrub from an operation performed in parallel on all devices, to
>> one done sequentially, with less parallelism.
> 
> It depends.
> 
> There are several different factors affecting the concurrency even for
> the mentioned cases.
> 
> For example, different devices racing on the dev-extent and csum tree lookups.
> 
> In fact, for profiles where nr_devices > nr_mirrors (i.e. RAID1*, DUP,
> SINGLE), the differing scrub paces can lead to more problems, like too
> many block groups being marked read-only.
> 
> For this situation it's really hard to predict the impact, but I'm
> optimistic about the performance, and it will ultimately be determined
> by real-world benchmarks.
> 
> Please remember that the worst-case profiles are mostly used only for
> metadata, while for data it's far more common to go RAID0/RAID10, or
> even RAID5/6 for adventurous users, and the new interface is much
> better for those profiles.

It would be great to see some real measurements, of course, but I am
thinking of cases like RAID1 data on 3, 4 or even more disks. It feels
like a RAID1 filesystem with 4 disks might take up to twice as long with
logical address-based scrub instead of per-device scrub. And I can't
even begin to visualise how it would perform with RAID5 data over 10
disks, say.
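
As a rough sketch of where that "up to twice as long" intuition comes
from (a purely hypothetical model with made-up numbers, assuming
per-device scrub keeps every disk busy while logical scrub only keeps
the disks backing the current block group busy; this is not a claim
about the actual implementation):

#include <stdio.h>

/*
 * Back-of-envelope model only. The pessimistic assumption is that
 * logical scrub processes one block group at a time, so only the disks
 * holding that block group's copies are reading at any moment.
 */
int main(void)
{
	double data    = 1099511627776.0;   /* 1 TiB of logical data */
	double bw      = 200.0 * 1048576.0; /* 200 MiB/s per disk    */
	double copies  = 2.0;               /* RAID1                 */
	double n_disks = 4.0;

	/* Per-device scrub: all disks busy in parallel, each holding
	 * roughly data * copies / n_disks bytes. */
	double per_device = (data * copies / n_disks) / bw;

	/* Logical scrub, pessimistic: only `copies` disks busy at once. */
	double logical = (data * copies) / (copies * bw);

	printf("per-device: %.1f h, logical: %.1f h\n",
	       per_device / 3600.0, logical / 3600.0);
	/* Prints roughly 0.7 h vs 1.5 h, i.e. about 2x for 4 disks
	 * and 2 copies, matching the intuition above. */
	return 0;
}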

But getting benchmarks and practical experience is obviously a good
idea. Just please don't take control of which algorithm to use away from
the system manager without being sure it is really better.

Graham

> 
>>
>> The original code scrubs every device at the same time. In very rough
>> terms, for a filesystem with more devices than copies, the duration for
>> a scrub with no errors is the time taken to read every block on the most
>> occupied device. Other disks will finish earlier.
>>
>> In the same case, the new code will take the time taken to read one
>> block from every file (not just those on the most occupied disk). It is
>> not clear to me how much parallelism will occur in this case.
>>
>> I am not saying it isn't worth doing, but that it may be best to explain
>> it in terms of parallel vs. sequential rather than physical vs. logical.
>> Certainly in making the user documentation, and scrub command line,
>> clear to the user and possibly even in the naming of the flag (e.g.
>> SCRUB_SEQUENTIAL instead of SCRUB_LOGICAL).
>>
>>>
>>> [TODO]
>>> - More testing
>>>    Currently only done functional tests, no stress tests yet.
>>>
>>> - Better progs integration
>>>    In theory we can silently try SCRUB_LOGICAL first, if the kernel
>>>    doesn't support it, then fallback to the old multi-device scrub.
>>
>> Maybe not if my understanding is correct. On filesystems with more disks
>> than copies the new code may take noticeably longer?
> 
> Not really.
> 
> I'd like to have some real-world benchmarks for these cases instead.
> 
> Thanks,
> Qu
> 
>>
>> Or do I misunderstand?
>>
>> Graham
>>
>>>
>>>    But for current testing, a dedicated option is still assigned for
>>>    scrub subcommand.
>>>
>>>    And currently it only supports full report, no summary nor scrub file
>>>    support.
>>>
>>> Qu Wenruo (14):
>>>    btrfs: raid56: remove unnecessary parameter for
>>>      raid56_parity_alloc_scrub_rbio()
>>>    btrfs: raid56: allow scrub operation to update both P and Q stripes
>>>    btrfs: raid56: allow caching P/Q stripes
>>>    btrfs: raid56: add the interfaces to submit recovery rbio
>>>    btrfs: add the ability to read P/Q stripes directly
>>>    btrfs: scrub: make scrub_ctx::stripes dynamically allocated
>>>    btrfs: scrub: introduce the skeleton for logical-scrub
>>>    btrfs: scrub: extract the common preparation before scrubbing a block
>>>      group
>>>    btrfs: scrub: implement the chunk iteration code for scrub_logical
>>>    btrfs: scrub: implement the basic extent iteration code
>>>    btrfs: scrub: implement the repair part of scrub logical
>>>    btrfs: scrub: add RAID56 support for queue_scrub_logical_stripes()
>>>    btrfs: scrub: introduce the RAID56 data recovery path for scrub
>>>      logical
>>>    btrfs: scrub: implement the RAID56 P/Q scrub code
>>>
>>>   fs/btrfs/disk-io.c         |    1 +
>>>   fs/btrfs/fs.h              |   12 +
>>>   fs/btrfs/ioctl.c           |    6 +-
>>>   fs/btrfs/raid56.c          |  134 +++-
>>>   fs/btrfs/raid56.h          |   17 +-
>>>   fs/btrfs/scrub.c           | 1198 ++++++++++++++++++++++++++++++------
>>>   fs/btrfs/scrub.h           |    2 +
>>>   fs/btrfs/volumes.c         |   50 +-
>>>   fs/btrfs/volumes.h         |   16 +
>>>   include/uapi/linux/btrfs.h |   11 +-
>>>   10 files changed, 1206 insertions(+), 241 deletions(-)
>>>
>>



* Re: [PATCH 01/14] btrfs: raid56: remove unnecessary parameter for raid56_parity_alloc_scrub_rbio()
  2023-07-03  7:32 ` [PATCH 01/14] btrfs: raid56: remove unnecessary parameter for raid56_parity_alloc_scrub_rbio() Qu Wenruo
  2023-07-03 12:33   ` Anand Jain
@ 2023-07-12 16:23   ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2023-07-12 16:23 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 02/14] btrfs: raid56: allow scrub operation to update both P and Q stripes
  2023-07-03  7:32 ` [PATCH 02/14] btrfs: raid56: allow scrub operation to update both P and Q stripes Qu Wenruo
@ 2023-07-12 16:24   ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2023-07-12 16:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 03/14] btrfs: raid56: allow caching P/Q stripes
  2023-07-03  7:32 ` [PATCH 03/14] btrfs: raid56: allow caching P/Q stripes Qu Wenruo
@ 2023-07-12 16:27   ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2023-07-12 16:27 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jul 03, 2023 at 03:32:27PM +0800, Qu Wenruo wrote:
> @@ -82,6 +82,13 @@ struct btrfs_raid_bio {
>  	 */
>  	int bio_list_bytes;
>  
> +	/*
> +	 * If this rbio is forced to use cached stripes provided by the caller.
> +	 *
> +	 * Used by scrub path to reduce IO.
> +	 */
> +	bool cached;

You probably want to move this next to the five u8 values for better
packing.
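
For illustration, a minimal user-space sketch of the padding effect
(member names and counts other than `cached` are made up and do not
reflect the real struct btrfs_raid_bio layout):

#include <stdbool.h>
#include <stdint.h>

struct rbio_sketch_before {
	uint64_t a;
	uint8_t  f1, f2, f3, f4, f5; /* 5 bytes, then 3 bytes of padding */
	uint64_t b;
	bool     cached;             /* opens a new slot: 1 byte + 7 bytes padding */
	uint64_t c;
};                                   /* sizeof == 40 on x86_64 */

struct rbio_sketch_after {
	uint64_t a;
	uint8_t  f1, f2, f3, f4, f5; /* 5 bytes */
	bool     cached;             /* reuses one of the existing padding bytes */
	uint64_t b;                  /* 2 bytes of padding before this member */
	uint64_t c;
};                                   /* sizeof == 32 on x86_64 */

Moving the bool next to the existing u8 members lets it share the
padding that is already there instead of growing the struct.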


* Re: [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag
  2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
                   ` (14 preceding siblings ...)
  2023-07-03 12:58 ` [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Graham Cobb
@ 2023-07-13 12:14 ` David Sterba
  15 siblings, 0 replies; 24+ messages in thread
From: David Sterba @ 2023-07-13 12:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jul 03, 2023 at 03:32:24PM +0800, Qu Wenruo wrote:
> [CHANGELOG]
> RFC->v1:
> - Add RAID56 support
>   Which is the biggest advantage of the new scrub mode.

Great, thanks. The new functionality will have to wait until the scrub
performance regression is fixed though.


end of thread

Thread overview: 24+ messages
2023-07-03  7:32 [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Qu Wenruo
2023-07-03  7:32 ` [PATCH 01/14] btrfs: raid56: remove unnecessary parameter for raid56_parity_alloc_scrub_rbio() Qu Wenruo
2023-07-03 12:33   ` Anand Jain
2023-07-12 16:23   ` Christoph Hellwig
2023-07-03  7:32 ` [PATCH 02/14] btrfs: raid56: allow scrub operation to update both P and Q stripes Qu Wenruo
2023-07-12 16:24   ` Christoph Hellwig
2023-07-03  7:32 ` [PATCH 03/14] btrfs: raid56: allow caching P/Q stripes Qu Wenruo
2023-07-12 16:27   ` Christoph Hellwig
2023-07-03  7:32 ` [PATCH 04/14] btrfs: raid56: add the interfaces to submit recovery rbio Qu Wenruo
2023-07-03  7:32 ` [PATCH 05/14] btrfs: add the ability to read P/Q stripes directly Qu Wenruo
2023-07-03  9:46   ` Qu Wenruo
2023-07-03  7:32 ` [PATCH 06/14] btrfs: scrub: make scrub_ctx::stripes dynamically allocated Qu Wenruo
2023-07-03  7:32 ` [PATCH 07/14] btrfs: scrub: introduce the skeleton for logical-scrub Qu Wenruo
2023-07-03  7:32 ` [PATCH 08/14] btrfs: scrub: extract the common preparation before scrubbing a block group Qu Wenruo
2023-07-03  7:32 ` [PATCH 09/14] btrfs: scrub: implement the chunk iteration code for scrub_logical Qu Wenruo
2023-07-03  7:32 ` [PATCH 10/14] btrfs: scrub: implement the basic extent iteration code Qu Wenruo
2023-07-03  7:32 ` [PATCH 11/14] btrfs: scrub: implement the repair part of scrub logical Qu Wenruo
2023-07-03  7:32 ` [PATCH 12/14] btrfs: scrub: add RAID56 support for queue_scrub_logical_stripes() Qu Wenruo
2023-07-03  7:32 ` [PATCH 13/14] btrfs: scrub: introduce the RAID56 data recovery path for scrub logical Qu Wenruo
2023-07-03  7:32 ` [PATCH 14/14] btrfs: scrub: implement the RAID56 P/Q scrub code Qu Wenruo
2023-07-03 12:58 ` [PATCH 00/14] btrfs: scrub: introduce SCRUB_LOGICAL flag Graham Cobb
2023-07-03 22:40   ` Qu Wenruo
2023-07-03 23:19     ` Graham Cobb
2023-07-13 12:14 ` David Sterba
