* [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs
@ 2022-09-03  8:19 Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls Qu Wenruo
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  8:19 UTC (permalink / raw)
  To: linux-btrfs

Besides the write-hole problem of RAID56, scrub is also not RAID56
friendly, in the following ways:

- Extra IO for RAID56 scrub
  Currently data stripes of RAID56 can be read 2x (RAID5) or 3x (RAID6).

  This is caused by the fact that we do one-thread-per-device scrub.

  Dev 1    |  Data 1  | P(3 + 4) |
  Dev 2    |  Data 2  |  Data 3  |
  Dev 3    | P(1 + 2) |  Data 4  |

  When scrubbing Dev 1, we will read Data 1 (treated no differently than
  SINGLE), then read Parity (3 + 4).
  But to determine if Parity (3 + 4) is correct, we have to read Data 3
  and Data 4.

  On the other hand, Data 3 will also be read by scrubbing Dev 2,
  and Data 4 will also be read by scrubbing Dev 3.

  Thus all data stripes will be read twice, causing a slowdown in
  RAID56 scrubbing.

- No proper progress report for P/Q stripes
  The scrub_progress structure has no members for P/Q error reporting
  at all.

  Thus even if we fix some P/Q errors, they will not be reported at all.

To address the above problems, this patchset will introduce a new
family of ioctls, the scrub_fs ioctls.

[CORE DESIGN]
The new scrub_fs ioctl will go block group by block group, to scrub the
full fs.

Inside each block group, we use BTRFS_STRIPE_LEN as one scrub unit
(to be enlarged later to improve parallelism for RAID0/10).

Then we read the full BTRFS_STRIPE_LEN bytes from each mirror (if there
is an extent inside the range).

The read bios will be submitted to all devices at once, so we can still
take advantage of parallel IO.

But the verification part still only happens inside the scrub thread;
there is no parallel csum checking.

Also this ioctl family relies on a much larger progress structure; it
is padded to 256 bytes, with parity-specific error reporting (not yet
implemented though).
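
As a rough sketch (the helper names here are illustrative, not the
exact functions in this series), the overall flow is:

	scrub_fs(start, end):
	    for each block group in [start, end):
	        mark the block group RO
	        for each BTRFS_STRIPE_LEN unit containing any extent:
	            submit one read bio per mirror    # all copies at once
	            wait for all bios                 # <= nr_copies in flight
	            verify csums in the scrub thread  # no parallel csum check
	        restore the block group and move to the next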

[THE GOOD]
- Every stripe will be iterated at most once
  No double read for data stripes.

- Better error reports for parity mismatch

- No need for complex bio form shaping
  Since we already submit read bios in BTRFS_STRIPE_LEN units, and wait
  for them to finish, there are at most nr_copies bios in flight.
  (For the later RAID0/10 optimization, it will be nr_stripes.)

  This behavior naturally reduces IOPS usage, thus there is no need to
  do any bio form shaping.

  This greatly reduces the code size; just check how much code is spent
  on bio form shaping in the old scrub code.

- Less block groups marked for read-only
  Now there is at most one block group marked read-only for scrub,
  reducing the possibility of ENOSPC during scrub.

- Can check NODATASUM data between different copies
  Now we can compare NODATASUM data between different copies to
  determine whether they match (see the sketch below).
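
  The idea boils down to the following sketch (sector_buf() here is a
  hypothetical accessor for a sector's in-memory content, not a helper
  in this series):

	/* Sketch: cross-check a NODATASUM sector against all other mirrors. */
	static bool nocsum_sector_consistent(struct scrub_fs_ctx *sfctx,
					     int sector_nr)
	{
		/* Content of this sector in the first mirror. */
		void *first = sector_buf(sfctx, sector_nr, 0);
		int mirror_nr;

		for (mirror_nr = 1; mirror_nr < sfctx->nr_copies; mirror_nr++) {
			if (memcmp(first,
				   sector_buf(sfctx, sector_nr, mirror_nr),
				   sfctx->fs_info->sectorsize))
				/* Mismatch, accounted as uncertain. */
				return false;
		}
		return true;
	}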

[THE BAD]
- Slower for SINGLE profile
  If someone is using the SINGLE profile on multiple devices, scrub_fs
  will be slower.

  Dev 1:   | SINGLE BG 1 |
  Dev 2:   | SINGLE BG 2 |
  Dev 3:   | SINGLE BG 3 |

  The existing scrub code will scrub SINGLE BG 1~3 at the same time,
  but the new scrub_fs will scrub SINGLE BG 1 first, then 2, then 3,
  causing a much slower scrub in such a case.

  Although I'd argue that, for the above case, the user should go RAID0 anyway.

[THE UGLY]
Since this is just a proof-of-concept patchset, it lacks the following
functionality/optimizations:

- Slower RAID0/RAID10 scrub.
  Since we only scrub one BTRFS_STRIPE_LEN unit at a time, it will not
  utilize all the devices of RAID0/10.
  Although this can be easily enhanced by enlarging the scrub unit to a
  full stripe.

- No RAID56 support
  Ironically.

- Very basic btrfs-progs support
  It really only calls the ioctl and prints the output.
  No background scrub or scrub status file support.

- No drop-in full fstests run yet


Qu Wenruo (9):
  btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  btrfs: scrub: introduce place holder for btrfs_scrub_fs()
  btrfs: scrub: introduce a place holder helper scrub_fs_iterate_bgs()
  btrfs: scrub: introduce place holder helper scrub_fs_block_group()
  btrfs: scrub: add helpers to fulfill csum/extent_generation
  btrfs: scrub: submit and wait for the read of each copy
  btrfs: scrub: implement metadata verification code for scrub_fs
  btrfs: scrub: implement data verification code for scrub_fs
  btrfs: scrub: implement recoverable sectors report for scrub_fs

 fs/btrfs/ctree.h           |    4 +
 fs/btrfs/disk-io.c         |   83 ++-
 fs/btrfs/disk-io.h         |    2 +
 fs/btrfs/ioctl.c           |   45 ++
 fs/btrfs/scrub.c           | 1424 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h |  173 +++++
 6 files changed, 1705 insertions(+), 26 deletions(-)

-- 
2.37.3



* [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  2022-09-03  8:19 [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs Qu Wenruo
@ 2022-09-03  8:19 ` Qu Wenruo
  2022-09-03  9:25   ` Wang Yugui
                     ` (2 more replies)
  2022-09-03  8:19 ` [PATCH PoC 2/9] btrfs: scrub: introduce place holder for btrfs_scrub_fs() Qu Wenruo
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  8:19 UTC (permalink / raw)
  To: linux-btrfs

The new ioctls are to address the disadvantages of the existing
btrfs_scrub_dev():

a One thread per device
  This can cause multiple block groups to be marked read-only for scrub,
  reducing available space temporarily.

  This also causes higher CPU/IO usage.
  For scrub, we should use the minimal amount of CPU and cause as
  little IO as possible.

b Extra IO for RAID56
  For data stripes, we will cause at least 2x IO if we run "btrfs scrub
  start <mnt>":
  1x from scrubbing the device of the data stripe.
  The other 1x from scrubbing the parity stripe.

  This duplicated IO should definitely be avoided.

c Bad progress report for RAID56
  We cannot report any repaired P/Q bytes at all.

Points a and b will be addressed by the new one-thread-per-fs
btrfs_scrub_fs ioctl.

Point c will be addressed by the new btrfs_scrub_fs_progress structure,
which has better comments and classification for all errors.

This patch is only a skeleton for the new family of ioctls; they will
return -EOPNOTSUPP for now.
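
For reference, a minimal userspace sketch of how the new ioctl is meant
to be invoked (assuming a kernel with this series applied and a uapi
header exporting the new definitions; with only this patch applied the
call will return -EOPNOTSUPP):

	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/btrfs.h>

	static int scrub_fs_whole(const char *mnt)
	{
		struct btrfs_ioctl_scrub_fs_args args;
		int fd = open(mnt, O_RDONLY);
		int ret;

		if (fd < 0)
			return -1;
		memset(&args, 0, sizeof(args));
		args.start = 0;
		args.end = (__u64)-1;	/* Inclusive end of the logical space. */
		args.flags = BTRFS_SCRUB_FS_FLAG_READONLY; /* No repair. */
		ret = ioctl(fd, BTRFS_IOC_SCRUB_FS, &args);
		close(fd);
		if (ret < 0)
			return ret;
		printf("data scrubbed: %llu bytes\n",
		       (unsigned long long)args.progress.data_scrubbed);
		return 0;
	}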

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c           |   6 ++
 include/uapi/linux/btrfs.h | 173 +++++++++++++++++++++++++++++++++++++
 2 files changed, 179 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fe0cc816b4eb..3df3bcdf06eb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5508,6 +5508,12 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_scrub_cancel(fs_info);
 	case BTRFS_IOC_SCRUB_PROGRESS:
 		return btrfs_ioctl_scrub_progress(fs_info, argp);
+	case BTRFS_IOC_SCRUB_FS:
+		return -EOPNOTSUPP;
+	case BTRFS_IOC_SCRUB_FS_CANCEL:
+		return -EOPNOTSUPP;
+	case BTRFS_IOC_SCRUB_FS_PROGRESS:
+		return -EOPNOTSUPP;
 	case BTRFS_IOC_BALANCE_V2:
 		return btrfs_ioctl_balance(file, argp);
 	case BTRFS_IOC_BALANCE_CTL:
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 7ada84e4a3ed..795ed33843ce 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -191,6 +191,174 @@ struct btrfs_ioctl_scrub_args {
 	__u64 unused[(1024-32-sizeof(struct btrfs_scrub_progress))/8];
 };
 
+struct btrfs_scrub_fs_progress {
+	/*
+	 * Fatal errors, including -ENOMEM, or csum/extent tree search errors.
+	 *
+	 * Normally after hitting such fatal errors, we error out, thus later
+	 * accounting will no longer be reliable.
+	 */
+	__u16	nr_fatal_errors;
+
+	/*
+	 * All super block errors, from invalid members to IO errors, go
+	 * into nr_super_errors.
+	 */
+	__u16	nr_super_errors;
+
+	/* Super block accounting. */
+	__u16	nr_super_scrubbed;
+	__u16	nr_super_repaired;
+
+	/*
+	 * Data accounting in bytes.
+	 *
+	 * We only care about how many bytes we scrubbed, thus no
+	 * accounting for number of extents.
+	 *
+	 * This accounting includes the extra mirrors.
+	 * E.g. for RAID1, one 16KiB extent will cause 32KiB in @data_scrubbed.
+	 */
+	__u64	data_scrubbed;
+
+	/* How many bytes can be recovered. */
+	__u64	data_recoverable;
+
+	/*
+	 * How many bytes are uncertain; this can only happen for NODATASUM
+	 * cases.
+	 * This includes NODATASUM sectors with no extra mirror/parity to
+	 * verify, or with extra mirrors which mismatch each other.
+	 */
+	__u64	data_nocsum_uncertain;
+
+	/*
+	 * For data error bytes, these mean determined errors, including:
+	 *
+	 * - IO failure, including missing dev.
+	 * - Data csum mismatch
+	 *   (Csum tree search failures must go into the fatal errors above.)
+	 */
+	__u64	data_io_fail;
+	__u64	data_csum_mismatch;
+
+	/*
+	 * All the unmentioned cases, including data matching its csum (which
+	 * implies the IO succeeded) and data with no csum but matching all
+	 * other copies/parities, are the expected cases; no need to record.
+	 */
+
+	/*
+	 * Metadata accounting in bytes, pretty much the same as data.
+	 *
+	 * And since metadata has mandatory csum, there is no uncertain case.
+	 */
+	__u64	meta_scrubbed;
+	__u64	meta_recoverable;
+
+	/*
+	 * For meta, the checks are mostly progressive:
+	 *
+	 * - Unable to read
+	 *   @meta_io_fail
+	 *
+	 * - Unable to pass basic sanity checks (e.g. bytenr check)
+	 *   @meta_invalid
+	 *
+	 * - Pass basic sanity checks, but bad csum
+	 *   @meta_bad_csum
+	 *
+	 * - Pass basic checks and csum, but bad transid
+	 *   @meta_bad_transid
+	 *
+	 * - Pass all checks
+	 *   The expected case, no special accounting needed.
+	 */
+	__u64 meta_io_fail;
+	__u64 meta_invalid;
+	__u64 meta_bad_csum;
+	__u64 meta_bad_transid;
+
+	/*
+	 * Parity accounting.
+	 *
+	 * NOTE: unused data sectors (which still contribute to P/Q
+	 * calculation, like the following case) don't contribute to
+	 * any accounting.
+	 *
+	 * Data 1:	|<--- Unused ---->| <<<
+	 * Data 2:	|<- Data extent ->|
+	 * Parity:	|<--- Parity ---->|
+	 */
+	__u64 parity_scrubbed;
+	__u64 parity_recoverable;
+
+	/*
+	 * This happens when there is not enough info to determine if the
+	 * parity is correct, mostly when the vertical stripe is composed
+	 * of *all* NODATASUM sectors.
+	 *
+	 * If there is any sector with a checksum in the vertical stripe,
+	 * the parity itself will no longer be uncertain.
+	 */
+	__u64 parity_uncertain;
+
+	/*
+	 * For parity, the checks are progressive too:
+	 *
+	 * - Unable to read
+	 *   @parity_io_fail
+	 *
+	 * - Mismatch, and a vertical data stripe has csum and
+	 *   the data stripe csum matches
+	 *   @parity_mismatch
+	 *   We want to repair the parity then.
+	 *
+	 * - Mismatch, and a vertical data stripe has csum, but the data
+	 *   csum mismatches and the rebuilt data passes csum.
+	 *   This will go to @data_recoverable or @data_csum_mismatch instead.
+	 *
+	 * - Mismatch, but no vertical data stripe has csum
+	 *   @parity_uncertain
+	 *
+	 */
+	__u64 parity_io_fail;
+	__u64 parity_mismatch;
+
+	/* Padding to 256 bytes, and for later expansion. */
+	__u64 __unused[15];
+};
+static_assert(sizeof(struct btrfs_scrub_fs_progress) == 256);
+
+/*
+ * Readonly scrub fs will not try any repair (thus the *_repaired members
+ * in scrub_fs_progress should always be 0).
+ */
+#define BTRFS_SCRUB_FS_FLAG_READONLY	(1ULL << 0)
+
+/*
+ * All supported flags.
+ *
+ * From the very beginning, the scrub_fs ioctl rejects any unsupported
+ * flags, making later expansion much simpler.
+ */
+#define BTRFS_SCRUB_FS_FLAG_SUPP	(BTRFS_SCRUB_FS_FLAG_READONLY)
+
+struct btrfs_ioctl_scrub_fs_args {
+	/* Input, logical bytenr to start the scrub */
+	__u64 start;
+
+	/* Input, the logical bytenr end (inclusive) */
+	__u64 end;
+
+	__u64 flags;
+	__u64 reserved[8];
+	struct btrfs_scrub_fs_progress progress; /* out */
+
+	/* pad to 1K */
+	__u8 unused[1024 - 24 - 64 - sizeof(struct btrfs_scrub_fs_progress)];
+};
+
 #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS	0
 #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID	1
 struct btrfs_ioctl_dev_replace_start_params {
@@ -1137,5 +1305,10 @@ enum btrfs_err_code {
 				    struct btrfs_ioctl_encoded_io_args)
 #define BTRFS_IOC_ENCODED_WRITE _IOW(BTRFS_IOCTL_MAGIC, 64, \
 				     struct btrfs_ioctl_encoded_io_args)
+#define BTRFS_IOC_SCRUB_FS	_IOWR(BTRFS_IOCTL_MAGIC, 65, \
+				      struct btrfs_ioctl_scrub_fs_args)
+#define BTRFS_IOC_SCRUB_FS_CANCEL _IO(BTRFS_IOCTL_MAGIC, 66)
+#define BTRFS_IOC_SCRUB_FS_PROGRESS _IOWR(BTRFS_IOCTL_MAGIC, 67, \
+				       struct btrfs_ioctl_scrub_fs_args)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
-- 
2.37.3



* [PATCH PoC 2/9] btrfs: scrub: introduce place holder for btrfs_scrub_fs()
  2022-09-03  8:19 [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls Qu Wenruo
@ 2022-09-03  8:19 ` Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 3/9] btrfs: scrub: introduce a place holder helper scrub_fs_iterate_bgs() Qu Wenruo
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  8:19 UTC (permalink / raw)
  To: linux-btrfs

The new function btrfs_scrub_fs() will do the exclusion checks against
regular scrub and dev-replace, then return -EOPNOTSUPP as a place
holder.

Also, to let regular scrub/dev-replace be exclusive against
btrfs_scrub_fs(), introduce a btrfs_fs_info::scrub_fs_running member.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h |   4 ++
 fs/btrfs/ioctl.c |  41 +++++++++++++++++-
 fs/btrfs/scrub.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 149 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3dc30f5e6fd0..0b360d9ec2e0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -955,6 +955,7 @@ struct btrfs_fs_info {
 	/* private scrub information */
 	struct mutex scrub_lock;
 	atomic_t scrubs_running;
+	atomic_t scrub_fs_running;
 	atomic_t scrub_pause_req;
 	atomic_t scrubs_paused;
 	atomic_t scrub_cancel_req;
@@ -4063,6 +4064,9 @@ int btrfs_should_ignore_reloc_root(struct btrfs_root *root);
 int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		    u64 end, struct btrfs_scrub_progress *progress,
 		    int readonly, int is_dev_replace);
+int btrfs_scrub_fs(struct btrfs_fs_info *fs_info, u64 start, u64 end,
+		   struct btrfs_scrub_fs_progress *progress,
+		   bool readonly);
 void btrfs_scrub_pause(struct btrfs_fs_info *fs_info);
 void btrfs_scrub_continue(struct btrfs_fs_info *fs_info);
 int btrfs_scrub_cancel(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 3df3bcdf06eb..8219e2554734 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4112,6 +4112,45 @@ static long btrfs_ioctl_scrub_progress(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static long btrfs_ioctl_scrub_fs(struct file *file, void __user *arg)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(file_inode(file)->i_sb);
+	struct btrfs_ioctl_scrub_fs_args *sfsa;
+	bool readonly = false;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	sfsa = memdup_user(arg, sizeof(*sfsa));
+	if (IS_ERR(sfsa))
+		return PTR_ERR(sfsa);
+
+	if (sfsa->flags & ~BTRFS_SCRUB_FS_FLAG_SUPP) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+	if (sfsa->flags & BTRFS_SCRUB_FS_FLAG_READONLY)
+		readonly = true;
+
+	if (!readonly) {
+		ret = mnt_want_write_file(file);
+		if (ret)
+			goto out;
+	}
+
+	ret = btrfs_scrub_fs(fs_info, sfsa->start, sfsa->end, &sfsa->progress,
+			     readonly);
+	if (copy_to_user(arg, sfsa, sizeof(*sfsa)))
+		ret = -EFAULT;
+
+	if (!readonly)
+		mnt_drop_write_file(file);
+out:
+	kfree(sfsa);
+	return ret;
+}
+
 static long btrfs_ioctl_get_dev_stats(struct btrfs_fs_info *fs_info,
 				      void __user *arg)
 {
@@ -5509,7 +5548,7 @@ long btrfs_ioctl(struct file *file, unsigned int
 	case BTRFS_IOC_SCRUB_PROGRESS:
 		return btrfs_ioctl_scrub_progress(fs_info, argp);
 	case BTRFS_IOC_SCRUB_FS:
-		return -EOPNOTSUPP;
+		return btrfs_ioctl_scrub_fs(file, argp);
 	case BTRFS_IOC_SCRUB_FS_CANCEL:
 		return -EOPNOTSUPP;
 	case BTRFS_IOC_SCRUB_FS_PROGRESS:
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 755273b77a3f..09a1ab6ac54e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4295,6 +4295,15 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	}
 
 	mutex_lock(&fs_info->scrub_lock);
+
+	/* Conflict with scrub_fs ioctls. */
+	if (atomic_read(&fs_info->scrub_fs_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		ret = -EINPROGRESS;
+		goto out;
+	}
+
 	if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
 	    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &dev->dev_state)) {
 		mutex_unlock(&fs_info->scrub_lock);
@@ -4416,6 +4425,102 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	return ret;
 }
 
+/*
+ * Unlike btrfs_scrub_dev(), this function works completely at the logical
+ * bytenr level, and has the following advantages:
+ *
+ * - Better error reporting
+ *   The new btrfs_scrub_fs_progress has better classified errors, and more
+ *   members to cover parity errors.
+ *
+ * - Always scrub one block group at a time
+ *   btrfs_scrub_dev() works by starting one scrub for each device.
+ *   This can cause unsynchronized progress, and mark multiple block groups
+ *   RO, reducing the available space unnecessarily.
+ *
+ * - Less IO for RAID56
+ *   Instead of treating RAID56 data and P/Q stripes differently, here we
+ *   scrub a full stripe at most once, avoiding the 2x read for data
+ *   stripes (one read from scrubbing the data stripe itself, the other
+ *   from scrubbing the P/Q stripe).
+ *
+ * - No bio form shaping and streamlined code
+ *   Always submit bio for all involved mirrors (or data/p/q stripes for
+ *   RAID56), wait for the IO, then run the check.
+ *
+ *   Thus there are at most nr_mirrors (nr_stripes for RAID56) bios on-the-fly,
+ *   and for each device, there is always at most one bio for scrub.
+ *
+ *   This would greatly simplify all involved code.
+ *
+ * - No need to support dev-replace
+ *   Thus we can have simpler code.
+ *
+ * Unfortunately this ioctl has the following disadvantages so far:
+ *
+ * - No resume after unmount
+ *   We may need an extra on-disk format to save the progress.
+ *   Thus we may need a new RO compat flag for the resume ability.
+ *
+ * - Conflicts with dev-replace/scrub
+ *
+ * - Needs kernel support.
+ *
+ * - Not fully finished
+ */
+int btrfs_scrub_fs(struct btrfs_fs_info *fs_info, u64 start, u64 end,
+		   struct btrfs_scrub_fs_progress *progress,
+		   bool readonly)
+{
+	int ret;
+
+	if (btrfs_fs_closing(fs_info))
+		return -EAGAIN;
+
+	if (btrfs_is_zoned(fs_info))
+		return -EOPNOTSUPP;
+
+	/*
+	 * Metadata and data units should be able to be contained inside one
+	 * stripe.
+	 */
+	ASSERT(fs_info->nodesize <= BTRFS_STRIPE_LEN);
+	ASSERT(fs_info->sectorsize <= BTRFS_STRIPE_LEN);
+
+	mutex_lock(&fs_info->scrub_lock);
+	/* This function conflicts with scrub/dev-replace. */
+	if (atomic_read(&fs_info->scrubs_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		return -EINPROGRESS;
+	}
+
+	/* And there can only be one running btrfs_scrub_fs(). */
+	if (atomic_read(&fs_info->scrub_fs_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		return -EINPROGRESS;
+	}
+
+	__scrub_blocked_if_needed(fs_info);
+	atomic_inc(&fs_info->scrub_fs_running);
+
+	/* This is to allow existing scrub pause to be reused. */
+	atomic_inc(&fs_info->scrubs_running);
+	btrfs_info(fs_info, "scrub_fs: started");
+	mutex_unlock(&fs_info->scrub_lock);
+
+	/* Place holder for real workload. */
+	ret = -EOPNOTSUPP;
+
+	mutex_lock(&fs_info->scrub_lock);
+	atomic_dec(&fs_info->scrubs_running);
+	atomic_dec(&fs_info->scrub_fs_running);
+	btrfs_info(fs_info, "scrub_fs: finished with status: %d", ret);
+	mutex_unlock(&fs_info->scrub_lock);
+	wake_up(&fs_info->scrub_pause_wait);
+
+	return ret;
+}
+
 void btrfs_scrub_pause(struct btrfs_fs_info *fs_info)
 {
 	mutex_lock(&fs_info->scrub_lock);
-- 
2.37.3



* [PATCH PoC 3/9] btrfs: scrub: introduce a place holder helper scrub_fs_iterate_bgs()
  2022-09-03  8:19 [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 2/9] btrfs: scrub: introduce place holder for btrfs_scrub_fs() Qu Wenruo
@ 2022-09-03  8:19 ` Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 4/9] btrfs: scrub: introduce place holder helper scrub_fs_block_group() Qu Wenruo
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  8:19 UTC (permalink / raw)
  To: linux-btrfs

This new helper is mostly the same as scrub_enumerate_chunks(), but with
some small changes:

- No need for dev-replace branches

- No need to search dev-extent tree
  We can directly iterate the block groups.

The new helper currently only iterates all the bgs, doing nothing for
each iterated bg.

Also one smaller helper is introduced:

- scrub_fs_alloc_ctx()
  To allocate a scrub_fs_ctx, which has way fewer members (for now and
  for the future) compared to scrub_ctx.

  The scrub_fs_ctx has a well-defined lifespan (only inside
  btrfs_scrub_fs()), and there can only be one scrub_fs_ctx, thus it
  does not need to be refcounted.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 164 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 162 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 09a1ab6ac54e..cf4dc384427e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -198,6 +198,24 @@ struct scrub_ctx {
 	refcount_t              refs;
 };
 
+/* This structure should only have a lifespan inside btrfs_scrub_fs(). */
+struct scrub_fs_ctx {
+	struct btrfs_fs_info		*fs_info;
+
+	/* Current block group we're scrubbing. */
+	struct btrfs_block_group	*cur_bg;
+
+	/* Current logical bytenr being scrubbed. */
+	u64				cur_logical;
+
+	atomic_t			sectors_under_io;
+
+	bool				readonly;
+
+	/* There will be one and only one thread touching @stat. */
+	struct btrfs_scrub_fs_progress	stat;
+};
+
 struct scrub_warning {
 	struct btrfs_path	*path;
 	u64			extent_item_size;
@@ -4425,6 +4443,126 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	return ret;
 }
 
+static struct scrub_fs_ctx *scrub_fs_alloc_ctx(struct btrfs_fs_info *fs_info,
+					       bool readonly)
+{
+	struct scrub_fs_ctx *sfctx;
+	int ret;
+
+	sfctx = kzalloc(sizeof(*sfctx), GFP_KERNEL);
+	if (!sfctx) {
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	sfctx->fs_info = fs_info;
+	sfctx->readonly = readonly;
+	atomic_set(&sfctx->sectors_under_io, 0);
+	return sfctx;
+error:
+	kfree(sfctx);
+	return ERR_PTR(ret);
+}
+
+static int scrub_fs_iterate_bgs(struct scrub_fs_ctx *sfctx, u64 start, u64 end)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	u64 cur = start;
+	int ret;
+
+	while (cur < end) {
+		struct btrfs_block_group *bg;
+		bool ro_set = false;
+
+		bg = btrfs_lookup_first_block_group(fs_info, cur);
+		if (!bg)
+			break;
+		if (bg->start + bg->length >= end) {
+			btrfs_put_block_group(bg);
+			break;
+		}
+		spin_lock(&bg->lock);
+
+		/* Already deleted bg, skip to the next one. */
+		if (test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags)) {
+			spin_unlock(&bg->lock);
+			cur = bg->start + bg->length;
+			btrfs_put_block_group(bg);
+			continue;
+		}
+		btrfs_freeze_block_group(bg);
+		spin_unlock(&bg->lock);
+
+		/*
+		 * We need to call btrfs_inc_block_group_ro() with scrub paused,
+		 * to avoid deadlock caused by:
+		 * btrfs_inc_block_group_ro()
+		 * -> btrfs_wait_for_commit()
+		 * -> btrfs_commit_transaction()
+		 * -> btrfs_scrub_pause()
+		 */
+		scrub_pause_on(fs_info);
+
+		/*
+		 * Check the comments before btrfs_inc_block_group_ro() inside
+		 * scrub_enumerate_chunks() for reasons.
+		 */
+		ret = btrfs_inc_block_group_ro(bg, false);
+		if (ret == 0)
+			ro_set = true;
+		if (ret == -ETXTBSY) {
+			btrfs_warn(fs_info,
+		   "skipping scrub of block group %llu due to active swapfile",
+				   bg->start);
+			scrub_pause_off(fs_info);
+			ret = 0;
+			goto next;
+		}
+		if (ret < 0 && ret != -ENOSPC) {
+			btrfs_warn(fs_info,
+				   "failed setting block group ro: %d", ret);
+			scrub_pause_off(fs_info);
+			goto next;
+		}
+
+		scrub_pause_off(fs_info);
+
+		/* Place holder for the real chunk scrubbing code. */
+		ret = 0;
+
+		if (ro_set)
+			btrfs_dec_block_group_ro(bg);
+
+		/*
+		 * We might have prevented the cleaner kthread from deleting
+		 * this block group if it was already unused because we raced
+		 * and set it to RO mode first. So add it back to the unused
+		 * list, otherwise it might not ever be deleted unless a manual
+		 * balance is triggered or it becomes used and unused again.
+		 */
+		spin_lock(&bg->lock);
+		if (!test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags) &&
+		    !bg->ro && bg->reserved == 0 && bg->used == 0) {
+			spin_unlock(&bg->lock);
+			if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+				btrfs_discard_queue_work(&fs_info->discard_ctl,
+							 bg);
+			else
+				btrfs_mark_bg_unused(bg);
+		} else {
+			spin_unlock(&bg->lock);
+		}
+next:
+		cur = bg->start + bg->length;
+
+		btrfs_unfreeze_block_group(bg);
+		btrfs_put_block_group(bg);
+		if (ret)
+			break;
+	}
+	return ret;
+}
+
 /*
  * Unlike btrfs_scrub_dev(), this function works completely in logical bytenr
  * level, and has the following advantage:
@@ -4472,6 +4610,8 @@ int btrfs_scrub_fs(struct btrfs_fs_info *fs_info, u64 start, u64 end,
 		   struct btrfs_scrub_fs_progress *progress,
 		   bool readonly)
 {
+	struct scrub_fs_ctx *sfctx;
+	unsigned int nofs_flag;
 	int ret;
 
 	if (btrfs_fs_closing(fs_info))
@@ -4508,8 +4648,25 @@ int btrfs_scrub_fs(struct btrfs_fs_info *fs_info, u64 start, u64 end,
 	btrfs_info(fs_info, "scrub_fs: started");
 	mutex_unlock(&fs_info->scrub_lock);
 
-	/* Place holder for real workload. */
-	ret = -EOPNOTSUPP;
+	sfctx = scrub_fs_alloc_ctx(fs_info, readonly);
+	if (IS_ERR(sfctx)) {
+		ret = PTR_ERR(sfctx);
+		sfctx = NULL;
+		goto out;
+	}
+
+	if (progress)
+		memcpy(&sfctx->stat, progress, sizeof(*progress));
+
+	/*
+	 * Check the comments before memalloc_nofs_save() in btrfs_scrub_dev()
+	 * for reasons.
+	 */
+	nofs_flag = memalloc_nofs_save();
+	ret = scrub_fs_iterate_bgs(sfctx, start, end);
+	memalloc_nofs_restore(nofs_flag);
+out:
 
 	mutex_lock(&fs_info->scrub_lock);
 	atomic_dec(&fs_info->scrubs_running);
@@ -4518,6 +4675,9 @@ int btrfs_scrub_fs(struct btrfs_fs_info *fs_info, u64 start, u64 end,
 	mutex_unlock(&fs_info->scrub_lock);
 	wake_up(&fs_info->scrub_pause_wait);
 
+	if (progress && sfctx)
+		memcpy(progress, &sfctx->stat, sizeof(*progress));
+	kfree(sfctx);
+
 	return ret;
 }
 
-- 
2.37.3



* [PATCH PoC 4/9] btrfs: scrub: introduce place holder helper scrub_fs_block_group()
  2022-09-03  8:19 [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-09-03  8:19 ` [PATCH PoC 3/9] btrfs: scrub: introduce a place holder helper scrub_fs_iterate_bgs() Qu Wenruo
@ 2022-09-03  8:19 ` Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 5/9] btrfs: scrub: add helpers to fulfill csum/extent_generation Qu Wenruo
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  8:19 UTC (permalink / raw)
  To: linux-btrfs

The main place holder helper scrub_fs_block_group() will:

- Initialize various needed members inside scrub_fs_ctx
  This includes:
  * Calculate the nr_copies for non-RAID56 profiles, or grab nr_stripes
    for RAID56 profiles.
  * Allocate memory for the sectors/pages arrays, and csum_buf if it's
    a data bg.
  * Initialize all sectors to type UNUSED.

  All the above memory stays across all the stripes we run, thus we
  only need to allocate it once per bg (see the worked example below).

- Iterate stripes containing any used sector
  This is the code to be implemented.

- Clean up the above memory before we finish the block group.

The real work of scrubbing a stripe is not yet implemented.
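
As a worked example of the per-bg allocation sizes (assuming 4K page
size, 4K sectorsize, CRC32C csums and a RAID1 data bg, so
nr_copies == 2):

	sectors_per_stripe = BTRFS_STRIPE_LEN / sectorsize   = 64K / 4K = 16
	total_sectors      = sectors_per_stripe * nr_copies  = 16 * 2   = 32
	nr_pages           = (BTRFS_STRIPE_LEN / PAGE_SIZE) * nr_copies = 32
	csum_buf size      = csum_size * sectors_per_stripe  = 4 * 16   = 64 bytes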

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 233 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index cf4dc384427e..46885f966bb3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -198,6 +198,45 @@ struct scrub_ctx {
 	refcount_t              refs;
 };
 
+#define SCRUB_FS_SECTOR_FLAG_UNUSED		(1 << 0)
+#define SCRUB_FS_SECTOR_FLAG_DATA		(1 << 1)
+#define SCRUB_FS_SECTOR_FLAG_META		(1 << 2)
+#define SCRUB_FS_SECTOR_FLAG_PARITY		(1 << 3)
+
+/*
+ * Represent a sector.
+ *
+ * To access the content of a sector, the caller should have the index inside
+ * the scrub_fs_ctx->sectors[] array, and use that index to calculate the page
+ * and page offset inside the scrub_fs_ctx->pages[] array.
+ *
+ * To get the logical/physical bytenr of a sector, the caller should use
+ * scrub_fs_ctx->bioc and the sector index to calculate the logical/physical
+ * bytenr.
+ */
+struct scrub_fs_sector {
+	unsigned int flags;
+	union {
+		/*
+		 * For SCRUB_FS_SECTOR_FLAG_DATA, either it points to some byte
+		 * inside scrub_fs_ctx->csum_buf, or it's NULL for the NODATASUM
+		 * case.
+		 */
+		u8 *csum;
+
+		/*
+		 * For SCRUB_FS_SECTOR_FLAG_META, this records the generation
+		 * and the logical bytenr of the tree block.
+		 * (So we can grab the first sector to calculate its inline
+		 * csum.)
+		 */
+		struct {
+			u64 eb_logical;
+			u64 eb_generation;
+		};
+	};
+};
+
 /* This structure should only has a lifespan inside btrfs_scrub_fs(). */
 struct scrub_fs_ctx {
 	struct btrfs_fs_info		*fs_info;
@@ -208,12 +247,64 @@ struct scrub_fs_ctx {
 	/* Current logical bytenr being scrubbed. */
 	u64				cur_logical;
 
+
 	atomic_t			sectors_under_io;
 
 	bool				readonly;
 
 	/* There will and only be one thread touching @stat. */
 	struct btrfs_scrub_fs_progress	stat;
+
+	/*
+	 * How many sectors we read per stripe.
+	 *
+	 * For now, it's fixed to BTRFS_STRIPE_LEN / sectorsize.
+	 *
+	 * This can be enlarged to full stripe size / sectorsize
+	 * for later RAID0/10/5/6 code.
+	 */
+	int				sectors_per_stripe;
+	/*
+	 * For non-RAID56 profiles, we only care about how many copies the
+	 * block group has.
+	 * For RAID56 profiles, we care about how many stripes the block
+	 * group has (including data and parities).
+	 */
+	union {
+		int			nr_stripes;
+		int			nr_copies;
+	};
+
+	/*
+	 * The total number of sectors we scrub in one run (including
+	 * the extra mirrors/parities).
+	 *
+	 * For non-RAID56 profiles, it would be:
+	 *   nr_copies * (BTRFS_STRIPE_LEN / sectorsize).
+	 *
+	 * For RAID56 profiles, it would be:
+	 *   nr_stripes * (BTRFS_STRIPE_LEN / sectorsize).
+	 */
+	int				total_sectors;
+
+	/* Page array for above total_sectors. */
+	struct page			**pages;
+
+	/*
+	 * Sector array for the above total_sectors. The page contents are
+	 * inside the above pages array.
+	 *
+	 * Both arrays should be initialized when starting to scrub a block group.
+	 */
+	struct scrub_fs_sector		*sectors;
+
+	/*
+	 * Csum buffer allocated for the stripe.
+	 *
+	 * All sectors in different mirrors for the same logical bytenr
+	 * would point to the same location inside the buffer.
+	 */
+	u8				*csum_buf;
 };
 
 struct scrub_warning {
@@ -4464,6 +4555,147 @@ static struct scrub_fs_ctx *scrub_fs_alloc_ctx(struct btrfs_fs_info *fs_info,
 	return ERR_PTR(ret);
 }
 
+/*
+ * Clean up the memory allocations, mostly after finishing a bg, or for the
+ * error path.
+ */
+static void scrub_fs_cleanup_for_bg(struct scrub_fs_ctx *sfctx)
+{
+	int i;
+	const int nr_pages = sfctx->nr_copies * (BTRFS_STRIPE_LEN >> PAGE_SHIFT);
+
+	if (sfctx->pages) {
+		for (i = 0; i < nr_pages; i++) {
+			if (sfctx->pages[i]) {
+				__free_page(sfctx->pages[i]);
+				sfctx->pages[i] = NULL;
+			}
+		}
+	}
+	kfree(sfctx->pages);
+	sfctx->pages = NULL;
+
+	kfree(sfctx->sectors);
+	sfctx->sectors = NULL;
+
+	kfree(sfctx->csum_buf);
+	sfctx->csum_buf = NULL;
+
+	/* NOTE: block group will only be put inside scrub_fs_iterate_bgs(). */
+	sfctx->cur_bg = NULL;
+}
+
+/* Do the block group specific initialization. */
+static int scrub_fs_init_for_bg(struct scrub_fs_ctx *sfctx,
+				struct btrfs_block_group *bg)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	struct extent_map_tree *map_tree = &fs_info->mapping_tree;
+	struct extent_map *em;
+	bool is_raid56 = !!(bg->flags & BTRFS_BLOCK_GROUP_RAID56_MASK);
+	int ret = 0;
+	int nr_pages;
+	int i;
+
+	/*
+	 * One stripe should be page aligned, i.e. PAGE_SIZE should not be
+	 * larger than 64K.
+	 */
+	ASSERT(IS_ALIGNED(BTRFS_STRIPE_LEN, PAGE_SIZE));
+
+	/* The last run should have cleaned up all the memory. */
+	ASSERT(!sfctx->cur_bg);
+	ASSERT(!sfctx->pages);
+	ASSERT(!sfctx->sectors);
+	ASSERT(!sfctx->csum_buf);
+
+	read_lock(&map_tree->lock);
+	em = lookup_extent_mapping(map_tree, bg->start, bg->length);
+	read_unlock(&map_tree->lock);
+
+	/*
+	 * Might have been an unused block group deleted by the cleaner
+	 * kthread or relocation.
+	 */
+	if (!em) {
+		spin_lock(&bg->lock);
+		if (!test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags))
+			ret = -EINVAL;
+		spin_unlock(&bg->lock);
+		return ret;
+	}
+	/*
+	 * Since we're guaranteed to run without any other
+	 * dev-replace/scrub running, the num_stripes should be the total
+	 * number of stripes, without the replace target device.
+	 */
+	if (is_raid56)
+		sfctx->nr_stripes = em->map_lookup->num_stripes;
+	free_extent_map(em);
+
+	if (!is_raid56)
+		sfctx->nr_copies = btrfs_num_copies(fs_info, bg->start,
+						    fs_info->sectorsize);
+	sfctx->sectors_per_stripe = BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits;
+	sfctx->total_sectors = sfctx->sectors_per_stripe * sfctx->nr_copies;
+
+	nr_pages = (BTRFS_STRIPE_LEN >> PAGE_SHIFT) * sfctx->nr_copies;
+
+	sfctx->pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
+	if (!sfctx->pages)
+		goto enomem;
+
+	for (i = 0; i < nr_pages; i++) {
+		sfctx->pages[i] = alloc_page(GFP_KERNEL);
+		if (!sfctx->pages[i])
+			goto enomem;
+	}
+
+	sfctx->sectors = kcalloc(sfctx->total_sectors,
+				 sizeof(struct scrub_fs_sector), GFP_KERNEL);
+	if (!sfctx->sectors)
+		goto enomem;
+
+	for (i = 0; i < sfctx->total_sectors; i++)
+		sfctx->sectors[i].flags = SCRUB_FS_SECTOR_FLAG_UNUSED;
+
+	if (bg->flags & BTRFS_BLOCK_GROUP_DATA) {
+		sfctx->csum_buf = kzalloc(fs_info->csum_size *
+					  sfctx->sectors_per_stripe, GFP_KERNEL);
+		if (!sfctx->csum_buf)
+			goto enomem;
+	}
+	sfctx->cur_bg = bg;
+	sfctx->cur_logical = bg->start;
+	return 0;
+
+enomem:
+	sfctx->stat.nr_fatal_errors++;
+	scrub_fs_cleanup_for_bg(sfctx);
+	return -ENOMEM;
+}
+
+
+static int scrub_fs_block_group(struct scrub_fs_ctx *sfctx,
+				struct btrfs_block_group *bg)
+{
+	int ret;
+
+	/* Not yet supported, just skip RAID56 bgs for now. */
+	if (bg->flags & BTRFS_BLOCK_GROUP_RAID56_MASK)
+		return 0;
+
+	ret = scrub_fs_init_for_bg(sfctx, bg);
+	if (ret < 0)
+		return ret;
+
+	/* Place holder for the loop iterating the sectors. */
+	ret = 0;
+
+	scrub_fs_cleanup_for_bg(sfctx);
+	return ret;
+}
+
 static int scrub_fs_iterate_bgs(struct scrub_fs_ctx *sfctx, u64 start, u64 end)
 {
 	struct btrfs_fs_info *fs_info = sfctx->fs_info;
@@ -4527,8 +4759,7 @@ static int scrub_fs_iterate_bgs(struct scrub_fs_ctx *sfctx, u64 start, u64 end)
 
 		scrub_pause_off(fs_info);
 
-		/* Place holder for the real chunk scrubbing code. */
-		ret = 0;
+		ret = scrub_fs_block_group(sfctx, bg);
 
 		if (ro_set)
 			btrfs_dec_block_group_ro(bg);
-- 
2.37.3



* [PATCH PoC 5/9] btrfs: scrub: add helpers to fulfill csum/extent_generation
  2022-09-03  8:19 [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-09-03  8:19 ` [PATCH PoC 4/9] btrfs: scrub: introduce place holder helper scrub_fs_block_group() Qu Wenruo
@ 2022-09-03  8:19 ` Qu Wenruo
  2022-09-03 12:19   ` kernel test robot
  2022-09-03  8:19 ` [PATCH PoC 6/9] btrfs: scrub: submit and wait for the read of each copy Qu Wenruo
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  8:19 UTC (permalink / raw)
  To: linux-btrfs

This patch will introduce two new major helpers:

- scrub_fs_locate_and_fill_stripe()
  This will find a stripe which contains any extent, and then fill the
  corresponding sectors inside the sectors[] array with the extent
  type.

  If it's a metadata extent, it will also fill the eb_generation member.

- scrub_fs_fill_stripe_csum()
  This is for data block groups only.

  This helper will find all csums for the stripe, and copy each csum
  into the corresponding position inside scrub_fs_ctx->csum_buf (the
  indexing is sketched below).
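
The csum_buf layout boils down to one csum slot per sector offset in
the stripe, shared by all mirrors of that sector. A condensed sketch
(not a verbatim helper from this patch):

	static u8 *csum_slot(struct scrub_fs_ctx *sfctx, u64 logical)
	{
		struct btrfs_fs_info *fs_info = sfctx->fs_info;
		/* Sector offset of @logical inside the current stripe. */
		const u32 sector_nr = (logical - sfctx->cur_logical) >>
				      fs_info->sectorsize_bits;

		return sfctx->csum_buf + sector_nr * fs_info->csum_size;
	}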

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 303 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 301 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 46885f966bb3..96daed3f3274 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4534,6 +4534,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	return ret;
 }
 
+static struct scrub_fs_sector *scrub_fs_get_sector(struct scrub_fs_ctx *sfctx,
+						   int sector_nr, int mirror_nr)
+{
+	/* Basic boundary checks. */
+	ASSERT(sector_nr >= 0 && sector_nr < sfctx->sectors_per_stripe);
+	ASSERT(mirror_nr >= 0 && mirror_nr < sfctx->nr_copies);
+
+	return &sfctx->sectors[mirror_nr * sfctx->sectors_per_stripe + sector_nr];
+}
+
 static struct scrub_fs_ctx *scrub_fs_alloc_ctx(struct btrfs_fs_info *fs_info,
 					       bool readonly)
 {
@@ -4675,10 +4685,264 @@ static int scrub_fs_init_for_bg(struct scrub_fs_ctx *sfctx,
 	return -ENOMEM;
 }
 
+static int scrub_fs_fill_sector_types(struct scrub_fs_ctx *sfctx,
+				      u64 stripe_start, u64 extent_start,
+				      u64 extent_len, u64 extent_flags,
+				      u64 extent_gen)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	const u64 stripe_end = stripe_start + (sfctx->sectors_per_stripe <<
+					       fs_info->sectorsize_bits);
+	const u64 real_start = max(stripe_start, extent_start);
+	const u64 real_len = min(stripe_end, extent_start + extent_len) - real_start;
+	bool is_meta = false;
+	u64 cur_logical;
+	int sector_flags;
+
+	if (extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+		sector_flags = SCRUB_FS_SECTOR_FLAG_META;
+		is_meta = true;
+		/* Metadata should never cross the stripe boundary. */
+		if (extent_start != real_start) {
+			btrfs_err(fs_info,
+				"tree block at bytenr %llu crossed stripe boundary",
+				extent_start);
+			return -EUCLEAN;
+		}
+	} else {
+		sector_flags = SCRUB_FS_SECTOR_FLAG_DATA;
+	}
+
+	for (cur_logical = real_start; cur_logical < real_start + real_len;
+	     cur_logical += fs_info->sectorsize) {
+		const int sector_nr = (cur_logical - stripe_start) >>
+				       fs_info->sectorsize_bits;
+		int mirror_nr;
+
+		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
+			struct scrub_fs_sector *sector =
+				scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
+
+			/*
+			 * All sectors in the range should not have been
+			 * initialized yet.
+			 */
+			ASSERT(sector->flags == SCRUB_FS_SECTOR_FLAG_UNUSED);
+			ASSERT(sector->csum == NULL);
+			ASSERT(sector->eb_generation == 0);
+
+			sector->flags = sector_flags;
+			/*
+			 * Here we only populate eb_*; the csum will be filled
+			 * later by a dedicated csum tree search.
+			 */
+			if (is_meta) {
+				sector->eb_logical = extent_start;
+				sector->eb_generation = extent_gen;
+			}
+		}
+	}
+	return 0;
+}
+
+/*
+ * Locate a stripe that has any extent inside it.
+ *
+ * @start:	logical bytenr to start the search. Result stripe should
+ *		be >= @start.
+ * @found_ret:	logical bytenr of the found stripe. Should also be a stripe start
+ *		bytenr.
+ *
+ * Return 0 if we found such a stripe and update @found_ret; furthermore, we
+ * will fill the sfctx->sectors[] array with the needed extent info
+ * (generation for tree blocks, csum for data extents).
+ *
+ * Return <0 if we hit fatal errors.
+ *
+ * Return >0 if there is no more stripe containing any extent after @start.
+ */
+static int scrub_fs_locate_and_fill_stripe(struct scrub_fs_ctx *sfctx, u64 start,
+					   u64 *found_ret)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	struct btrfs_path path = {0};
+	struct btrfs_root *extent_root = btrfs_extent_root(fs_info,
+							   sfctx->cur_bg->start);
+	const u64 bg_start = sfctx->cur_bg->start;
+	const u64 bg_end = bg_start + sfctx->cur_bg->length;
+	const u32 stripe_len = sfctx->sectors_per_stripe << fs_info->sectorsize_bits;
+	u64 cur_logical = start;
+	/*
+	 * The full stripe start we found. If 0, it means we haven't yet found
+	 * any extent.
+	 */
+	u64 stripe_start = 0;
+	u64 extent_start;
+	u64 extent_size;
+	u64 extent_flags;
+	u64 extent_gen;
+	int ret;
+
+	path.search_commit_root = true;
+	path.skip_locking = true;
+
+	/* Initial search to find any extent inside the block group. */
+	ret = find_first_extent_item(extent_root, &path, cur_logical,
+				     bg_end - cur_logical);
+	/* Either error out or no more extent items. */
+	if (ret)
+		goto out;
+
+	get_extent_info(&path, &extent_start, &extent_size, &extent_flags,
+			&extent_gen);
+	/*
+	 * Note here a full stripe for RAID56 may not be a power of 2, thus
+	 * we have to use rounddown(), not round_down().
+	 */
+	stripe_start = rounddown(max(extent_start, cur_logical) - bg_start,
+				 stripe_len) + bg_start;
+	*found_ret = stripe_start;
+
+	scrub_fs_fill_sector_types(sfctx, stripe_start, extent_start,
+				   extent_size, extent_flags, extent_gen);
+
+	cur_logical = min(stripe_start + stripe_len, extent_start + extent_size);
+
+	/* Now iterate all the remaining extents inside the stripe. */
+	while (cur_logical < stripe_start + stripe_len) {
+		ret = find_first_extent_item(extent_root, &path, cur_logical,
+				stripe_start + stripe_len - cur_logical);
+		if (ret)
+			goto out;
+
+		get_extent_info(&path, &extent_start, &extent_size,
+				&extent_flags, &extent_gen);
+		scrub_fs_fill_sector_types(sfctx, stripe_start, extent_start,
+					   extent_size, extent_flags, extent_gen);
+		cur_logical = extent_start + extent_size;
+	}
+out:
+	btrfs_release_path(&path);
+	/*
+	 * Found nothing: the first search either errored out or found no
+	 * extent at all, so just return @ret directly.
+	 */
+	if (!stripe_start)
+		return ret;
+
+	/*
+	 * Now we have hit at least one extent. If ret > 0, it means we
+	 * still need to handle the extents we found; in that case return
+	 * 0, so we will scrub what we found.
+	 */
+	if (ret > 0)
+		ret = 0;
+	return ret;
+}
+
+static void scrub_fs_fill_one_ordered_sum(struct scrub_fs_ctx *sfctx,
+					  struct btrfs_ordered_sum *sum)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	const u64 stripe_start = sfctx->cur_logical;
+	const u32 stripe_len = sfctx->sectors_per_stripe <<
+			       fs_info->sectorsize_bits;
+	u64 cur;
+
+	ASSERT(stripe_start <= sum->bytenr &&
+	       sum->bytenr + sum->len <= stripe_start + stripe_len);
+
+	for (cur = sum->bytenr; cur < sum->bytenr + sum->len;
+	     cur += fs_info->sectorsize) {
+		int sector_nr = (cur - stripe_start) >> fs_info->sectorsize_bits;
+		int mirror_nr;
+		u8 *csum = sum->sums + (((cur - sum->bytenr) >>
+					fs_info->sectorsize_bits) * fs_info->csum_size);
+
+		/* Fill csum_buf first. */
+		memcpy(sfctx->csum_buf + sector_nr * fs_info->csum_size,
+		       csum, fs_info->csum_size);
+
+		/* Make sectors in all mirrors to point to the correct csum. */
+		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
+			struct scrub_fs_sector *sector =
+				scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
+
+			ASSERT(sector->flags & SCRUB_FS_SECTOR_FLAG_DATA);
+			sector->csum = sfctx->csum_buf + sector_nr * fs_info->csum_size;
+		}
+	}
+}
+
+static int scrub_fs_fill_stripe_csum(struct scrub_fs_ctx *sfctx)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	struct btrfs_root *csum_root = btrfs_csum_root(fs_info,
+						       sfctx->cur_bg->start);
+	const u64 stripe_start = sfctx->cur_logical;
+	const u32 stripe_len = sfctx->sectors_per_stripe << fs_info->sectorsize_bits;
+	LIST_HEAD(csum_list);
+	int ret;
+
+	ret = btrfs_lookup_csums_range(csum_root, stripe_start,
+				       stripe_start + stripe_len - 1,
+				       &csum_list, true);
+	if (ret < 0)
+		return ret;
+
+	/* Extract csum_list and fill them into csum_buf. */
+	while (!list_empty(&csum_list)) {
+		struct btrfs_ordered_sum *sum;
+
+		sum = list_first_entry(&csum_list, struct btrfs_ordered_sum,
+				       list);
+		scrub_fs_fill_one_ordered_sum(sfctx, sum);
+		list_del(&sum->list);
+		kfree(sum);
+	}
+	return 0;
+}
+
+/*
+ * Reset the content of pages/csum_buf and the sector types/csums, so
+ * there is no leftover data for the next run.
+ */
+static void scrub_fs_reset_stripe(struct scrub_fs_ctx *sfctx)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	const int nr_pages = (sfctx->total_sectors <<
+			      fs_info->sectorsize_bits) >> PAGE_SHIFT;
+	int i;
+
+	ASSERT(nr_pages);
+
+	/* Zero page content. */
+	for (i = 0; i < nr_pages; i++)
+		memzero_page(sfctx->pages[i], 0, PAGE_SIZE);
+
+	/* Zero csum_buf. */
+	if (sfctx->csum_buf)
+		memset(sfctx->csum_buf, 0, sfctx->sectors_per_stripe *
+		       fs_info->csum_size);
+
+	/* Clear sector types and its csum pointer. */
+	for (i = 0; i < sfctx->total_sectors; i++) {
+		struct scrub_fs_sector *sector = &sfctx->sectors[i];
+
+		sector->flags = SCRUB_FS_SECTOR_FLAG_UNUSED;
+		sector->csum = NULL;
+		sector->eb_generation = 0;
+		sector->eb_logical = 0;
+	}
+}
 
 static int scrub_fs_block_group(struct scrub_fs_ctx *sfctx,
 				struct btrfs_block_group *bg)
 {
+	const struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	bool is_data = bg->flags & BTRFS_BLOCK_GROUP_DATA;
+	u32 stripe_len;
+	u64 cur_logical = bg->start;
 	int ret;
 
 	/* Not yet supported, just skip RAID56 bgs for now. */
@@ -4689,8 +4953,43 @@ static int scrub_fs_block_group(struct scrub_fs_ctx *sfctx,
 	if (ret < 0)
 		return ret;
 
-	/* Place holder for the loop itearting the sectors. */
-	ret = 0;
+	/*
+	 * We can only trust anything inside sfctx after
+	 * scrub_fs_init_for_bg().
+	 */
+	stripe_len = sfctx->sectors_per_stripe << fs_info->sectorsize_bits;
+	ASSERT(stripe_len);
+
+	while (cur_logical < bg->start + bg->length) {
+		u64 stripe_start;
+
+		ret = scrub_fs_locate_and_fill_stripe(sfctx, cur_logical,
+						      &stripe_start);
+		if (ret < 0)
+			break;
+
+		/* No more extent left in the bg, we have finished the bg. */
+		if (ret > 0) {
+			ret = 0;
+			break;
+		}
+
+		sfctx->cur_logical = stripe_start;
+
+		if (is_data) {
+			ret = scrub_fs_fill_stripe_csum(sfctx);
+			if (ret < 0)
+				break;
+		}
+
+		/* Place holder for real stripe scrubbing. */
+		ret = 0;
+
+		/* Reset the stripe for next run. */
+		scrub_fs_reset_stripe(sfctx);
+
+		cur_logical = stripe_start + stripe_len;
+	}
 
 	scrub_fs_cleanup_for_bg(sfctx);
 	return ret;
-- 
2.37.3



* [PATCH PoC 6/9] btrfs: scrub: submit and wait for the read of each copy
  2022-09-03  8:19 [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-09-03  8:19 ` [PATCH PoC 5/9] btrfs: scrub: add helpers to fulfill csum/extent_generation Qu Wenruo
@ 2022-09-03  8:19 ` Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 7/9] btrfs: scrub: implement metadata verification code for scrub_fs Qu Wenruo
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  8:19 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces a helper, scrub_fs_one_stripe().

Currently it's only doing the following work:

- Submit bios for each copy of the 64K stripe
  We don't need to skip any range which doesn't have data/metadata;
  that would only eat up the IOPS performance of the disk.

  At per-stripe initialization time we mark all sectors unused, then
  the extent tree search marks the needed sectors DATA/METADATA.

  So at verification time we can skip those unused sectors.

- Wait for the bios to finish

No csum verification yet.
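
Schematically, the accounting contract that makes the single wait
point sufficient is (condensed from the code below):

	submit: atomic_add(sectors_per_stripe, &sfctx->sectors_under_io)
	        submit_bio(bio)
	endio:  mark each sector of the bio IO_DONE or IO_ERROR
	        atomic_sub(sectors_in_bio, &sfctx->sectors_under_io)
	        wake_up(&sfctx->wait)
	waiter: wait_event(sfctx->wait,
	                   atomic_read(&sfctx->sectors_under_io) == 0)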

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 220 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 218 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 96daed3f3274..70a54c6d8888 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -203,6 +203,11 @@ struct scrub_ctx {
 #define SCRUB_FS_SECTOR_FLAG_META		(1 << 2)
 #define SCRUB_FS_SECTOR_FLAG_PARITY		(1 << 3)
 
+/* This marks if the sector belongs to a missing device. */
+#define SCRUB_FS_SECTOR_FLAG_DEV_MISSING	(1 << 4)
+#define SCRUB_FS_SECTOR_FLAG_IO_ERROR		(1 << 5)
+#define SCRUB_FS_SECTOR_FLAG_IO_DONE		(1 << 6)
+
 /*
  * Represent a sector.
  *
@@ -305,6 +310,8 @@ struct scrub_fs_ctx {
 	 * would point to the same location inside the buffer.
 	 */
 	u8				*csum_buf;
+
+	wait_queue_head_t		wait;
 };
 
 struct scrub_warning {
@@ -4559,6 +4566,7 @@ static struct scrub_fs_ctx *scrub_fs_alloc_ctx(struct btrfs_fs_info *fs_info,
 	sfctx->fs_info = fs_info;
 	sfctx->readonly = readonly;
 	atomic_set(&sfctx->sectors_under_io, 0);
+	init_waitqueue_head(&sfctx->wait);
 	return sfctx;
 error:
 	kfree(sfctx);
@@ -4936,6 +4944,213 @@ static void scrub_fs_reset_stripe(struct scrub_fs_ctx *sfctx)
 	}
 }
 
+static void mark_missing_dev_sectors(struct scrub_fs_ctx *sfctx,
+				     int stripe_nr)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	const int sectors_per_stripe = BTRFS_STRIPE_LEN >>
+				       fs_info->sectorsize_bits;
+	int i;
+
+	for (i = 0; i < sectors_per_stripe; i++) {
+		struct scrub_fs_sector *sector =
+			scrub_fs_get_sector(sfctx, i, stripe_nr);
+
+		sector->flags |= SCRUB_FS_SECTOR_FLAG_DEV_MISSING;
+	}
+}
+
+static struct page *scrub_fs_get_page(struct scrub_fs_ctx *sfctx,
+				      int sector_nr)
+{
+	const int sectors_per_stripe = BTRFS_STRIPE_LEN >>
+				       sfctx->fs_info->sectorsize_bits;
+	int page_index;
+
+	/* Basic checks to make sure we're accessing a valid sector. */
+	ASSERT(sector_nr >= 0 && sector_nr < sfctx->nr_copies * sectors_per_stripe);
+
+	page_index = sector_nr / (PAGE_SIZE >> sfctx->fs_info->sectorsize_bits);
+
+	ASSERT(sfctx->pages[page_index]);
+	return sfctx->pages[page_index];
+}
+
+static unsigned int scrub_fs_get_page_offset(struct scrub_fs_ctx *sfctx,
+					     int sector_nr)
+{
+	const int sectors_per_stripe = BTRFS_STRIPE_LEN >>
+				       sfctx->fs_info->sectorsize_bits;
+
+	/* Basic checks to make sure we're accessing a valid sector. */
+	ASSERT(sector_nr >= 0 && sector_nr < sfctx->nr_copies * sectors_per_stripe);
+
+	return offset_in_page(sector_nr << sfctx->fs_info->sectorsize_bits);
+}
+
+static int scrub_fs_get_stripe_nr(struct scrub_fs_ctx *sfctx,
+				  struct page *first_page,
+				  unsigned int page_off)
+{
+	const int pages_per_stripe = BTRFS_STRIPE_LEN >> PAGE_SHIFT;
+	bool found = false;
+	int i;
+
+	/* The first sector should always be page aligned. */
+	ASSERT(page_off == 0);
+
+	for (i = 0; i < pages_per_stripe * sfctx->nr_copies; i++) {
+		if (first_page == sfctx->pages[i]) {
+			found = true;
+			break;
+		}
+	}
+	if (!found)
+		return -1;
+
+	ASSERT(IS_ALIGNED(i, pages_per_stripe));
+
+	return i / pages_per_stripe;
+}
+
+static void scrub_fs_read_endio(struct bio *bio)
+{
+	struct scrub_fs_ctx *sfctx = bio->bi_private;
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	struct bio_vec *first_bvec = bio_first_bvec_all(bio);
+	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
+	int bio_size = 0;
+	bool error = (bio->bi_status != BLK_STS_OK);
+	const int stripe_nr = scrub_fs_get_stripe_nr(sfctx, first_bvec->bv_page,
+						     first_bvec->bv_offset);
+	int i;
+
+	/* Grab the bio size for later sanity checks. */
+	bio_for_each_segment_all(bvec, bio, iter_all)
+		bio_size += bvec->bv_len;
+
+	/* We always submit a bio for a stripe length. */
+	ASSERT(bio_size == BTRFS_STRIPE_LEN);
+
+	for (i = 0; i < sfctx->sectors_per_stripe; i++) {
+		struct scrub_fs_sector *sector =
+			scrub_fs_get_sector(sfctx, i, stripe_nr);
+		/*
+		 * Here we only set the sector flags, don't do any stat update,
+		 * that will be done by the main thread when doing verification.
+		 */
+		if (error)
+			sector->flags |= SCRUB_FS_SECTOR_FLAG_IO_ERROR;
+		else
+			sector->flags |= SCRUB_FS_SECTOR_FLAG_IO_DONE;
+	}
+	atomic_sub(bio_size >> fs_info->sectorsize_bits,
+		   &sfctx->sectors_under_io);
+	wake_up(&sfctx->wait);
+	bio_put(bio);
+}
+
+static void submit_stripe_read_bio(struct scrub_fs_ctx *sfctx,
+				   struct btrfs_io_context *bioc,
+				   int stripe_nr)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	const int sectors_per_stripe = BTRFS_STRIPE_LEN >>
+				       fs_info->sectorsize_bits;
+	struct btrfs_io_stripe *stripe = &bioc->stripes[stripe_nr];
+	struct btrfs_device *dev = stripe->dev;
+	struct bio *bio;
+	int ret;
+	int i;
+
+	/*
+	 * Missing device, just mark the sectors with missing device
+	 * and continue to the next copy.
+	 */
+	if (!dev || !dev->bdev) {
+		mark_missing_dev_sectors(sfctx, stripe_nr);
+		return;
+	}
+
+	/* Submit a bio to read the stripe length. */
+	bio = bio_alloc(dev->bdev, BIO_MAX_VECS,
+			REQ_OP_READ | REQ_BACKGROUND, GFP_KERNEL);
+
+	/* The bio is backed by a mempool, the allocation should not fail. */
+	ASSERT(bio);
+
+	bio->bi_iter.bi_sector = stripe->physical >> SECTOR_SHIFT;
+	for (i = sectors_per_stripe * stripe_nr;
+	     i < sectors_per_stripe * (stripe_nr + 1); i++) {
+		struct page *page = scrub_fs_get_page(sfctx, i);
+		unsigned int page_off = scrub_fs_get_page_offset(sfctx, i);
+
+		ret = bio_add_page(bio, page, fs_info->sectorsize, page_off);
+
+		/*
+		 * Should not fail as we will at most add STRIPE_LEN / 4K
+		 * (i.e. 16) sectors, way smaller than BIO_MAX_VECS.
+		 */
+		ASSERT(ret == fs_info->sectorsize);
+	}
+
+	bio->bi_private = sfctx;
+	bio->bi_end_io = scrub_fs_read_endio;
+	atomic_add(sectors_per_stripe, &sfctx->sectors_under_io);
+	submit_bio(bio);
+}
+
+static int scrub_fs_one_stripe(struct scrub_fs_ctx *sfctx)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	struct btrfs_io_context *bioc = NULL;
+	u64 mapped_len = BTRFS_STRIPE_LEN;
+	int i;
+	int ret;
+
+	/* We should be at a stripe start inside the current block group. */
+	ASSERT(sfctx->cur_bg->start <= sfctx->cur_logical &&
+	       sfctx->cur_logical < sfctx->cur_bg->start +
+				    sfctx->cur_bg->length);
+	ASSERT(IS_ALIGNED(sfctx->cur_logical - sfctx->cur_bg->start,
+			  BTRFS_STRIPE_LEN));
+
+	btrfs_bio_counter_inc_blocked(fs_info);
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
+			sfctx->cur_logical, &mapped_len, &bioc);
+	if (ret < 0)
+		goto out;
+
+	if (mapped_len < BTRFS_STRIPE_LEN) {
+		btrfs_err_rl(fs_info,
+	"get short map for bytenr %llu, got mapped length %llu expect %u",
+			     sfctx->cur_logical, mapped_len, BTRFS_STRIPE_LEN);
+		ret = -EUCLEAN;
+		sfctx->stat.nr_fatal_errors++;
+		goto out;
+	}
+
+	if (bioc->num_stripes != sfctx->nr_copies) {
+		btrfs_err_rl(fs_info,
+		"got unexpected number of stripes, got %d stripes expect %d",
+			     bioc->num_stripes, sfctx->nr_copies);
+		ret = -EUCLEAN;
+		sfctx->stat.nr_fatal_errors++;
+		goto out;
+	}
+
+	for (i = 0; i < sfctx->nr_copies; i++)
+		submit_stripe_read_bio(sfctx, bioc, i);
+	wait_event(sfctx->wait, atomic_read(&sfctx->sectors_under_io) == 0);
+
+	/* Place holder to verify the read data. */
+out:
+	btrfs_put_bioc(bioc);
+	btrfs_bio_counter_dec(fs_info);
+	return ret;
+}
+
 static int scrub_fs_block_group(struct scrub_fs_ctx *sfctx,
 				struct btrfs_block_group *bg)
 {
@@ -4982,8 +5197,9 @@ static int scrub_fs_block_group(struct scrub_fs_ctx *sfctx,
 				break;
 		}
 
-		/* Place holder for real stripe scrubbing. */
-		ret = 0;
+		ret = scrub_fs_one_stripe(sfctx);
+		if (ret < 0)
+			break;
 
 		/* Reset the stripe for next run. */
 		scrub_fs_reset_stripe(sfctx);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH PoC 7/9] btrfs: scrub: implement metadata verification code for scrub_fs
  2022-09-03  8:19 [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs Qu Wenruo
                   ` (5 preceding siblings ...)
  2022-09-03  8:19 ` [PATCH PoC 6/9] btrfs: scrub: submit and wait for the read of each copy Qu Wenruo
@ 2022-09-03  8:19 ` Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 8/9] btrfs: scrub: implement data " Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 9/9] btrfs: scrub: implement recoverable sectors report " Qu Wenruo
  8 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  8:19 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces the following functions:

- scrub_fs_verify_one_stripe()
  The entry point for all verification code.

  It will iterate over every sector in the same vertical stripe.

- scrub_fs_verify_meta()
  The helper to verify metadata in one vertical stripe.
  (Since there is no RAID56 support yet, one vertical stripe just
   contains the same data from different mirrors.)

- scrub_fs_verify_one_meta()
  This does the real work; the checks include:

  * Basic metadata header checks (bytenr, fsid, level)
    For this part, we refactor those checks from
    validate_extent_buffer() into btrfs_validate_eb_basic(),
    allowing us to suppress the error messages.

  * Checksum verification
    For this part, we refactor this one check from
    validate_extent_buffer() into btrfs_validate_eb_csum(),
    allowing us to suppress the error message.

  * Tree check verification (NEW)
    This is the new one; the old scrub code never fully utilized the
    extent buffer facilities, thus it only did very basic checks.
    Now scrub_fs has (almost) the same checks as the tree block read
    routine; a sketch of the whole check cascade follows.
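
To make the ordering explicit, here is a minimal user-space model of
that check cascade. The eb_stub/meta_stat types and the helper below
are illustrative stand-ins rather than the kernel API; only the
progression of the checks mirrors the real code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct eb_stub {
	uint64_t bytenr;
	uint64_t expected_bytenr;
	uint64_t generation;
	uint64_t expected_generation;
	bool csum_ok;
	bool tree_check_ok;
};

struct meta_stat {
	uint64_t invalid;
	uint64_t bad_csum;
	uint64_t bad_transid;
	uint64_t good;
};

static void verify_one_meta(const struct eb_stub *eb, struct meta_stat *stat,
			    uint32_t nodesize)
{
	/* 1) Basic header sanity (bytenr/fsid/level in the real code). */
	if (eb->bytenr != eb->expected_bytenr) {
		stat->invalid += nodesize;
		return;
	}
	/* 2) Checksum of the whole tree block. */
	if (!eb->csum_ok) {
		stat->bad_csum += nodesize;
		return;
	}
	/* 3) Full tree-checker pass (node or full-leaf check). */
	if (!eb->tree_check_ok) {
		stat->invalid += nodesize;
		return;
	}
	/* 4) Transid check against the expected generation. */
	if (eb->generation != eb->expected_generation) {
		stat->bad_transid += nodesize;
		return;
	}
	stat->good += nodesize;
}

int main(void)
{
	/* A block that passes every check except the transid one. */
	struct eb_stub eb = {
		.bytenr = 30408704, .expected_bytenr = 30408704,
		.generation = 7, .expected_generation = 8,
		.csum_ok = true, .tree_check_ok = true,
	};
	struct meta_stat stat = { 0 };

	verify_one_meta(&eb, &stat, 16384);
	printf("bad_transid bytes: %llu\n",
	       (unsigned long long)stat.bad_transid);
	return 0;
}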

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c |  83 ++++++++++++++++-------
 fs/btrfs/disk-io.h |   2 +
 fs/btrfs/scrub.c   | 164 ++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 222 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e67614afcf4f..6eda9e615ae7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -457,55 +457,87 @@ static int check_tree_block_fsid(struct extent_buffer *eb)
 	return 1;
 }
 
-/* Do basic extent buffer checks at read time */
-static int validate_extent_buffer(struct extent_buffer *eb)
+/*
+ * The very basic extent buffer checks, including:
+ *
+ * - Bytenr check
+ * - FSID check
+ * - Level check
+ *
+ * If @error_message is true, it will output error message (rate limited).
+ */
+int btrfs_validate_eb_basic(struct extent_buffer *eb, bool error_message)
 {
 	struct btrfs_fs_info *fs_info = eb->fs_info;
 	u64 found_start;
-	const u32 csum_size = fs_info->csum_size;
 	u8 found_level;
-	u8 result[BTRFS_CSUM_SIZE];
-	const u8 *header_csum;
 	int ret = 0;
 
 	found_start = btrfs_header_bytenr(eb);
 	if (found_start != eb->start) {
-		btrfs_err_rl(fs_info,
+		if (error_message)
+			btrfs_err_rl(fs_info,
 			"bad tree block start, mirror %u want %llu have %llu",
-			     eb->read_mirror, eb->start, found_start);
-		ret = -EIO;
-		goto out;
+				     eb->read_mirror, eb->start, found_start);
+		return -EIO;
 	}
 	if (check_tree_block_fsid(eb)) {
-		btrfs_err_rl(fs_info, "bad fsid on logical %llu mirror %u",
-			     eb->start, eb->read_mirror);
-		ret = -EIO;
-		goto out;
+		if (error_message)
+			btrfs_err_rl(fs_info, "bad fsid on logical %llu mirror %u",
+				     eb->start, eb->read_mirror);
+		return -EIO;
 	}
 	found_level = btrfs_header_level(eb);
 	if (found_level >= BTRFS_MAX_LEVEL) {
-		btrfs_err(fs_info,
-			"bad tree block level, mirror %u level %d on logical %llu",
-			eb->read_mirror, btrfs_header_level(eb), eb->start);
-		ret = -EIO;
-		goto out;
+		if (error_message)
+			btrfs_err(fs_info,
+				"bad tree block level, mirror %u level %d on logical %llu",
+				eb->read_mirror, btrfs_header_level(eb), eb->start);
+		return -EIO;
 	}
+	return ret;
+}
+
+int btrfs_validate_eb_csum(struct extent_buffer *eb, bool error_message)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	u8 result[BTRFS_CSUM_SIZE];
+	const u8 *header_csum;
+	const u32 csum_size = fs_info->csum_size;
 
 	csum_tree_block(eb, result);
 	header_csum = page_address(eb->pages[0]) +
 		get_eb_offset_in_page(eb, offsetof(struct btrfs_header, csum));
 
 	if (memcmp(result, header_csum, csum_size) != 0) {
-		btrfs_warn_rl(fs_info,
+		if (error_message)
+			btrfs_warn_rl(fs_info,
 "checksum verify failed on logical %llu mirror %u wanted " CSUM_FMT " found " CSUM_FMT " level %d",
-			      eb->start, eb->read_mirror,
-			      CSUM_FMT_VALUE(csum_size, header_csum),
-			      CSUM_FMT_VALUE(csum_size, result),
-			      btrfs_header_level(eb));
-		ret = -EUCLEAN;
-		goto out;
+				      eb->start, eb->read_mirror,
+				      CSUM_FMT_VALUE(csum_size, header_csum),
+				      CSUM_FMT_VALUE(csum_size, result),
+				      btrfs_header_level(eb));
+		return -EUCLEAN;
 	}
+	return 0;
+}
+
+/* Do basic extent buffer checks at read time */
+static inline int validate_extent_buffer(struct extent_buffer *eb)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	u8 found_level;
+	int ret = 0;
+
+	ret = btrfs_validate_eb_basic(eb, true);
+	if (ret < 0)
+		return ret;
 
+	ret = btrfs_validate_eb_csum(eb, true);
+	if (ret < 0)
+		return ret;
+
+	found_level = btrfs_header_level(eb);
 	/*
 	 * If this is a leaf block and it is corrupt, set the corrupt bit so
 	 * that we don't try and read the other copies of this block, just
@@ -525,7 +557,6 @@ static int validate_extent_buffer(struct extent_buffer *eb)
 		btrfs_err(fs_info,
 		"read time tree block corruption detected on logical %llu mirror %u",
 			  eb->start, eb->read_mirror);
-out:
 	return ret;
 }
 
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 47ad8e0a2d33..3d154873d4e5 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -80,6 +80,8 @@ void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
 int btrfs_validate_metadata_buffer(struct btrfs_bio *bbio,
 				   struct page *page, u64 start, u64 end,
 				   int mirror);
+int btrfs_validate_eb_basic(struct extent_buffer *eb, bool error_message);
+int btrfs_validate_eb_csum(struct extent_buffer *eb, bool error_message);
 void btrfs_submit_metadata_bio(struct inode *inode, struct bio *bio, int mirror_num);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct btrfs_root *btrfs_alloc_dummy_root(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 70a54c6d8888..f587d0373517 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -20,6 +20,7 @@
 #include "rcu-string.h"
 #include "raid56.h"
 #include "block-group.h"
+#include "tree-checker.h"
 #include "zoned.h"
 
 /*
@@ -208,6 +209,9 @@ struct scrub_ctx {
 #define SCRUB_FS_SECTOR_FLAG_IO_ERROR		(1 << 5)
 #define SCRUB_FS_SECTOR_FLAG_IO_DONE		(1 << 6)
 
+/* This marks if the sector is a good one (aka, passed all checks). */
+#define SCRUB_FS_SECTOR_FLAG_GOOD		(1 << 7)
+
 /*
  * Represent a sector.
  *
@@ -311,6 +315,11 @@ struct scrub_fs_ctx {
 	 */
 	u8				*csum_buf;
 
+	/*
+	 * This is for METADATA block group verification, we use this dummy eb
+	 * to utilize all the accessors for extent buffers.
+	 */
+	struct extent_buffer		*dummy_eb;
 	wait_queue_head_t		wait;
 };
 
@@ -4599,6 +4608,10 @@ static void scrub_fs_cleanup_for_bg(struct scrub_fs_ctx *sfctx)
 	kfree(sfctx->csum_buf);
 	sfctx->csum_buf = NULL;
 
+	if (sfctx->dummy_eb) {
+		free_extent_buffer_stale(sfctx->dummy_eb);
+		sfctx->dummy_eb = NULL;
+	}
 	/* NOTE: block group will only be put inside scrub_fs_iterate_bgs(). */
 	sfctx->cur_bg = NULL;
 }
@@ -4626,6 +4639,7 @@ static int scrub_fs_init_for_bg(struct scrub_fs_ctx *sfctx,
 	ASSERT(!sfctx->pages);
 	ASSERT(!sfctx->sectors);
 	ASSERT(!sfctx->csum_buf);
+	ASSERT(!sfctx->dummy_eb);
 
 	read_lock(&map_tree->lock);
 	em = lookup_extent_mapping(map_tree, bg->start, bg->length);
@@ -4683,6 +4697,12 @@ static int scrub_fs_init_for_bg(struct scrub_fs_ctx *sfctx,
 		if (!sfctx->csum_buf)
 			goto enomem;
 	}
+	if (bg->flags & (BTRFS_BLOCK_GROUP_METADATA |
+			 BTRFS_BLOCK_GROUP_SYSTEM)) {
+		sfctx->dummy_eb = alloc_dummy_extent_buffer(fs_info, 0);
+		if (!sfctx->dummy_eb)
+			goto enomem;
+	}
 	sfctx->cur_bg = bg;
 	sfctx->cur_logical = bg->start;
 	return 0;
@@ -5101,6 +5121,148 @@ static void submit_stripe_read_bio(struct scrub_fs_ctx *sfctx,
 	submit_bio(bio);
 }
 
+static void scrub_fs_verify_one_meta(struct scrub_fs_ctx *sfctx, int sector_nr,
+				     int mirror_nr)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	struct extent_buffer *eb = sfctx->dummy_eb;
+	const int sectors_per_mirror = BTRFS_STRIPE_LEN >>
+				       fs_info->sectorsize_bits;
+	const u64 logical = sfctx->cur_logical + (sector_nr <<
+						  fs_info->sectorsize_bits);
+	const u64 expected_gen = sfctx->sectors[sector_nr].eb_generation;
+	int ret;
+	int i;
+
+	sfctx->stat.meta_scrubbed += fs_info->nodesize;
+
+	/* No IO is done for the sectors. Just update the accounting and exit. */
+	if (sfctx->sectors[sector_nr + mirror_nr * sectors_per_mirror].flags &
+	    (SCRUB_FS_SECTOR_FLAG_DEV_MISSING | SCRUB_FS_SECTOR_FLAG_IO_ERROR)) {
+		sfctx->stat.meta_io_fail += fs_info->nodesize;
+		return;
+	}
+
+	/* Sector_nr should always be the sector number in the first mirror. */
+	ASSERT(sector_nr >= 0 && sector_nr < sectors_per_mirror);
+	ASSERT(eb);
+
+	eb->start = logical;
+
+	/* Copy all the metadata sectors into the dummy eb. */
+	for (i = sector_nr + mirror_nr * sectors_per_mirror;
+	     i < sector_nr + mirror_nr * sectors_per_mirror +
+	     (fs_info->nodesize >> fs_info->sectorsize_bits); i++) {
+		struct page *page = scrub_fs_get_page(sfctx, i);
+		int page_off = scrub_fs_get_page_offset(sfctx, i);
+		int off_in_eb = (i - mirror_nr * sectors_per_mirror -
+				 sector_nr) << fs_info->sectorsize_bits;
+
+		write_extent_buffer(eb, page_address(page) + page_off,
+				    off_in_eb, fs_info->sectorsize);
+	}
+
+	/* Basic extent buffer checks. */
+	ret = btrfs_validate_eb_basic(eb, false);
+	if (ret < 0) {
+		sfctx->stat.meta_invalid += fs_info->nodesize;
+		return;
+	}
+	/* Csum checks. */
+	ret = btrfs_validate_eb_csum(eb, false);
+	if (ret < 0) {
+		sfctx->stat.meta_bad_csum += fs_info->nodesize;
+		return;
+	}
+	/* Full tree-check checks. */
+	if (btrfs_header_level(eb) > 0)
+		ret = btrfs_check_node(eb);
+	else
+		ret = btrfs_check_leaf_full(eb);
+	if (ret < 0) {
+		sfctx->stat.meta_invalid += fs_info->nodesize;
+		return;
+	}
+
+	/* Transid check */
+	if (btrfs_header_generation(eb) != expected_gen) {
+		sfctx->stat.meta_bad_transid += fs_info->nodesize;
+		return;
+	}
+
+	/*
+	 * All good, set the involved sectors with the GOOD flag so later we
+	 * can know if the vertical stripe has any good copy, and thus whether
+	 * corrupted sectors are recoverable.
+	 */
+	for (i = sector_nr + mirror_nr * sectors_per_mirror;
+	     i < sector_nr + mirror_nr * sectors_per_mirror +
+	     (fs_info->nodesize >> fs_info->sectorsize_bits); i++)
+		sfctx->sectors[i].flags |= SCRUB_FS_SECTOR_FLAG_GOOD;
+}
+
+static void scrub_fs_verify_meta(struct scrub_fs_ctx *sfctx, int sector_nr)
+{
+	struct extent_buffer *eb = sfctx->dummy_eb;
+	int i;
+
+	/*
+	 * This should be allocated when we enter the metadata block group.
+	 * If not allocated, this means the block group flags are unreliable.
+	 *
+	 * Either way, we can only exit with invalid metadata errors increased.
+	 */
+	if (!eb) {
+		btrfs_err_rl(sfctx->fs_info,
+	"block group %llu flag 0x%llx should not have metadata at %llu",
+			     sfctx->cur_bg->start, sfctx->cur_bg->flags,
+			     sfctx->cur_logical +
+			     (sector_nr << sfctx->fs_info->sectorsize_bits));
+		sfctx->stat.meta_invalid += sfctx->fs_info->nodesize;
+		return;
+	}
+
+	for (i = 0; i < sfctx->nr_copies; i++)
+		scrub_fs_verify_one_meta(sfctx, sector_nr, i);
+}
+
+static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	const int sectors_per_stripe = BTRFS_STRIPE_LEN >>
+				       fs_info->sectorsize_bits;
+	int sector_nr;
+
+	for (sector_nr = 0; sector_nr < sectors_per_stripe; sector_nr++) {
+		struct scrub_fs_sector *sector =
+			scrub_fs_get_sector(sfctx, sector_nr, 0);
+
+		/*
+		 * All sectors in the same vertical stripe should share
+		 * the same UNUSED/DATA/META flags, thus checking the UNUSED
+		 * flag of mirror 0 is enough to determine if this
+		 * vertical stripe needs verification.
+		 */
+		if (sector->flags & SCRUB_FS_SECTOR_FLAG_UNUSED)
+			continue;
+
+		if (sector->flags & SCRUB_FS_SECTOR_FLAG_META) {
+			scrub_fs_verify_meta(sfctx, sector_nr);
+			/*
+			 * For metadata, we need to skip to the end of the tree
+			 * block, and don't forget @sector_nr will get
+			 * increased by 1 from the for loop.
+			 */
+			sector_nr += (fs_info->nodesize >>
+				      fs_info->sectorsize_bits) - 1;
+			continue;
+		}
+
+		/* Place holder for data verification. */
+	}
+	/* Place holder for recoverable checks. */
+}
+
 static int scrub_fs_one_stripe(struct scrub_fs_ctx *sfctx)
 {
 	struct btrfs_fs_info *fs_info = sfctx->fs_info;
@@ -5144,7 +5306,7 @@ static int scrub_fs_one_stripe(struct scrub_fs_ctx *sfctx)
 		submit_stripe_read_bio(sfctx, bioc, i);
 	wait_event(sfctx->wait, atomic_read(&sfctx->sectors_under_io) == 0);
 
-	/* Place holder to verify the read data. */
+	scrub_fs_verify_one_stripe(sfctx);
 out:
 	btrfs_put_bioc(bioc);
 	btrfs_bio_counter_dec(fs_info);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH PoC 8/9] btrfs: scrub: implement data verification code for scrub_fs
  2022-09-03  8:19 [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs Qu Wenruo
                   ` (6 preceding siblings ...)
  2022-09-03  8:19 ` [PATCH PoC 7/9] btrfs: scrub: implement metadata verification code for scrub_fs Qu Wenruo
@ 2022-09-03  8:19 ` Qu Wenruo
  2022-09-03  8:19 ` [PATCH PoC 9/9] btrfs: scrub: implement recoverable sectors report " Qu Wenruo
  8 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  8:19 UTC (permalink / raw)
  To: linux-btrfs

For data verification it's much simpler: we only need to do csum
verification, and we have a very handy helper for it.
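
Below is a minimal user-space model of the per-sector flow. The stub
types and the toy checksum are stand-ins for the real helpers (the
kernel side uses btrfs_check_sector_csum()); only the control flow
mirrors the patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CSUM_SIZE 4	/* crc32c size, assumed for this sketch */

struct data_sector_stub {
	const uint8_t *data;	/* sector payload */
	const uint8_t *csum;	/* NULL means NODATASUM */
	bool io_done;
};

/* Toy checksum, NOT crc32c; only the mismatch handling matters here. */
static int check_sector_csum_stub(const uint8_t *data, size_t len,
				  const uint8_t *expected)
{
	uint8_t result[CSUM_SIZE] = { 0 };
	size_t i;

	for (i = 0; i < len; i++)
		result[i % CSUM_SIZE] ^= data[i];
	return memcmp(result, expected, CSUM_SIZE) ? -1 : 0;
}

static void verify_one_data(const struct data_sector_stub *sector,
			    uint64_t *io_fail, uint64_t *csum_mismatch,
			    uint32_t sectorsize)
{
	if (!sector->io_done) {
		*io_fail += sectorsize;
		return;
	}
	/* NODATASUM: the verdict is deferred to the cross-mirror compare. */
	if (!sector->csum)
		return;
	if (check_sector_csum_stub(sector->data, sectorsize, sector->csum))
		*csum_mismatch += sectorsize;
}

int main(void)
{
	uint8_t payload[4096] = { 0xab };
	uint8_t good_csum[CSUM_SIZE] = { 0xab, 0, 0, 0 };
	struct data_sector_stub sector = {
		.data = payload, .csum = good_csum, .io_done = true,
	};
	uint64_t io_fail = 0, csum_mismatch = 0;

	verify_one_data(&sector, &io_fail, &csum_mismatch, sizeof(payload));
	printf("io_fail=%llu csum_mismatch=%llu\n",
	       (unsigned long long)io_fail,
	       (unsigned long long)csum_mismatch);
	return 0;
}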

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f587d0373517..145bd5c9601d 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -5226,6 +5226,60 @@ static void scrub_fs_verify_meta(struct scrub_fs_ctx *sfctx, int sector_nr)
 		scrub_fs_verify_one_meta(sfctx, sector_nr, i);
 }
 
+/* Convert a sector pointer to its index inside sfctx->sectors[] array. */
+static int scrub_fs_sector_to_sector_index(struct scrub_fs_ctx *sfctx,
+					   struct scrub_fs_sector *sector)
+{
+	int i;
+
+	ASSERT(sector);
+	for (i = 0; i < sfctx->total_sectors; i++) {
+		if (&sfctx->sectors[i] == sector)
+			break;
+	}
+	/* As long as @sector is from sfctx->sectors[], we should have found it. */
+	ASSERT(i < sfctx->total_sectors);
+	return i; 
+}
+
+static void scrub_fs_verify_one_data(struct scrub_fs_ctx *sfctx, int sector_nr,
+				     int mirror_nr)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	struct scrub_fs_sector *sector =
+		scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
+	int index = scrub_fs_sector_to_sector_index(sfctx, sector);
+	struct page *data_page = scrub_fs_get_page(sfctx, index);
+	unsigned int data_page_off = scrub_fs_get_page_offset(sfctx, index);
+	u8 csum_result[BTRFS_CSUM_SIZE];
+	u8 *csum_expected = sector->csum;
+	int ret;
+
+	sfctx->stat.data_scrubbed += fs_info->sectorsize;
+
+	/* No IO done, just increase the accounting. */
+	if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE)) {
+		sfctx->stat.data_io_fail += fs_info->sectorsize;
+		return;
+	}
+
+	/*
+	 * NODATASUM case, just skip, we will come back later to determine
+	 * the status when checking the sectors inside the same vertical stripe.
+	 */
+	if (!csum_expected)
+		return;
+
+	ret = btrfs_check_sector_csum(fs_info, data_page, data_page_off,
+				      csum_result, csum_expected);
+	if (ret < 0) {
+		sfctx->stat.data_csum_mismatch += fs_info->sectorsize;
+		return;
+	}
+	/* All good. */
+	sector->flags |= SCRUB_FS_SECTOR_FLAG_GOOD;
+}
+
 static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx)
 {
 	struct btrfs_fs_info *fs_info = sfctx->fs_info;
@@ -5236,6 +5290,7 @@ static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx)
 	for (sector_nr = 0; sector_nr < sectors_per_stripe; sector_nr++) {
 		struct scrub_fs_sector *sector =
 			scrub_fs_get_sector(sfctx, sector_nr, 0);
+		int mirror_nr;
 
 		/*
 		 * All sectors in the same vertical stripe should share
@@ -5258,7 +5313,8 @@ static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx)
 			continue;
 		}
 
-		/* Place holder for data verification. */
+		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++)
+			scrub_fs_verify_one_data(sfctx, sector_nr, mirror_nr);
 	}
 	/* Place holder for recoverable checks. */
 }
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH PoC 9/9] btrfs: scrub: implement recoverable sectors report for scrub_fs
  2022-09-03  8:19 [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs Qu Wenruo
                   ` (7 preceding siblings ...)
  2022-09-03  8:19 ` [PATCH PoC 8/9] btrfs: scrub: implement data " Qu Wenruo
@ 2022-09-03  8:19 ` Qu Wenruo
  2022-09-03 11:22   ` kernel test robot
  2022-09-03 11:33   ` kernel test robot
  8 siblings, 2 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  8:19 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces two new functions:
- scrub_fs_recover_meta()
- scrub_fs_recover_data()

Both of them check whether we can recover the data/meta sectors inside
a vertical stripe, and if so, update the stat->data/meta_recoverable
accounting.

Neither of them implements the writeback functionality yet, though.

For scrub_fs_recover_meta() it's not much different from the existing
scrub: as long as we have one good copy, all the other corrupted copies
can be recovered.

For scrub_fs_recover_data(), besides the existing csum based
verification, for NODATASUM cases we will make sure all the copies
match each other (see the sketch below).

If they don't match, we will report all the sectors as uncertain.
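
Below is a minimal user-space model of the NODATASUM verdict. The
function and variable names are made up for the sketch; the pairwise
"compare each readable mirror against the next readable one" walk is
the same shape as in scrub_fs_recover_data():

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static bool nodatasum_all_match(const unsigned char *mirrors[],
				const bool io_done[], int nr_copies,
				size_t sectorsize)
{
	int cur;

	for (cur = 0; cur < nr_copies - 1; cur++) {
		int next;

		/* This mirror failed its read, skip it. */
		if (!io_done[cur])
			continue;
		/* Find the next mirror whose read succeeded. */
		for (next = cur + 1; next < nr_copies; next++)
			if (io_done[next])
				break;
		/* cur was already the last working copy. */
		if (next == nr_copies)
			break;
		if (memcmp(mirrors[cur], mirrors[next], sectorsize))
			return false;
	}
	return true;
}

int main(void)
{
	const unsigned char a[4096] = { 0 }, b[4096] = { 1 };
	const unsigned char *mirrors[2] = { a, b };
	const bool io_done[2] = { true, true };

	/* Mismatching copies: all readable sectors become "uncertain". */
	printf("all match: %d\n",
	       nodatasum_all_match(mirrors, io_done, 2, sizeof(a)));
	return 0;
}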

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 197 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 196 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 145bd5c9601d..fdac9b8e4cf1 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -212,6 +212,12 @@ struct scrub_ctx {
 /* This marks if the sector is a good one (aka, passed all checks). */
 #define SCRUB_FS_SECTOR_FLAG_GOOD		(1 << 7)
 
+/*
+ * This marks if the sector is recoverable (thus it must have been
+ * corrupted in the first place).
+ */
+#define SCRUB_FS_SECTOR_FLAG_RECOVERABLE	(1 << 8)
+
 /*
  * Represent a sector.
  *
@@ -5280,6 +5286,177 @@ static void scrub_fs_verify_one_data(struct scrub_fs_ctx *sfctx, int sector_nr,
 	sector->flags |= SCRUB_FS_SECTOR_FLAG_GOOD;
 }
 
+static void scrub_fs_recover_meta(struct scrub_fs_ctx *sfctx, int sector_nr)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	int nr_good = 0;
+	int mirror_nr;
+	int cur_sector;
+
+	for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
+		struct scrub_fs_sector *sector =
+			scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
+
+		if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD)
+			nr_good++;
+	}
+
+	/* No good copy found, just error out. */
+	if (!nr_good)
+		return;
+	/*
+	 * We have at least one good copy, thus all the remaining
+	 * bad copies can be recovered.
+	 */
+	sfctx->stat.meta_recoverable += (sfctx->nr_copies - nr_good) *
+					fs_info->nodesize;
+
+	/* Mark those bad sectors recoverable. */
+	for (cur_sector = sector_nr; cur_sector < sector_nr +
+	     (fs_info->nodesize >> fs_info->sectorsize_bits); cur_sector++) {
+		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
+			struct scrub_fs_sector *sector =
+				scrub_fs_get_sector(sfctx, cur_sector, mirror_nr);
+
+			if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD)
+				continue;
+
+			sector->flags |= SCRUB_FS_SECTOR_FLAG_RECOVERABLE;
+		}
+	}
+	/* Place holder for writeback. */
+}
+
+static int scrub_fs_memcmp_sectors(struct scrub_fs_ctx *sfctx,
+				   struct scrub_fs_sector *sector1,
+				   struct scrub_fs_sector *sector2)
+{
+	struct page *page1;
+	struct page *page2;
+	unsigned int page_off1;
+	unsigned int page_off2;
+	int sector_nr1;
+	int sector_nr2;
+
+	sector_nr1 = scrub_fs_sector_to_sector_index(sfctx, sector1);
+	sector_nr2 = scrub_fs_sector_to_sector_index(sfctx, sector2);
+
+	page1 = scrub_fs_get_page(sfctx, sector_nr1);
+	page_off1 = scrub_fs_get_page_offset(sfctx, sector_nr1);
+
+	page2 = scrub_fs_get_page(sfctx, sector_nr2);
+	page_off2 = scrub_fs_get_page_offset(sfctx, sector_nr2);
+
+	return memcmp(page_address(page1) + page_off1,
+		      page_address(page2) + page_off2,
+		      sfctx->fs_info->sectorsize);
+}
+
+/* Find the next sector in the same vertical stripe which has read IO done. */
+static struct scrub_fs_sector *scrub_fs_find_next_working_mirror(
+		struct scrub_fs_ctx *sfctx, int sector_nr, int mirror_nr)
+{
+	int i;
+
+	for (i = mirror_nr + 1; i < sfctx->nr_copies; i++) {
+		struct scrub_fs_sector *sector =
+			scrub_fs_get_sector(sfctx, sector_nr, i);
+
+		/* This copy has IO error, skip it. */
+		if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE))
+			continue;
+
+		/* Found a good copy. */
+		return sector;
+	}
+	return NULL;
+}
+
+static void scrub_fs_recover_data(struct scrub_fs_ctx *sfctx, int sector_nr)
+{
+	struct btrfs_fs_info *fs_info = sfctx->fs_info;
+	const bool has_csum = !!sfctx->sectors[sector_nr].csum;
+	bool mismatch_found = false;
+	int nr_good = 0;
+	int io_fail = 0;
+	int mirror_nr;
+
+	for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
+		struct scrub_fs_sector *sector =
+			scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
+
+		if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD)
+			nr_good++;
+		if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE))
+			io_fail++;
+	}
+
+	if (has_csum) {
+		/*
+		 * There is at least one good copy, thus all the other
+		 * corrupted sectors can also be recovered.
+		 */
+		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
+			struct scrub_fs_sector *sector =
+				scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
+
+			if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD)
+				continue;
+			sector->flags |= SCRUB_FS_SECTOR_FLAG_RECOVERABLE;
+			sfctx->stat.data_recoverable += fs_info->sectorsize;
+		}
+
+		/* Place holder for writeback */
+		return;
+	}
+
+	/*
+	 * No datasum case, it's much harder.
+	 *
+	 * The idea is, we have to compare all the sectors to determine if they
+	 * match.
+	 *
+	 * Firstly rule out sectors which don't have extra working copies.
+	 */
+	if (sfctx->nr_copies - io_fail <= 1) {
+		sfctx->stat.data_nocsum_uncertain += fs_info->sectorsize;
+		return;
+	}
+
+	/*
+	 * For now we can only handle one case: all the data we read matches
+	 * each other, or we consider all the readable copies uncertain.
+	 */
+	for (mirror_nr = 0; mirror_nr < sfctx->nr_copies - 1; mirror_nr++) {
+		struct scrub_fs_sector *sector =
+			scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
+		struct scrub_fs_sector *next_sector;
+		int ret;
+
+		/* The first sector has IO error, skip to the next run. */
+		if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE))
+			continue;
+
+		next_sector = scrub_fs_find_next_working_mirror(sfctx,
+				sector_nr, mirror_nr);
+		/* We're already the last working copy, can break now. */
+		if (!next_sector)
+			break;
+
+		ret = scrub_fs_memcmp_sectors(sfctx, sector, next_sector);
+		if (ret)
+			mismatch_found = true;
+	}
+
+	/*
+	 * We have found mismatched contents; mark all those sectors
+	 * which don't have an IO error as uncertain.
+	 */
+	if (mismatch_found)
+		sfctx->stat.data_nocsum_uncertain +=
+			(sfctx->nr_copies - io_fail) << fs_info->sectorsize_bits;
+}
+
 static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx)
 {
 	struct btrfs_fs_info *fs_info = sfctx->fs_info;
@@ -5316,7 +5493,25 @@ static void scrub_fs_verify_one_stripe(struct scrub_fs_ctx *sfctx)
 		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++)
 			scrub_fs_verify_one_data(sfctx, sector_nr, mirror_nr);
 	}
-	/* Place holder for recoverable checks. */
+
+	/*
+	 * Now all sectors are checked, re-check which sectors are recoverable.
+	 */
+	for (sector_nr  = 0; sector_nr < sectors_per_stripe; sector_nr++) {
+		struct scrub_fs_sector *sector = &sfctx->sectors[sector_nr];
+
+		if (sector->flags & SCRUB_FS_SECTOR_FLAG_UNUSED)
+			continue;
+
+		if (sector->flags & SCRUB_FS_SECTOR_FLAG_META) {
+			scrub_fs_recover_meta(sfctx, sector_nr);
+			sector_nr += (fs_info->nodesize >>
+				      fs_info->sectorsize_bits) - 1;
+			continue;
+		}
+
+		scrub_fs_recover_data(sfctx, sector_nr);
+	}
 }
 
 static int scrub_fs_one_stripe(struct scrub_fs_ctx *sfctx)
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  2022-09-03  8:19 ` [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls Qu Wenruo
@ 2022-09-03  9:25   ` Wang Yugui
  2022-09-03  9:37     ` Qu Wenruo
  2022-09-03 11:47   ` kernel test robot
  2022-09-09  4:17   ` Wang Yugui
  2 siblings, 1 reply; 22+ messages in thread
From: Wang Yugui @ 2022-09-03  9:25 UTC (permalink / raw)
  To: linux-btrfs

Hi,

> The new ioctls are to address the disadvantages of the existing
> btrfs_scrub_dev():
> 
> a One thread per-device
>   This can cause multiple block groups to be marked read-only for scrub,
>   reducing available space temporarily.
> 
>   This also causes higher CPU/IO usage.
>   For scrub, we should use the minimal amount of CPU and cause less
>   IO when possible.
> 
> b Extra IO for RAID56
>   For data stripes, we will cause at least 2x IO if we run "btrfs scrub
>   start <mnt>".
>   1x from scrubbing the device of data stripe.
>   The other 1x from scrubbing the parity stripe.
> 
>   This duplicated IO should definitely be avoided
> 
> c Bad progress report for RAID56
>   We can not report any repaired P/Q bytes at all.
> 
> The a and b will be addressed by the new one thread per-fs
> btrfs_scrub_fs ioctl.

The CRC check in scrub is CPU sensitive, so we may still need multiple
threads, such as one thread per fs but with an additional thread pool
based on chunks?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/09/03



> While c will be addressed by the new btrfs_scrub_fs_progress structure,
> which has better comments and classification for all errors.
> 
> This patch is only a skeleton for the new family of ioctls, will return
> -EOPNOTSUPP for now.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ioctl.c           |   6 ++
>  include/uapi/linux/btrfs.h | 173 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 179 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index fe0cc816b4eb..3df3bcdf06eb 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -5508,6 +5508,12 @@ long btrfs_ioctl(struct file *file, unsigned int
>  		return btrfs_ioctl_scrub_cancel(fs_info);
>  	case BTRFS_IOC_SCRUB_PROGRESS:
>  		return btrfs_ioctl_scrub_progress(fs_info, argp);
> +	case BTRFS_IOC_SCRUB_FS:
> +		return -EOPNOTSUPP;
> +	case BTRFS_IOC_SCRUB_FS_CANCEL:
> +		return -EOPNOTSUPP;
> +	case BTRFS_IOC_SCRUB_FS_PROGRESS:
> +		return -EOPNOTSUPP;
>  	case BTRFS_IOC_BALANCE_V2:
>  		return btrfs_ioctl_balance(file, argp);
>  	case BTRFS_IOC_BALANCE_CTL:
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 7ada84e4a3ed..795ed33843ce 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -191,6 +191,174 @@ struct btrfs_ioctl_scrub_args {
>  	__u64 unused[(1024-32-sizeof(struct btrfs_scrub_progress))/8];
>  };
>  
> +struct btrfs_scrub_fs_progress {
> +	/*
> +	 * Fatal errors, including -ENOMEM, or csum/extent tree search errors.
> +	 *
> +	 * Normally after hitting such fatal errors, we error out, thus later
> +	 * accounting will no longer be reliable.
> +	 */
> +	__u16	nr_fatal_errors;
> +
> +	/*
> +	 * All super errors, from invalid members and IO error all go into
> +	 * nr_super_errors.
> +	 */
> +	__u16	nr_super_errors;
> +
> +	/* Super block accounting. */
> +	__u16	nr_super_scrubbed;
> +	__u16	nr_super_repaired;
> +
> +	/*
> +	 * Data accounting in bytes.
> +	 *
> +	 * We only care about how many bytes we scrubbed, thus no
> +	 * accounting for number of extents.
> +	 *
> +	 * This accounting includes the extra mirrors.
> +	 * E.g. for RAID1, one 16KiB extent will cause 32KiB in @data_scrubbed.
> +	 */
> +	__u64	data_scrubbed;
> +
> +	/* How many bytes can be recovered. */
> +	__u64	data_recoverable;
> +
> +	/*
> +	 * How many bytes are uncertain, this can only happen for NODATASUM
> +	 * cases.
> +	 * Including NODATASUM, and no extra mirror/parity to verify.
> +	 * Or has extra mirrors, but they mismatch with each other.
> +	 */
> +	__u64	data_nocsum_uncertain;
> +
> +	/*
> +	 * For data error bytes, these mean definite errors, including:
> +	 *
> +	 * - IO failure, including missing dev.
> +	 * - Data csum mismatch
> +	 *   Csum tree search failures count as fatal errors instead.
> +	 */
> +	__u64	data_io_fail;
> +	__u64	data_csum_mismatch;
> +
> +	/*
> +	 * All the unmentioned cases, including data matching its csum (of
> +	 * course, implies IO succeeded) and data has no csum but matches all
> +	 * other copies/parities, are the expected cases, no need to record.
> +	 */
> +
> +	/*
> +	 * Metadata accounting in bytes, pretty much the same as data.
> +	 *
> +	 * And since metadata has mandatory csum, there is no uncertain case.
> +	 */
> +	__u64	meta_scrubbed;
> +	__u64	meta_recoverable;
> +
> +	/*
> +	 * For meta, the checks are mostly progressive:
> +	 *
> +	 * - Unable to read
> +	 *   @meta_io_fail
> +	 *
> +	 * - Unable to pass basic sanity checks (e.g. bytenr check)
> +	 *   @meta_invalid
> +	 *
> +	 * - Pass basic sanity checks, but bad csum
> +	 *   @meta_bad_csum
> +	 *
> +	 * - Pass basic checks and csum, but bad transid
> +	 *   @meta_bad_transid
> +	 *
> +	 * - Pass all checks
> +	 *   The expected case, no special accounting needed.
> +	 */
> +	__u64 meta_io_fail;
> +	__u64 meta_invalid;
> +	__u64 meta_bad_csum;
> +	__u64 meta_bad_transid;
> +
> +	/*
> +	 * Parity accounting.
> +	 *
> +	 * NOTE: for unused data sectors (but still contributes to P/Q
> +	 * calculation, like the following case), they don't contribute
> +	 * to any accounting.
> +	 *
> +	 * Data 1:	|<--- Unused ---->| <<<
> +	 * Data 2:	|<- Data extent ->|
> +	 * Parity:	|<--- Parity ---->|
> +	 */
> +	__u64 parity_scrubbed;
> +	__u64 parity_recoverable;
> +
> +	/*
> +	 * This happens when there is not enough info to determine if the
> +	 * parity is correct, mostly happens when vertical stripes are
> +	 * *all* NODATASUM sectors.
> +	 *
> +	 * If there is any sector with checksum in the vertical stripe,
> +	 * parity itself will be no longer uncertain.
> +	 */
> +	__u64 parity_uncertain;
> +
> +	/*
> +	 * For parity, the checks are progressive too:
> +	 *
> +	 * - Unable to read
> +	 *   @parity_io_fail
> +	 *
> +	 * - Mismatch and any vertical data stripe has csum and
> +	 *   the data stripe csum matches
> +	 *   @parity_mismatch
> +	 *   We want to repair the parity then.
> +	 *
> +	 * - Mismatch and vertical data stripe has csum, and data
> +	 *   csum mismatch. And rebuilt data passes csum.
> +	 *   This will go @data_recoverable or @data_csum_mismatch instead.
> +	 *
> +	 * - Mismatch but no vertical data stripe has csum
> +	 *   @parity_uncertain
> +	 *
> +	 */
> +	__u64 parity_io_fail;
> +	__u64 parity_mismatch;
> +
> +	/* Padding to 256 bytes, and for later expansion. */
> +	__u64 __unused[15];
> +};
> +static_assert(sizeof(struct btrfs_scrub_fs_progress) == 256);
> +
> +/*
> + * Readonly scrub fs will not try any repair (thus *_repaired member
> + * in scrub_fs_progress should always be 0).
> + */
> +#define BTRFS_SCRUB_FS_FLAG_READONLY	(1ULL << 0)
> +
> +/*
> + * All supported flags.
> + *
> + * From the very beginning, scrub_fs ioctl would reject any unsupported
> + * flags, making later expansion much simpler.
> + */
> +#define BTRFS_SCRUB_FS_FLAG_SUPP	(BTRFS_SCRUB_FS_FLAG_READONLY)
> +
> +struct btrfs_ioctl_scrub_fs_args {
> +	/* Input, logical bytenr to start the scrub */
> +	__u64 start;
> +
> +	/* Input, the logical bytenr end (inclusive) */
> +	__u64 end;
> +
> +	__u64 flags;
> +	__u64 reserved[8];
> +	struct btrfs_scrub_fs_progress progress; /* out */
> +
> +	/* pad to 1K */
> +	__u8 unused[1024 - 24 - 64 - sizeof(struct btrfs_scrub_fs_progress)];
> +};
> +
>  #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS	0
>  #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID	1
>  struct btrfs_ioctl_dev_replace_start_params {
> @@ -1137,5 +1305,10 @@ enum btrfs_err_code {
>  				    struct btrfs_ioctl_encoded_io_args)
>  #define BTRFS_IOC_ENCODED_WRITE _IOW(BTRFS_IOCTL_MAGIC, 64, \
>  				     struct btrfs_ioctl_encoded_io_args)
> +#define BTRFS_IOC_SCRUB_FS	_IOWR(BTRFS_IOCTL_MAGIC, 65, \
> +				      struct btrfs_ioctl_scrub_fs_args)
> +#define BTRFS_IOC_SCRUB_FS_CANCEL _IO(BTRFS_IOCTL_MAGIC, 66)
> +#define BTRFS_IOC_SCRUB_FS_PROGRESS _IOWR(BTRFS_IOCTL_MAGIC, 67, \
> +				       struct btrfs_ioctl_scrub_fs_args)
>  
>  #endif /* _UAPI_LINUX_BTRFS_H */
> -- 
> 2.37.3
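
For reference, a minimal user-space sketch of how the new ioctl could
be driven. This is hypothetical usage, not part of the series; it
assumes headers from a kernel with this patch applied, and uses only
the fields and flags defined above:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(int argc, char *argv[])
{
	struct btrfs_ioctl_scrub_fs_args args;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <mnt>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(&args, 0, sizeof(args));
	args.start = 0;
	args.end = (__u64)-1;	/* inclusive end, so this covers the whole fs */
	args.flags = BTRFS_SCRUB_FS_FLAG_READONLY;	/* report only, no repair */
	if (ioctl(fd, BTRFS_IOC_SCRUB_FS, &args) < 0)
		perror("BTRFS_IOC_SCRUB_FS");
	else
		printf("data scrubbed: %llu bytes, meta scrubbed: %llu bytes\n",
		       (unsigned long long)args.progress.data_scrubbed,
		       (unsigned long long)args.progress.meta_scrubbed);
	close(fd);
	return 0;
}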



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  2022-09-03  9:25   ` Wang Yugui
@ 2022-09-03  9:37     ` Qu Wenruo
  2022-09-03  9:49       ` Wang Yugui
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03  9:37 UTC (permalink / raw)
  To: Wang Yugui, linux-btrfs



On 2022/9/3 17:25, Wang Yugui wrote:
> Hi,
>
>> The new ioctls are to address the disadvantages of the existing
>> btrfs_scrub_dev():
>>
>> a One thread per-device
>>    This can cause multiple block groups to be marked read-only for scrub,
>>    reducing available space temporarily.
>>
>>    This also causes higher CPU/IO usage.
>>    For scrub, we should use the minimal amount of CPU and cause less
>>    IO when possible.
>>
>> b Extra IO for RAID56
>>    For data stripes, we will cause at least 2x IO if we run "btrfs scrub
>>    start <mnt>".
>>    1x from scrubbing the device of data stripe.
>>    The other 1x from scrubbing the parity stripe.
>>
>>    This duplicated IO should definitely be avoided
>>
>> c Bad progress report for RAID56
>>    We can not report any repaired P/Q bytes at all.
>>
>> The a and b will be addressed by the new one thread per-fs
>> btrfs_scrub_fs ioctl.
>
> The CRC check in scrub is CPU sensitive, so we may still need multiple
> threads, such as one thread per fs but with an additional thread pool
> based on chunks?

This depends.

Scrub should be background work, which can already be affected by
scheduling, and I don't think users would mind a 5% or 10% longer
runtime for a several-TB fs.

Furthermore, if checksumming in a single thread is going to be a
bottleneck, then I'd say your storage is already so fast that scrub
duration is not your primary concern any more.

Yes, it may be possible to offload the csum verification into multiple
threads, like one thread per mirror/device, but I don't want to
sacrifice readability for a very minor performance improvement.
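
Purely to illustrate the idea being discussed (this is not something
proposed in the series), a user-space model of one verification thread
per mirror; an in-kernel version would presumably queue btrfs
workqueue items instead of spawning pthreads:

#include <pthread.h>
#include <stdio.h>

#define NR_COPIES 2

/* Per-mirror verification job; real code would carry sectors + csums. */
struct mirror_work {
	int mirror_nr;
	unsigned long bad_sectors;	/* filled in by the worker */
};

static void *verify_mirror(void *arg)
{
	struct mirror_work *w = arg;

	/* The csum check of every sector of this mirror would go here. */
	w->bad_sectors = 0;
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_COPIES];
	struct mirror_work works[NR_COPIES];
	int i;

	for (i = 0; i < NR_COPIES; i++) {
		works[i].mirror_nr = i;
		pthread_create(&threads[i], NULL, verify_mirror, &works[i]);
	}
	for (i = 0; i < NR_COPIES; i++) {
		pthread_join(threads[i], NULL);
		printf("mirror %d: %lu bad sectors\n",
		       works[i].mirror_nr, works[i].bad_sectors);
	}
	return 0;
}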

>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/09/03
>
>
>
>> While c will be addressed by the new btrfs_scrub_fs_progress structure,
>> which has better comments and classification for all errors.
>>
>> This patch is only a skeleton for the new family of ioctls, will return
>> -EOPNOTSUPP for now.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/ioctl.c           |   6 ++
>>   include/uapi/linux/btrfs.h | 173 +++++++++++++++++++++++++++++++++++++
>>   2 files changed, 179 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index fe0cc816b4eb..3df3bcdf06eb 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -5508,6 +5508,12 @@ long btrfs_ioctl(struct file *file, unsigned int
>>   		return btrfs_ioctl_scrub_cancel(fs_info);
>>   	case BTRFS_IOC_SCRUB_PROGRESS:
>>   		return btrfs_ioctl_scrub_progress(fs_info, argp);
>> +	case BTRFS_IOC_SCRUB_FS:
>> +		return -EOPNOTSUPP;
>> +	case BTRFS_IOC_SCRUB_FS_CANCEL:
>> +		return -EOPNOTSUPP;
>> +	case BTRFS_IOC_SCRUB_FS_PROGRESS:
>> +		return -EOPNOTSUPP;
>>   	case BTRFS_IOC_BALANCE_V2:
>>   		return btrfs_ioctl_balance(file, argp);
>>   	case BTRFS_IOC_BALANCE_CTL:
>> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
>> index 7ada84e4a3ed..795ed33843ce 100644
>> --- a/include/uapi/linux/btrfs.h
>> +++ b/include/uapi/linux/btrfs.h
>> @@ -191,6 +191,174 @@ struct btrfs_ioctl_scrub_args {
>>   	__u64 unused[(1024-32-sizeof(struct btrfs_scrub_progress))/8];
>>   };
>>
>> +struct btrfs_scrub_fs_progress {
>> +	/*
>> +	 * Fatal errors, including -ENOMEM, or csum/extent tree search errors.
>> +	 *
>> +	 * Normally after hitting such fatal errors, we error out, thus later
>> +	 * accounting will no longer be reliable.
>> +	 */
>> +	__u16	nr_fatal_errors;
>> +
>> +	/*
>> +	 * All super errors, from invalid members and IO error all go into
>> +	 * nr_super_errors.
>> +	 */
>> +	__u16	nr_super_errors;
>> +
>> +	/* Super block accounting. */
>> +	__u16	nr_super_scrubbed;
>> +	__u16	nr_super_repaired;
>> +
>> +	/*
>> +	 * Data accounting in bytes.
>> +	 *
>> +	 * We only care about how many bytes we scrubbed, thus no
>> +	 * accounting for number of extents.
>> +	 *
>> +	 * This accounting includes the extra mirrors.
>> +	 * E.g. for RAID1, one 16KiB extent will cause 32KiB in @data_scrubbed.
>> +	 */
>> +	__u64	data_scrubbed;
>> +
>> +	/* How many bytes can be recovered. */
>> +	__u64	data_recoverable;
>> +
>> +	/*
>> +	 * How many bytes are uncertain, this can only happen for NODATASUM
>> +	 * cases.
>> +	 * Including NODATASUM, and no extra mirror/parity to verify.
>> +	 * Or has extra mirrors, but they mismatch with each other.
>> +	 */
>> +	__u64	data_nocsum_uncertain;
>> +
>> +	/*
>> +	 * For data error bytes, these mean definite errors, including:
>> +	 *
>> +	 * - IO failure, including missing dev.
>> +	 * - Data csum mismatch
>> +	 *   Csum tree search failures count as fatal errors instead.
>> +	 */
>> +	__u64	data_io_fail;
>> +	__u64	data_csum_mismatch;
>> +
>> +	/*
>> +	 * All the unmentioned cases, including data matching its csum (of
>> +	 * course, implies IO succeeded) and data has no csum but matches all
>> +	 * other copies/parities, are the expected cases, no need to record.
>> +	 */
>> +
>> +	/*
>> +	 * Metadata accounting in bytes, pretty much the same as data.
>> +	 *
>> +	 * And since metadata has mandatory csum, there is no uncertain case.
>> +	 */
>> +	__u64	meta_scrubbed;
>> +	__u64	meta_recoverable;
>> +
>> +	/*
>> +	 * For meta, the checks are mostly progressive:
>> +	 *
>> +	 * - Unable to read
>> +	 *   @meta_io_fail
>> +	 *
>> +	 * - Unable to pass basic sanity checks (e.g. bytenr check)
>> +	 *   @meta_invalid
>> +	 *
>> +	 * - Pass basic sanity checks, but bad csum
>> +	 *   @meta_bad_csum
>> +	 *
>> +	 * - Pass basic checks and csum, but bad transid
>> +	 *   @meta_bad_transid
>> +	 *
>> +	 * - Pass all checks
>> +	 *   The expected case, no special accounting needed.
>> +	 */
>> +	__u64 meta_io_fail;
>> +	__u64 meta_invalid;
>> +	__u64 meta_bad_csum;
>> +	__u64 meta_bad_transid;
>> +
>> +	/*
>> +	 * Parity accounting.
>> +	 *
>> +	 * NOTE: for unused data sectors (but still contributes to P/Q
>> +	 * calculation, like the following case), they don't contribute
>> +	 * to any accounting.
>> +	 *
>> +	 * Data 1:	|<--- Unused ---->| <<<
>> +	 * Data 2:	|<- Data extent ->|
>> +	 * Parity:	|<--- Parity ---->|
>> +	 */
>> +	__u64 parity_scrubbed;
>> +	__u64 parity_recoverable;
>> +
>> +	/*
>> +	 * This happens when there is not enough info to determine if the
>> +	 * parity is correct, mostly happens when vertical stripes are
>> +	 * *all* NODATASUM sectors.
>> +	 *
>> +	 * If there is any sector with checksum in the vertical stripe,
>> +	 * parity itself will be no longer uncertain.
>> +	 */
>> +	__u64 parity_uncertain;
>> +
>> +	/*
>> +	 * For parity, the checks are progressive too:
>> +	 *
>> +	 * - Unable to read
>> +	 *   @parity_io_fail
>> +	 *
>> +	 * - Mismatch and any vertical data stripe has csum and
>> +	 *   the data stripe csum matches
>> +	 *   @parity_mismatch
>> +	 *   We want to repair the parity then.
>> +	 *
>> +	 * - Mismatch and vertical data stripe has csum, and data
>> +	 *   csum mismatch. And rebuilt data passes csum.
>> +	 *   This will go @data_recoverable or @data_csum_mismatch instead.
>> +	 *
>> +	 * - Mismatch but no vertical data stripe has csum
>> +	 *   @parity_uncertain
>> +	 *
>> +	 */
>> +	__u64 parity_io_fail;
>> +	__u64 parity_mismatch;
>> +
>> +	/* Padding to 256 bytes, and for later expansion. */
>> +	__u64 __unused[15];
>> +};
>> +static_assert(sizeof(struct btrfs_scrub_fs_progress) == 256);
>> +
>> +/*
>> + * Readonly scrub fs will not try any repair (thus *_repaired member
>> + * in scrub_fs_progress should always be 0).
>> + */
>> +#define BTRFS_SCRUB_FS_FLAG_READONLY	(1ULL << 0)
>> +
>> +/*
>> + * All supported flags.
>> + *
>> + * From the very beginning, scrub_fs ioctl would reject any unsupported
>> + * flags, making later expansion much simpler.
>> + */
>> +#define BTRFS_SCRUB_FS_FLAG_SUPP	(BTRFS_SCRUB_FS_FLAG_READONLY)
>> +
>> +struct btrfs_ioctl_scrub_fs_args {
>> +	/* Input, logical bytenr to start the scrub */
>> +	__u64 start;
>> +
>> +	/* Input, the logical bytenr end (inclusive) */
>> +	__u64 end;
>> +
>> +	__u64 flags;
>> +	__u64 reserved[8];
>> +	struct btrfs_scrub_fs_progress progress; /* out */
>> +
>> +	/* pad to 1K */
>> +	__u8 unused[1024 - 24 - 64 - sizeof(struct btrfs_scrub_fs_progress)];
>> +};
>> +
>>   #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS	0
>>   #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID	1
>>   struct btrfs_ioctl_dev_replace_start_params {
>> @@ -1137,5 +1305,10 @@ enum btrfs_err_code {
>>   				    struct btrfs_ioctl_encoded_io_args)
>>   #define BTRFS_IOC_ENCODED_WRITE _IOW(BTRFS_IOCTL_MAGIC, 64, \
>>   				     struct btrfs_ioctl_encoded_io_args)
>> +#define BTRFS_IOC_SCRUB_FS	_IOWR(BTRFS_IOCTL_MAGIC, 65, \
>> +				      struct btrfs_ioctl_scrub_fs_args)
>> +#define BTRFS_IOC_SCRUB_FS_CANCEL _IO(BTRFS_IOCTL_MAGIC, 66)
>> +#define BTRFS_IOC_SCRUB_FS_PROGRESS _IOWR(BTRFS_IOCTL_MAGIC, 67, \
>> +				       struct btrfs_ioctl_scrub_fs_args)
>>
>>   #endif /* _UAPI_LINUX_BTRFS_H */
>> --
>> 2.37.3
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  2022-09-03  9:37     ` Qu Wenruo
@ 2022-09-03  9:49       ` Wang Yugui
  0 siblings, 0 replies; 22+ messages in thread
From: Wang Yugui @ 2022-09-03  9:49 UTC (permalink / raw)
  To: linux-btrfs

Hi,

> On 2022/9/3 17:25, Wang Yugui wrote:
> > Hi,
> >
> >> The new ioctls are to address the disadvantages of the existing
> >> btrfs_scrub_dev():
> >>
> >> a One thread per-device
> >>    This can cause multiple block groups to be marked read-only for scrub,
> >>    reducing available space temporarily.
> >>
> >>    This also causes higher CPU/IO usage.
> >>    For scrub, we should use the minimal amount of CPU and cause less
> >>    IO when possible.
> >>
> >> b Extra IO for RAID56
> >>    For data stripes, we will cause at least 2x IO if we run "btrfs scrub
> >>    start <mnt>".
> >>    1x from scrubbing the device of data stripe.
> >>    The other 1x from scrubbing the parity stripe.
> >>
> >>    This duplicated IO should definitely be avoided
> >>
> >> c Bad progress report for RAID56
> >>    We can not report any repaired P/Q bytes at all.
> >>
> >> The a and b will be addressed by the new one thread per-fs
> >> btrfs_scrub_fs ioctl.
> >
> > The CRC check in scrub is CPU sensitive, so we may still need multiple
> > threads, such as one thread per fs but with an additional thread pool
> > based on chunks?
> 
> This depends.
> 
> Scrub should be background work, which can already be affected by
> scheduling, and I don't think users would mind a 5% or 10% longer
> runtime for a several-TB fs.
> 
> Furthermore, if checksumming in a single thread is going to be a
> bottleneck, then I'd say your storage is already so fast that scrub
> duration is not your primary concern any more.

Scrub is sequential I/O, so HDDs are very fast too.
HDD*10 with HW RAID6 is very fast for scrub, about 2GB/s or more.

> Yes, it may be possible to offload the csum verification into multiple
> threads, like one thread per mirror/device, but I don't want to
> sacrifice readability for a very minor performance improvement.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/09/03



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH PoC 9/9] btrfs: scrub: implement recoverable sectors report for scrub_fs
  2022-09-03  8:19 ` [PATCH PoC 9/9] btrfs: scrub: implement recoverable sectors report " Qu Wenruo
@ 2022-09-03 11:22   ` kernel test robot
  2022-09-03 11:33   ` kernel test robot
  1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2022-09-03 11:22 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: llvm, kbuild-all

Hi Qu,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on linus/master v6.0-rc3 next-20220901]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: arm-randconfig-r032-20220903 (https://download.01.org/0day-ci/archive/20220903/202209031950.mB6jb2hL-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project c55b41d5199d2394dd6cdb8f52180d8b81d809d4)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/intel-lab-lkp/linux/commit/e6387ecfd7e78ac47fca972ef76f3286e6cd3900
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128
        git checkout e6387ecfd7e78ac47fca972ef76f3286e6cd3900
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash fs/btrfs/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/btrfs/scrub.c:5382:6: warning: variable 'nr_good' set but not used [-Wunused-but-set-variable]
           int nr_good = 0;
               ^
   1 warning generated.


vim +/nr_good +5382 fs/btrfs/scrub.c

  5376	
  5377	static void scrub_fs_recover_data(struct scrub_fs_ctx *sfctx, int sector_nr)
  5378	{
  5379		struct btrfs_fs_info *fs_info = sfctx->fs_info;
  5380		const bool has_csum = !!sfctx->sectors[sector_nr].csum;
  5381		bool mismatch_found = false;
> 5382		int nr_good = 0;
  5383		int io_fail = 0;
  5384		int mirror_nr;
  5385	
  5386		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
  5387			struct scrub_fs_sector *sector =
  5388				scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
  5389	
  5390			if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD)
  5391				nr_good++;
  5392			if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE))
  5393				io_fail++;
  5394		}
  5395	
  5396		if (has_csum) {
  5397			/*
  5398			 * There is at least one good copy, thus all the other
  5399			 * corrupted sectors can also be recovered.
  5400			 */
  5401			for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
  5402				struct scrub_fs_sector *sector =
  5403					scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
  5404	
  5405				if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD)
  5406					continue;
  5407				sector->flags |= SCRUB_FS_SECTOR_FLAG_RECOVERABLE;
  5408				sfctx->stat.data_recoverable += fs_info->sectorsize;
  5409			}
  5410	
  5411			/* Place holder for writeback */
  5412			return;
  5413		}
  5414	
  5415		/*
  5416		 * No datasum case, it's much harder.
  5417		 *
  5418		 * The idea is, we have to compare all the sectors to determine if they
  5419		 * match.
  5420		 *
  5421		 * Firstly rule out sectors which don't have extra working copies.
  5422		 */
  5423		if (sfctx->nr_copies - io_fail <= 1) {
  5424			sfctx->stat.data_nocsum_uncertain += fs_info->sectorsize;
  5425			return;
  5426		}
  5427	
  5428		/*
  5429		 * For now we can only handle one case: all the data we read matches
  5430		 * each other, or we consider all the readable copies uncertain.
  5431		 */
  5432		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies - 1; mirror_nr++) {
  5433			struct scrub_fs_sector *sector =
  5434				scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
  5435			struct scrub_fs_sector *next_sector;
  5436			int ret;
  5437	
  5438			/* The first sector has IO error, skip to the next run. */
  5439			if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE))
  5440				continue;
  5441	
  5442			next_sector = scrub_fs_find_next_working_mirror(sfctx,
  5443					sector_nr, mirror_nr);
  5444			/* We're already the last working copy, can break now. */
  5445			if (!next_sector)
  5446				break;
  5447	
  5448			ret = scrub_fs_memcmp_sectors(sfctx, sector, next_sector);
  5449			if (ret)
  5450				mismatch_found = true;
  5451		}
  5452	
  5453		/*
  5454		 * We have found mismatched contents, mark all those sectors
  5455		 * We have found mismatched contents; mark all those sectors
  5456		 * which don't have an IO error as uncertain.
  5457		if (mismatch_found)
  5458			sfctx->stat.data_nocsum_uncertain +=
  5459				(sfctx->nr_copies - io_fail) << fs_info->sectorsize_bits;
  5460	}
  5461	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH PoC 9/9] btrfs: scrub: implement recoverable sectors report for scrub_fs
  2022-09-03  8:19 ` [PATCH PoC 9/9] btrfs: scrub: implement recoverable sectors report " Qu Wenruo
  2022-09-03 11:22   ` kernel test robot
@ 2022-09-03 11:33   ` kernel test robot
  1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2022-09-03 11:33 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: llvm, kbuild-all

Hi Qu,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on linus/master v6.0-rc3 next-20220901]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: hexagon-randconfig-r045-20220902 (https://download.01.org/0day-ci/archive/20220903/202209031939.5FTgOh5V-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project c55b41d5199d2394dd6cdb8f52180d8b81d809d4)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/e6387ecfd7e78ac47fca972ef76f3286e6cd3900
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128
        git checkout e6387ecfd7e78ac47fca972ef76f3286e6cd3900
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash fs/btrfs/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/btrfs/scrub.c:5382:6: warning: variable 'nr_good' set but not used [-Wunused-but-set-variable]
           int nr_good = 0;
               ^
   1 warning generated.


vim +/nr_good +5382 fs/btrfs/scrub.c

  5376	
  5377	static void scrub_fs_recover_data(struct scrub_fs_ctx *sfctx, int sector_nr)
  5378	{
  5379		struct btrfs_fs_info *fs_info = sfctx->fs_info;
  5380		const bool has_csum = !!sfctx->sectors[sector_nr].csum;
  5381		bool mismatch_found = false;
> 5382		int nr_good = 0;
  5383		int io_fail = 0;
  5384		int mirror_nr;
  5385	
  5386		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
  5387			struct scrub_fs_sector *sector =
  5388				scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
  5389	
  5390			if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD)
  5391				nr_good++;
  5392			if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE))
  5393				io_fail++;
  5394		}
  5395	
  5396		if (has_csum) {
  5397			/*
  5398			 * There is at least one good copy, thus all the other
  5399			 * corrupted sectors can also be recovered.
  5400			 */
  5401			for (mirror_nr = 0; mirror_nr < sfctx->nr_copies; mirror_nr++) {
  5402				struct scrub_fs_sector *sector =
  5403					scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
  5404	
  5405				if (sector->flags & SCRUB_FS_SECTOR_FLAG_GOOD)
  5406					continue;
  5407				sector->flags |= SCRUB_FS_SECTOR_FLAG_RECOVERABLE;
  5408				sfctx->stat.data_recoverable += fs_info->sectorsize;
  5409			}
  5410	
  5411			/* Place holder for writeback */
  5412			return;
  5413		}
  5414	
  5415		/*
  5416		 * No datasum case, it's much harder.
  5417		 *
  5418		 * The idea is, we have to compare all the sectors to determine if they
  5419		 * match.
  5420		 *
  5421		 * Firstly rule out sectors which don't have extra working copies.
  5422		 */
  5423		if (sfctx->nr_copies - io_fail <= 1) {
  5424			sfctx->stat.data_nocsum_uncertain += fs_info->sectorsize;
  5425			return;
  5426		}
  5427	
  5428		/*
  5429		 * For now we can only handle one case: all the data we read matches
  5430		 * each other, or we consider all the readable copies uncertain.
  5431		 */
  5432		for (mirror_nr = 0; mirror_nr < sfctx->nr_copies - 1; mirror_nr++) {
  5433			struct scrub_fs_sector *sector =
  5434				scrub_fs_get_sector(sfctx, sector_nr, mirror_nr);
  5435			struct scrub_fs_sector *next_sector;
  5436			int ret;
  5437	
  5438			/* The first sector has IO error, skip to the next run. */
  5439			if (!(sector->flags & SCRUB_FS_SECTOR_FLAG_IO_DONE))
  5440				continue;
  5441	
  5442			next_sector = scrub_fs_find_next_working_mirror(sfctx,
  5443					sector_nr, mirror_nr);
  5444			/* We're already the last working copy, can break now. */
  5445			if (!next_sector)
  5446				break;
  5447	
  5448			ret = scrub_fs_memcmp_sectors(sfctx, sector, next_sector);
  5449			if (ret)
  5450				mismatch_found = true;
  5451		}
  5452	
  5453		/*
  5454		 * We have found mismatched contents, mark all those sectors
  5455		 * which doesn't have IO error as uncertain.
  5456		 */
  5457		if (mismatch_found)
  5458			sfctx->stat.data_nocsum_uncertain +=
  5459				(sfctx->nr_copies - io_fail) << fs_info->sectorsize_bits;
  5460	}
  5461	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  2022-09-03  8:19 ` [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls Qu Wenruo
  2022-09-03  9:25   ` Wang Yugui
@ 2022-09-03 11:47   ` kernel test robot
  2022-09-03 11:55       ` Qu Wenruo
  2022-09-09  4:17   ` Wang Yugui
  2 siblings, 1 reply; 22+ messages in thread
From: kernel test robot @ 2022-09-03 11:47 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: kbuild-all

Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on linus/master v6.0-rc3 next-20220901]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-randconfig-a004 (https://download.01.org/0day-ci/archive/20220903/202209031916.ybFIwbdf-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-5) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/8f7e2c15c08dc87518d12529e5b5cba0a42b5eb1
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128
        git checkout 8f7e2c15c08dc87518d12529e5b5cba0a42b5eb1
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from <command-line>:
>> ./usr/include/linux/btrfs.h:329:15: error: expected declaration specifiers or '...' before 'sizeof'
     329 | static_assert(sizeof(struct btrfs_scrub_fs_progress) == 256);
         |               ^~~~~~

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  2022-09-03 11:47   ` kernel test robot
@ 2022-09-03 11:55       ` Qu Wenruo
  0 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-03 11:55 UTC (permalink / raw)
  To: kernel test robot, Qu Wenruo, linux-btrfs; +Cc: kbuild-all



On 2022/9/3 19:47, kernel test robot wrote:
> Hi Qu,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on kdave/for-next]
> [also build test ERROR on linus/master v6.0-rc3 next-20220901]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
> config: x86_64-randconfig-a004 (https://download.01.org/0day-ci/archive/20220903/202209031916.ybFIwbdf-lkp@intel.com/config)
> compiler: gcc-11 (Debian 11.3.0-5) 11.3.0
> reproduce (this is a W=1 build):
>          # https://github.com/intel-lab-lkp/linux/commit/8f7e2c15c08dc87518d12529e5b5cba0a42b5eb1
>          git remote add linux-review https://github.com/intel-lab-lkp/linux
>          git fetch --no-tags linux-review Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128
>          git checkout 8f7e2c15c08dc87518d12529e5b5cba0a42b5eb1
>          # save the config file
>          mkdir build_dir && cp config build_dir/.config
>          make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
>
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
>
> All errors (new ones prefixed by >>):
>
>     In file included from <command-line>:
>>> ./usr/include/linux/btrfs.h:329:15: error: expected declaration specifiers or '...' before 'sizeof'
>       329 | static_assert(sizeof(struct btrfs_scrub_fs_progress) == 256);
>           |               ^~~~~~
>
Hi LKP guys,

Is this a false alert from your tools?

This seems solid to me, and my compiler doesn't give me any error about
this...
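
One thing I can imagine though (just a guess, not verified): the
usr/include header check builds each exported UAPI header as plain C,
where static_assert is only a macro from <assert.h>, and the kernel's
static_assert() wrapper from <linux/build_bug.h> is not exported. If
that's what trips the bot, the check would have to move out of the
UAPI header, or be spelled in a C89-safe way:

	/* kernel side, e.g. fs/btrfs/ioctl.c, not the exported header */
	static_assert(sizeof(struct btrfs_scrub_fs_progress) == 256);

	/* or, if it must stay in the UAPI header (sketch only) */
	struct btrfs_scrub_fs_progress_check {
		char size_is_256[(sizeof(struct btrfs_scrub_fs_progress)
				  == 256) ? 1 : -1];
	};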

Thanks,
Qu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH PoC 5/9] btrfs: scrub: add helpers to fulfill csum/extent_generation
  2022-09-03  8:19 ` [PATCH PoC 5/9] btrfs: scrub: add helpers to fulfill csum/extent_generation Qu Wenruo
@ 2022-09-03 12:19   ` kernel test robot
  0 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2022-09-03 12:19 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: llvm, kbuild-all

Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on next-20220901]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: arm-randconfig-r031-20220903 (https://download.01.org/0day-ci/archive/20220903/202209032021.ryUdsh1f-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project c55b41d5199d2394dd6cdb8f52180d8b81d809d4)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/intel-lab-lkp/linux/commit/02fbe0e7c98836561e983171c7cd0bcb8990915b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128
        git checkout 02fbe0e7c98836561e983171c7cd0bcb8990915b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "__aeabi_uldivmod" [fs/btrfs/btrfs.ko] undefined!
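
(__aeabi_uldivmod is the arm32 EABI helper for unsigned 64-bit
division/modulo; the kernel deliberately does not provide it, so a
plain '/' or '%' on u64 values fails at module link time on 32-bit
targets. The usual fix -- a guess here, since modpost does not point
at the offending expression -- is div_u64() from <linux/math64.h>, or
a shift when the divisor is a power of two like BTRFS_STRIPE_LEN:

	u64 nr_stripes;

	nr_stripes = len / BTRFS_STRIPE_LEN;		/* breaks arm32 */
	nr_stripes = div_u64(len, BTRFS_STRIPE_LEN);	/* links fine */

where len is a u64.)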

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [kbuild-all] Re: [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  2022-09-03 11:55       ` Qu Wenruo
  (?)
@ 2022-09-05  2:05       ` Chen, Rong A
  -1 siblings, 0 replies; 22+ messages in thread
From: Chen, Rong A @ 2022-09-05  2:05 UTC (permalink / raw)
  To: Qu Wenruo, kernel test robot, Qu Wenruo, linux-btrfs; +Cc: kbuild-all



On 9/3/2022 7:55 PM, Qu Wenruo wrote:
> 
> 
> On 2022/9/3 19:47, kernel test robot wrote:
>> Hi Qu,
>>
>> Thank you for the patch! Yet something to improve:
>>
>> [auto build test ERROR on kdave/for-next]
>> [also build test ERROR on linus/master v6.0-rc3 next-20220901]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use '--base' as documented in
>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>
>> url:    
>> https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128 
>>
>> base:   
>> https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
>> config: x86_64-randconfig-a004 
>> (https://download.01.org/0day-ci/archive/20220903/202209031916.ybFIwbdf-lkp@intel.com/config) 
>>
>> compiler: gcc-11 (Debian 11.3.0-5) 11.3.0
>> reproduce (this is a W=1 build):
>>          # 
>> https://github.com/intel-lab-lkp/linux/commit/8f7e2c15c08dc87518d12529e5b5cba0a42b5eb1 
>>
>>          git remote add linux-review 
>> https://github.com/intel-lab-lkp/linux
>>          git fetch --no-tags linux-review 
>> Qu-Wenruo/btrfs-scrub-introduce-a-new-family-of-ioctl-scrub_fs/20220903-162128 
>>
>>          git checkout 8f7e2c15c08dc87518d12529e5b5cba0a42b5eb1
>>          # save the config file
>>          mkdir build_dir && cp config build_dir/.config
>>          make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
>>
>> If you fix the issue, kindly add following tag where applicable
>> Reported-by: kernel test robot <lkp@intel.com>
>>
>> All errors (new ones prefixed by >>):
>>
>>     In file included from <command-line>:
>>>> ./usr/include/linux/btrfs.h:329:15: error: expected declaration 
>>>> specifiers or '...' before 'sizeof'
>>       329 | static_assert(sizeof(struct btrfs_scrub_fs_progress) == 256);
>>           |               ^~~~~~
>>
> Hi LKP guys,
> 
> Is this a false alert from your tools?
> 
> This seems solid to me, and my compiler doesn't give me any error about
> this...

Hi Qu,

Sorry for the noise. If you can't reproduce it with the linked config,
it's possible that the bot applied the patchset to the wrong base branch:

https://download.01.org/0day-ci/archive/20220903/202209031916.ybFIwbdf-lkp@intel.com/config


Best Regards,
Rong Chen

> 
> Thanks,
> Qu
> _______________________________________________
> kbuild-all mailing list -- kbuild-all@lists.01.org
> To unsubscribe send an email to kbuild-all-leave@lists.01.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  2022-09-03  8:19 ` [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls Qu Wenruo
  2022-09-03  9:25   ` Wang Yugui
  2022-09-03 11:47   ` kernel test robot
@ 2022-09-09  4:17   ` Wang Yugui
  2022-09-09  6:57     ` Qu Wenruo
  2 siblings, 1 reply; 22+ messages in thread
From: Wang Yugui @ 2022-09-09  4:17 UTC (permalink / raw)
  To: linux-btrfs

Hi,

> The new ioctls are to address the disadvantages of the existing
> btrfs_scrub_dev():
> 
> a One thread per-device
>   This can cause multiple block groups to be marked read-only for scrub,
>   reducing available space temporarily.
> 
>   This also causes higher CPU/IO usage.
>   For scrub, we should use the minimal amount of CPU and cause less
>   IO when possible.
> 
> b Extra IO for RAID56
>   For data stripes, we will cause at least 2x IO if we run "btrfs scrub
>   start <mnt>".
>   1x from scrubbing the device of data stripe.
>   The other 1x from scrubbing the parity stripe.
> 
>   This duplicated IO should definitely be avoided
> 
> c Bad progress report for RAID56
>   We cannot report any repaired P/Q bytes at all.
> 
> Points a and b will be addressed by the new one-thread-per-fs
> btrfs_scrub_fs ioctl.
> 
> Point c will be addressed by the new btrfs_scrub_fs_progress structure,
> which has better comments and classification for all errors.
> 
> This patch is only a skeleton for the new family of ioctls, will return
> -EOPNOTSUPP for now.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ioctl.c           |   6 ++
>  include/uapi/linux/btrfs.h | 173 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 179 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index fe0cc816b4eb..3df3bcdf06eb 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -5508,6 +5508,12 @@ long btrfs_ioctl(struct file *file, unsigned int
>  		return btrfs_ioctl_scrub_cancel(fs_info);
>  	case BTRFS_IOC_SCRUB_PROGRESS:
>  		return btrfs_ioctl_scrub_progress(fs_info, argp);
> +	case BTRFS_IOC_SCRUB_FS:
> +		return -EOPNOTSUPP;
> +	case BTRFS_IOC_SCRUB_FS_CANCEL:
> +		return -EOPNOTSUPP;
> +	case BTRFS_IOC_SCRUB_FS_PROGRESS:
> +		return -EOPNOTSUPP;

Could we add suspend/resume support for scrub on huge filesystems?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/09/09



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  2022-09-09  4:17   ` Wang Yugui
@ 2022-09-09  6:57     ` Qu Wenruo
  0 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-09-09  6:57 UTC (permalink / raw)
  To: Wang Yugui, linux-btrfs



On 2022/9/9 12:17, Wang Yugui wrote:
> Hi,
> 
>> The new ioctls are to address the disadvantages of the existing
>> btrfs_scrub_dev():
>>
>> a One thread per-device
>>    This can cause multiple block groups to be marked read-only for scrub,
>>    reducing available space temporarily.
>>
>>    This also causes higher CPU/IO usage.
>>    For scrub, we should use the minimal amount of CPU and cause less
>>    IO when possible.
>>
>> b Extra IO for RAID56
>>    For data stripes, we will cause at least 2x IO if we run "btrfs scrub
>>    start <mnt>".
>>    1x from scrubbing the device of data stripe.
>>    The other 1x from scrubbing the parity stripe.
>>
>>    This duplicated IO should definitely be avoided
>>
>> c Bad progress report for RAID56
>>    We cannot report any repaired P/Q bytes at all.
>>
>> Points a and b will be addressed by the new one-thread-per-fs
>> btrfs_scrub_fs ioctl.
>>
>> Point c will be addressed by the new btrfs_scrub_fs_progress structure,
>> which has better comments and classification for all errors.
>>
>> This patch is only a skeleton for the new family of ioctls, will return
>> -EOPNOTSUPP for now.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/ioctl.c           |   6 ++
>>   include/uapi/linux/btrfs.h | 173 +++++++++++++++++++++++++++++++++++++
>>   2 files changed, 179 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index fe0cc816b4eb..3df3bcdf06eb 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -5508,6 +5508,12 @@ long btrfs_ioctl(struct file *file, unsigned int
>>   		return btrfs_ioctl_scrub_cancel(fs_info);
>>   	case BTRFS_IOC_SCRUB_PROGRESS:
>>   		return btrfs_ioctl_scrub_progress(fs_info, argp);
>> +	case BTRFS_IOC_SCRUB_FS:
>> +		return -EOPNOTSUPP;
>> +	case BTRFS_IOC_SCRUB_FS_CANCEL:
>> +		return -EOPNOTSUPP;
>> +	case BTRFS_IOC_SCRUB_FS_PROGRESS:
>> +		return -EOPNOTSUPP;
> 
> Could we add suspend/resume for scrub when huge filesysem?

It's a PoC, don't expect those features yet.

It's there to prove the idea works, and to drive the discussion around
how to improve the RAID56 scrub situation.

It's not yet decided that we will go this way; we may end up adding
extra flags to the existing scrub ioctls instead of introducing a whole
new ioctl number.
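
If anyone wants to poke at the skeleton anyway, the userspace side
would look roughly like this (a sketch only: the args struct name and
layout are hypothetical here, since the PoC so far only defines the
ioctl numbers and the 256-byte progress structure, and the kernel
still answers -EOPNOTSUPP):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/btrfs.h>

	int main(void)
	{
		/* hypothetical args struct, zeroed for a fresh run */
		struct btrfs_ioctl_scrub_fs_args args = { 0 };
		int fd = open("/mnt/btrfs", O_RDONLY);

		if (fd < 0 || ioctl(fd, BTRFS_IOC_SCRUB_FS, &args) < 0)
			perror("BTRFS_IOC_SCRUB_FS");	/* -EOPNOTSUPP for now */
		return 0;
	}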

> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/09/09
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-09-09  6:57 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-03  8:19 [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs Qu Wenruo
2022-09-03  8:19 ` [PATCH PoC 1/9] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls Qu Wenruo
2022-09-03  9:25   ` Wang Yugui
2022-09-03  9:37     ` Qu Wenruo
2022-09-03  9:49       ` Wang Yugui
2022-09-03 11:47   ` kernel test robot
2022-09-03 11:55     ` Qu Wenruo
2022-09-05  2:05       ` [kbuild-all] " Chen, Rong A
2022-09-09  4:17   ` Wang Yugui
2022-09-09  6:57     ` Qu Wenruo
2022-09-03  8:19 ` [PATCH PoC 2/9] btrfs: scrub: introduce place holder for btrfs_scrub_fs() Qu Wenruo
2022-09-03  8:19 ` [PATCH PoC 3/9] btrfs: scrub: introduce a place holder helper scrub_fs_iterate_bgs() Qu Wenruo
2022-09-03  8:19 ` [PATCH PoC 4/9] btrfs: scrub: introduce place holder helper scrub_fs_block_group() Qu Wenruo
2022-09-03  8:19 ` [PATCH PoC 5/9] btrfs: scrub: add helpers to fulfill csum/extent_generation Qu Wenruo
2022-09-03 12:19   ` kernel test robot
2022-09-03  8:19 ` [PATCH PoC 6/9] btrfs: scrub: submit and wait for the read of each copy Qu Wenruo
2022-09-03  8:19 ` [PATCH PoC 7/9] btrfs: scrub: implement metadata verification code for scrub_fs Qu Wenruo
2022-09-03  8:19 ` [PATCH PoC 8/9] btrfs: scrub: implement data " Qu Wenruo
2022-09-03  8:19 ` [PATCH PoC 9/9] btrfs: scrub: implement recoverable sectors report " Qu Wenruo
2022-09-03 11:22   ` kernel test robot
2022-09-03 11:33   ` kernel test robot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.