* [PATCH RFC v2 01/12] btrfs: introduce a pure data checksum checking helper
       [not found] <cover.1651043617.git.wqu@suse.com>
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-28 13:26   ` Christoph Hellwig
  2022-04-27  7:18 ` [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting Qu Wenruo
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

Although we have several data csum verification code paths, we never
have a function that just verifies the checksum of one sector.

Function check_data_csum() does extra work for error reporting, thus it
requires a lot of extra parameters like file offset, bio_offset, etc.

Function btrfs_verify_data_csum() is even worse, as it utilizes the
page Checked flag, which means it can not be used for direct IO pages.

Here we introduce a new helper, btrfs_check_data_sector(), which only
accepts a sector inside a page and a pointer to the expected checksum.

We use this function to implement check_data_csum(), and export it for
the incoming patches.
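
Below is a minimal userspace C sketch of the idea, assuming a plain
CRC32 loop as a stand-in for the configured csum algorithm; crc32_sw()
and check_one_sector() are illustrative names, not the btrfs API:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of a "pure" per-sector checksum check: hash exactly
 * one sector and compare it with the expected checksum, nothing else.
 * The real helper uses fs_info->csum_shash and kmap instead.
 */
static uint32_t crc32_sw(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xffffffffu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int b = 0; b < 8; b++)
			crc = (crc >> 1) ^ (0xedb88320u & (0u - (crc & 1)));
	}
	return ~crc;
}

/* Return 0 if the sector matches the expected checksum, -1 otherwise. */
static int check_one_sector(const uint8_t *sector, size_t sectorsize,
			    const uint8_t *csum_expected, size_t csum_size)
{
	uint8_t csum[sizeof(uint32_t)];
	uint32_t crc = crc32_sw(sector, sectorsize);

	memcpy(csum, &crc, sizeof(crc));
	if (csum_size > sizeof(csum))
		csum_size = sizeof(csum);
	return memcmp(csum, csum_expected, csum_size) ? -1 : 0;
}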

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c | 10 ++++-----
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/inode.c       | 50 +++++++++++++++++++++++++++++++-----------
 3 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 8fda38a58706..69c060dc024c 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -152,7 +152,6 @@ static int check_compressed_csum(struct btrfs_inode *inode, struct bio *bio,
 	const u32 sectorsize = fs_info->sectorsize;
 	struct page *page;
 	unsigned int i;
-	char *kaddr;
 	u8 csum[BTRFS_CSUM_SIZE];
 	struct compressed_bio *cb = bio->bi_private;
 	u8 *cb_sum = cb->sums;
@@ -175,12 +174,11 @@ static int check_compressed_csum(struct btrfs_inode *inode, struct bio *bio,
 		/* Hash through the page sector by sector */
 		for (pg_offset = 0; pg_offset < bytes_left;
 		     pg_offset += sectorsize) {
-			kaddr = kmap_atomic(page);
-			crypto_shash_digest(shash, kaddr + pg_offset,
-					    sectorsize, csum);
-			kunmap_atomic(kaddr);
+			int ret;
 
-			if (memcmp(&csum, cb_sum, csum_size) != 0) {
+			ret = btrfs_check_data_sector(fs_info, page, pg_offset,
+						      cb_sum);
+			if (ret) {
 				btrfs_print_data_csum_error(inode, disk_start,
 						csum, cb_sum, cb->mirror_num);
 				if (btrfs_bio(bio)->device)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dd23f78664f1..296fc2052aa9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3252,6 +3252,8 @@ u64 btrfs_file_extent_end(const struct btrfs_path *path);
 /* inode.c */
 void btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 			   int mirror_num, unsigned long bio_flags);
+int btrfs_check_data_sector(struct btrfs_fs_info *fs_info, struct page *page,
+			    u32 pgoff, u8 *csum_expected);
 unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
 				    u32 bio_offset, struct page *page,
 				    u64 start, u64 end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 507e3a0055e5..b4cfd78f50e8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3299,6 +3299,37 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 				       finish_ordered_fn, uptodate);
 }
 
+/*
+ * Just verify one sector of csum, without any extra handling.
+ *
+ * Functions like btrfs_verify_data_csum() have extra handling like setting
+ * page flags which is not suitable for call sites like direct IO.
+ *
+ * This pure csum checking allows us to use it at call sites which need
+ * to handle pages from both buffered and direct IO.
+ */
+int btrfs_check_data_sector(struct btrfs_fs_info *fs_info, struct page *page,
+			    u32 pgoff, u8 *csum_expected)
+{
+	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+	char *kaddr;
+	const u32 len = fs_info->sectorsize;
+	const u32 csum_size = fs_info->csum_size;
+	u8 csum[BTRFS_CSUM_SIZE];
+
+	ASSERT(pgoff + len <= PAGE_SIZE);
+
+	kaddr = kmap_atomic(page);
+	shash->tfm = fs_info->csum_shash;
+
+	crypto_shash_digest(shash, kaddr + pgoff, len, csum);
+	kunmap_atomic(kaddr);
+
+	if (memcmp(csum, csum_expected, csum_size))
+		return -EIO;
+	return 0;
+}
+
 /*
  * check_data_csum - verify checksum of one sector of uncompressed data
  * @inode:	inode
@@ -3309,15 +3340,16 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
  * @start:	logical offset in the file
  *
  * The length of such check is always one sector size.
+ *
+ * When a csum mismatch is detected, we will also report the error and fill
+ * the corrupted range with zeros (thus it needs the extra parameters).
  */
 static int check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
 			   u32 bio_offset, struct page *page, u32 pgoff,
 			   u64 start)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
-	char *kaddr;
-	u32 len = fs_info->sectorsize;
+	const u32 len = fs_info->sectorsize;
 	const u32 csum_size = fs_info->csum_size;
 	unsigned int offset_sectors;
 	u8 *csum_expected;
@@ -3328,17 +3360,9 @@ static int check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
 	offset_sectors = bio_offset >> fs_info->sectorsize_bits;
 	csum_expected = ((u8 *)bbio->csum) + offset_sectors * csum_size;
 
-	kaddr = kmap_atomic(page);
-	shash->tfm = fs_info->csum_shash;
-
-	crypto_shash_digest(shash, kaddr + pgoff, len, csum);
-	kunmap_atomic(kaddr);
-
-	if (memcmp(csum, csum_expected, csum_size))
-		goto zeroit;
+	if (!btrfs_check_data_sector(fs_info, page, pgoff, csum_expected))
+		return 0;
 
-	return 0;
-zeroit:
 	btrfs_print_data_csum_error(BTRFS_I(inode), start, csum, csum_expected,
 				    bbio->mirror_num);
 	if (bbio->device)
-- 
2.36.0



* [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting
       [not found] <cover.1651043617.git.wqu@suse.com>
  2022-04-27  7:18 ` [PATCH RFC v2 01/12] btrfs: introduce a pure data checksum checking helper Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-28  5:16   ` Qu Wenruo
  2022-04-28 13:32   ` Christoph Hellwig
  2022-04-27  7:18 ` [PATCH RFC v2 03/12] btrfs: remove duplicated parameters from submit_data_read_repair() Qu Wenruo
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

Lower level bio mapping, from the driver to dm, and even btrfs chunk
mapping, can modify bio::bi_iter.

This prevents us from doing two things:

- Iterate the bio range of a cloned bio
  This is only utilized by direct IO, thus it's already using
  btrfs_bio::iter, which is populated in btrfs_bio_clone_partial().

- Grab the original logical bytenr of a bio
  This will be utilized by incoming read repair patches.

So to make sure every submitted btrfs_bio owns a proper iter, this
patch assigns btrfs_bio::iter at the following call sites (a short
sketch of the pattern follows the list):

- btrfs_submit_dio_bio()
- submit_one_bio()
- submit_compressed_bio()
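
A minimal userspace C sketch of the "snapshot the iterator before
submitting" pattern; struct request and the function names below are
illustrative, not the kernel API:

#include <stdio.h>

struct request {
	unsigned int cur;    /* advanced by the lower layer (like bi_iter) */
	unsigned int saved;  /* snapshot at submit time (like btrfs_bio::iter) */
	unsigned int size;
};

/* The lower layer consumes the request and advances the iterator. */
static void lower_layer_consume(struct request *req)
{
	req->cur += req->size;	/* the original offset is lost after this */
}

/* Snapshot the iterator right before handing the request down. */
static void submit(struct request *req)
{
	req->saved = req->cur;
	lower_layer_consume(req);
}

int main(void)
{
	struct request req = { .cur = 4096, .saved = 0, .size = 8192 };

	submit(&req);
	printf("current=%u saved=%u\n", req.cur, req.saved);
	return 0;
}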

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c | 2 ++
 fs/btrfs/extent_io.c   | 6 ++++++
 fs/btrfs/inode.c       | 2 ++
 3 files changed, 10 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 69c060dc024c..559bf53beaed 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -428,6 +428,8 @@ static blk_status_t submit_compressed_bio(struct btrfs_fs_info *fs_info,
 	blk_status_t ret;
 
 	ASSERT(bio->bi_iter.bi_size);
+	/* Check submit_one_bio() for the reason */
+	btrfs_bio(bio)->iter = bio->bi_iter;
 	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
 	if (ret)
 		return ret;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 07888cce3bce..a3962df30603 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -186,6 +186,12 @@ static void submit_one_bio(struct bio *bio, int mirror_num, unsigned long bio_fl
 	/* Caller should ensure the bio has at least some range added */
 	ASSERT(bio->bi_iter.bi_size);
 
+	/*
+	 * Save current bi_iter into btrfs_bio::iter, as lower level (including
+	 * btrfs chunk mapping) can modify bi_iter, preventing us from using
+	 * bi_iter to iterate cloned bio or grab the original logical bytenr.
+	 */
+	btrfs_bio(bio)->iter = bio->bi_iter;
 	if (is_data_inode(tree->private_data))
 		btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
 					    bio_flags);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b4cfd78f50e8..1b596de0c4e9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7923,6 +7923,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 	bool write = btrfs_op(bio) == BTRFS_MAP_WRITE;
 	blk_status_t ret;
 
+	btrfs_bio(bio)->iter = bio->bi_iter;
+
 	/* Check btrfs_submit_bio_hook() for rules about async submit. */
 	if (async_submit)
 		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers);
-- 
2.36.0



* [PATCH RFC v2 03/12] btrfs: remove duplicated parameters from submit_data_read_repair()
       [not found] <cover.1651043617.git.wqu@suse.com>
  2022-04-27  7:18 ` [PATCH RFC v2 01/12] btrfs: introduce a pure data checksum checking helper Qu Wenruo
  2022-04-27  7:18 ` [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-28 13:32   ` Christoph Hellwig
  2022-04-27  7:18 ` [PATCH RFC v2 04/12] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors Qu Wenruo
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

The function submit_data_read_repair() is only called in the buffered
data read path, thus those parameters can be calculated from the bvec
directly:

- start
  start = page_offset(bvec->bv_page) + bvec->bv_offset;

- end
  end = start + bvec->bv_len - 1;

- page
  page = bvec->bv_page;

- pgoff
  pgoff = bvec->bv_offset;

Thus we can safely replace those 4 parameters with just one bio_vec.
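
A small userspace C sketch of the arithmetic above, using a page index
as a stand-in for the struct page pointer; all names are illustrative:

#include <stdio.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/* Simplified bio_vec: the page index replaces the struct page pointer. */
struct fake_bvec {
	uint64_t bv_page_index;
	uint32_t bv_len;
	uint32_t bv_offset;
};

int main(void)
{
	struct fake_bvec bvec = { .bv_page_index = 10, .bv_len = 4096, .bv_offset = 0 };
	uint64_t page_offset = bvec.bv_page_index * SKETCH_PAGE_SIZE;

	/* The four parameters the patch derives from the bvec: */
	uint64_t start = page_offset + bvec.bv_offset;
	uint64_t end = start + bvec.bv_len - 1;
	uint32_t pgoff = bvec.bv_offset;

	printf("start=%llu end=%llu pgoff=%u\n",
	       (unsigned long long)start, (unsigned long long)end, pgoff);
	return 0;
}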

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a3962df30603..1f273ef966bd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2734,13 +2734,16 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 
 static blk_status_t submit_data_read_repair(struct inode *inode,
 					    struct bio *failed_bio,
-					    u32 bio_offset, struct page *page,
-					    unsigned int pgoff,
-					    u64 start, u64 end,
+					    u32 bio_offset,
+					    const struct bio_vec *bvec,
 					    int failed_mirror,
 					    unsigned int error_bitmap)
 {
+	const unsigned int pgoff = bvec->bv_offset;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct page *page = bvec->bv_page;
+	const u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
+	const u64 end = start + bvec->bv_len - 1;
 	const u32 sectorsize = fs_info->sectorsize;
 	const int nr_bits = (end + 1 - start) >> fs_info->sectorsize_bits;
 	int error = 0;
@@ -3100,10 +3103,8 @@ static void end_bio_extent_readpage(struct bio *bio)
 			 * submit_data_read_repair() will handle all the good
 			 * and bad sectors, we just continue to the next bvec.
 			 */
-			submit_data_read_repair(inode, bio, bio_offset, page,
-						start - page_offset(page),
-						start, end, mirror,
-						error_bitmap);
+			submit_data_read_repair(inode, bio, bio_offset, bvec,
+						mirror, error_bitmap);
 
 			ASSERT(bio_offset + len > bio_offset);
 			bio_offset += len;
-- 
2.36.0



* [PATCH RFC v2 04/12] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
       [not found] <cover.1651043617.git.wqu@suse.com>
                   ` (2 preceding siblings ...)
  2022-04-27  7:18 ` [PATCH RFC v2 03/12] btrfs: remove duplicated parameters from submit_data_read_repair() Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-28 13:37   ` Christoph Hellwig
  2022-04-27  7:18 ` [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair Qu Wenruo
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces a new infrastructure, btrfs_read_repair_ctrl, to
record exactly which sectors are corrupted at read time, and to hold
other misc members for read-time repair.

The new structure will live on the stack of end_bio_extent_readpage().

This increases stack memory usage (by around 100 bytes), but
considering that end_bio_extent_readpage() is the highest level endio
function and is executed in a workqueue, the extra stack usage should
be OK.

Currently we only allocate the two bitmaps and initialize various
members; no real repair work is done yet.
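
A userspace C sketch of the bitmap sizing and allocation-failure
handling described above; calloc() stands in for bitmap_alloc(GFP_NOFS)
and all names are illustrative, not the kernel API:

#include <stdlib.h>
#include <limits.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct repair_ctrl_sketch {
	unsigned long *cur_bad;		/* one bit per sector of the failed bio */
	unsigned long *prev_bad;	/* bad sectors of the previous mirror */
	unsigned int nbits;
};

static int repair_ctrl_init(struct repair_ctrl_sketch *ctrl,
			    unsigned int bio_size, unsigned int sectorsize_bits)
{
	unsigned int nbits = bio_size >> sectorsize_bits;
	size_t nlongs = (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

	ctrl->nbits = nbits;
	ctrl->cur_bad = calloc(nlongs, sizeof(unsigned long));
	ctrl->prev_bad = calloc(nlongs, sizeof(unsigned long));
	if (!ctrl->cur_bad || !ctrl->prev_bad) {
		/* Like the patch: on allocation failure, give up on repair. */
		free(ctrl->cur_bad);
		free(ctrl->prev_bad);
		ctrl->cur_bad = ctrl->prev_bad = NULL;
		return -1;
	}
	return 0;
}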

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 86 ++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/extent_io.h | 46 ++++++++++++++++++++++++
 2 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1f273ef966bd..6304f694c8d6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2732,7 +2732,79 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 		btrfs_subpage_end_reader(fs_info, page, start, len);
 }
 
-static blk_status_t submit_data_read_repair(struct inode *inode,
+static int read_repair_add_sector(struct inode *inode,
+				  struct btrfs_read_repair_ctrl *ctrl,
+				  struct bio *failed_bio, u32 bio_offset)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+	if (!ctrl->initialized) {
+		const u32 sectorsize = fs_info->sectorsize;
+
+		ASSERT(ctrl->cur_bad_bitmap == NULL &&
+		       ctrl->prev_bad_bitmap == NULL);
+		/*
+		 * failed_bio->bi_iter is not reliable at endio time, thus we
+		 * must rely on btrfs_bio::iter to grab the original logical
+		 * bytenr.
+		 */
+		ASSERT(btrfs_bio(failed_bio)->iter.bi_size);
+
+		ctrl->initialized = true;
+		ctrl->failed_bio = failed_bio;
+		ctrl->logical = btrfs_bio(failed_bio)->iter.bi_sector <<
+				SECTOR_SHIFT;
+		ctrl->bio_size = btrfs_bio(failed_bio)->iter.bi_size;
+		ASSERT(ctrl->bio_size);
+		ASSERT(IS_ALIGNED(ctrl->bio_size, fs_info->sectorsize));
+		ctrl->error = false;
+		ctrl->inode = inode;
+		ctrl->init_mirror = btrfs_bio(failed_bio)->mirror_num;
+		ctrl->num_copies = btrfs_num_copies(fs_info, ctrl->logical,
+						    sectorsize);
+
+		ctrl->cur_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
+					fs_info->sectorsize_bits, GFP_NOFS);
+		ctrl->prev_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
+					fs_info->sectorsize_bits, GFP_NOFS);
+		/* Just set the error bit, so we will never try repair */
+		if (!ctrl->cur_bad_bitmap || !ctrl->prev_bad_bitmap) {
+			kfree(ctrl->cur_bad_bitmap);
+			kfree(ctrl->prev_bad_bitmap);
+			ctrl->cur_bad_bitmap = NULL;
+			ctrl->prev_bad_bitmap = NULL;
+			ctrl->error = true;
+		}
+	}
+	/*
+	 * We failed to allocate memory for bitmaps, some range has already been
+	 * marked error, no need to try repair anymore.
+	 */
+	if (ctrl->error)
+		return -ENOMEM;
+
+	ASSERT(ctrl->inode);
+	ASSERT(ctrl->inode == inode);
+	set_bit(bio_offset >> fs_info->sectorsize_bits, ctrl->cur_bad_bitmap);
+	return 0;
+}
+
+static void read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
+{
+	if (!ctrl->initialized)
+		return;
+
+	kfree(ctrl->cur_bad_bitmap);
+	kfree(ctrl->prev_bad_bitmap);
+	ctrl->cur_bad_bitmap = NULL;
+	ctrl->prev_bad_bitmap = NULL;
+	ctrl->initialized = false;
+	ctrl->error = false;
+	ctrl->failed_bio = NULL;
+}
+
+static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
+					    struct inode *inode,
 					    struct bio *failed_bio,
 					    u32 bio_offset,
 					    const struct bio_vec *bvec,
@@ -2779,6 +2851,12 @@ static blk_status_t submit_data_read_repair(struct inode *inode,
 			goto next;
 		}
 
+		/*
+		 * We currently only set the bitmap, no real repair, thus can
+		 * ignore the error for now.
+		 */
+		read_repair_add_sector(inode, ctrl, failed_bio,
+				       bio_offset + offset);
 		ret = btrfs_repair_one_sector(inode, failed_bio,
 				bio_offset + offset,
 				page, pgoff + offset, start + offset,
@@ -3015,6 +3093,7 @@ static struct extent_buffer *find_extent_buffer_readpage(
  */
 static void end_bio_extent_readpage(struct bio *bio)
 {
+	struct btrfs_read_repair_ctrl ctrl = { 0 };
 	struct bio_vec *bvec;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	struct extent_io_tree *tree, *failure_tree;
@@ -3103,8 +3182,8 @@ static void end_bio_extent_readpage(struct bio *bio)
 			 * submit_data_read_repair() will handle all the good
 			 * and bad sectors, we just continue to the next bvec.
 			 */
-			submit_data_read_repair(inode, bio, bio_offset, bvec,
-						mirror, error_bitmap);
+			submit_data_read_repair(&ctrl, inode, bio, bio_offset,
+						bvec, mirror, error_bitmap);
 
 			ASSERT(bio_offset + len > bio_offset);
 			bio_offset += len;
@@ -3149,6 +3228,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 	}
 	/* Release the last extent */
 	endio_readpage_release_extent(&processed, NULL, 0, 0, false);
+	read_repair_finish(&ctrl);
 	btrfs_bio_free_csum(bbio);
 	bio_put(bio);
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b390ec79f9a8..eff008ba194f 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -7,6 +7,7 @@
 #include <linux/refcount.h>
 #include <linux/fiemap.h>
 #include <linux/btrfs_tree.h>
+#include <linux/bitmap.h>
 #include "ulist.h"
 
 /*
@@ -102,6 +103,51 @@ struct extent_buffer {
 #endif
 };
 
+/* Structure for data read time repair. */
+struct btrfs_read_repair_ctrl {
+	struct inode *inode;
+
+	/* The initial failed bio, we will grab page/pgoff from it */
+	struct bio *failed_bio;
+
+	/* Currently bad bitmap. */
+	unsigned long *cur_bad_bitmap;
+
+	/*
+	 * Bad bitmap for previous mirror, the diff between
+	 * @cur_bad_bitmap and @prev_bad_bitmap shows the
+	 * repaired range, that will be used to writeback
+	 * the repaired data to corrupted mirror.
+	 */
+	unsigned long *prev_bad_bitmap;
+
+	/* The logical bytenr of the original bio. */
+	u64 logical;
+
+	/* File offset of the original bio. */
+	u64 file_offset;
+
+	/* The original bio size for the whole read. */
+	u32 bio_size;
+
+	/* Current mirror number we're reading from. */
+	u8 cur_mirror;
+
+	/* Initial mirror number we hit the first error. */
+	u8 init_mirror;
+
+	/* How many copies we have. */
+	u8 num_copies;
+
+	/*
+	 * Set if we hit a fatal error during repair; if so we will not try
+	 * any repair but directly mark all existing and future bad ranges
+	 * as error.
+	 */
+	bool error;
+	bool initialized;
+};
+
 /*
  * Structure to record how many bytes and which ranges are set/cleared
  */
-- 
2.36.0



* [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair
       [not found] <cover.1651043617.git.wqu@suse.com>
                   ` (3 preceding siblings ...)
  2022-04-27  7:18 ` [PATCH RFC v2 04/12] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-28  5:20   ` Qu Wenruo
  2022-04-27  7:18 ` [PATCH RFC v2 06/12] btrfs: introduce a helper to repair from one mirror Qu Wenruo
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

The new helper, read_repair_bio_add_sector(), grabs the page and page
offset, and queues the sector into btrfs_read_repair_ctrl::read_bios
for later submission.
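
A small C sketch of the merge decision made by the queueing helper:
append to the tail bio only if the new sector is logically contiguous
with it, otherwise a new bio would be allocated. Types and names are
illustrative, not the kernel API:

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

struct sketch_bio {
	uint64_t logical;	/* start of the bio, in bytes */
	uint32_t size;		/* bytes added so far */
};

static bool can_merge_sector(const struct sketch_bio *tail,
			     uint64_t bio_logical_start,
			     uint32_t sector_nr, uint32_t sectorsize)
{
	uint64_t sector_logical = bio_logical_start +
				  (uint64_t)sector_nr * sectorsize;

	/* Contiguous with the tail bio: just add another page segment. */
	return tail != NULL && tail->logical + tail->size == sector_logical;
}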

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 107 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h |   6 +++
 2 files changed, 113 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6304f694c8d6..fbed78ffe8e1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2732,6 +2732,110 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 		btrfs_subpage_end_reader(fs_info, page, start, len);
 }
 
+static struct page *read_repair_get_sector(struct btrfs_read_repair_ctrl *ctrl,
+					   int sector_nr, unsigned int *pgoff)
+{
+	const struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	const u32 target_offset = sector_nr << fs_info->sectorsize_bits;
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	u32 offset = 0;
+
+	ASSERT(pgoff);
+	ASSERT((sector_nr << fs_info->sectorsize_bits) < ctrl->bio_size);
+
+	/*
+	 * This is definitely not efficient, but I don't have a better way
+	 * to grab a specified bvec from a bio directly.
+	 */
+	__bio_for_each_segment(bvec, ctrl->failed_bio, iter,
+			       btrfs_bio(ctrl->failed_bio)->iter) {
+		if (target_offset - offset < bvec.bv_len) {
+			*pgoff = bvec.bv_offset + (target_offset - offset);
+			return bvec.bv_page;
+		}
+		offset += bvec.bv_len;
+	}
+	return NULL;
+}
+
+static void read_repair_end_bio(struct bio *bio)
+{
+	struct btrfs_read_repair_ctrl *ctrl = bio->bi_private;
+	const struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	struct bvec_iter_all iter_all;
+	struct bio_vec *bvec;
+	u64 logical = btrfs_bio(bio)->iter.bi_sector << SECTOR_SHIFT;
+	u32 offset = 0;
+	bool uptodate = (bio->bi_status == BLK_STS_OK);
+
+	/* We should not have csum in bbio */
+	ASSERT(!btrfs_bio(bio)->csum);
+	bio_for_each_segment_all(bvec, bio, iter_all) {
+		/*
+		 * If we have a successful read, clear the error bit.
+		 * In read_repair_finish(), we will re-check the csum
+		 * (if exists) later.
+		 */
+		if (uptodate)
+			clear_bit((logical + offset - ctrl->logical) >>
+				  fs_info->sectorsize_bits,
+				  ctrl->cur_bad_bitmap);
+		atomic_sub(bvec->bv_len, &ctrl->io_bytes);
+		wake_up(&ctrl->io_wait);
+		offset += bvec->bv_len;
+	}
+	bio_put(bio);
+}
+
+/* Add a sector into the read repair bios list for later submission */
+static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
+				       int sector_nr)
+{
+	const struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	struct page *page;
+	int pgoff;
+	struct bio *bio;
+	int ret;
+
+	page = read_repair_get_sector(ctrl, sector_nr, &pgoff);
+	ASSERT(page);
+
+	/* Check if the sector can be added to the last bio */
+	if (!bio_list_empty(&ctrl->read_bios)) {
+		bio = ctrl->read_bios.tail;
+		if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) + bio->bi_iter.bi_size ==
+		    ctrl->logical + (sector_nr << fs_info->sectorsize_bits))
+			goto add;
+	}
+	/*
+	 * Here we want to know the logical bytenr at endio time, so we can
+	 * update the bitmap.
+	 * Unfortunately our bi_private will be used, and bi_iter is not
+	 * reliable, thus we have to alloc a btrfs_bio even if we just want
+	 * the logical bytenr.
+	 */
+	bio = btrfs_bio_alloc(BIO_MAX_VECS);
+	/* It's backed by mempool, thus should not fail */
+	ASSERT(bio);
+
+	bio->bi_opf = REQ_OP_READ;
+	bio->bi_iter.bi_sector = ((sector_nr << fs_info->sectorsize_bits) +
+				  ctrl->logical) >> SECTOR_SHIFT;
+	bio->bi_private = ctrl;
+	bio->bi_end_io = read_repair_end_bio;
+	bio_list_add(&ctrl->read_bios, bio);
+
+add:
+	ret = bio_add_page(bio, page, fs_info->sectorsize, pgoff);
+	/*
+	 * We allocated the read bio with enough bvecs to contain
+	 * the original bio, thus it should not fail to add a sector.
+	 */
+	ASSERT(ret == fs_info->sectorsize);
+	atomic_add(fs_info->sectorsize, &ctrl->io_bytes);
+}
+
 static int read_repair_add_sector(struct inode *inode,
 				  struct btrfs_read_repair_ctrl *ctrl,
 				  struct bio *failed_bio, u32 bio_offset)
@@ -2762,6 +2866,9 @@ static int read_repair_add_sector(struct inode *inode,
 		ctrl->init_mirror = btrfs_bio(failed_bio)->mirror_num;
 		ctrl->num_copies = btrfs_num_copies(fs_info, ctrl->logical,
 						    sectorsize);
+		init_waitqueue_head(&ctrl->io_wait);
+		bio_list_init(&ctrl->read_bios);
+		atomic_set(&ctrl->io_bytes, 0);
 
 		ctrl->cur_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
 					fs_info->sectorsize_bits, GFP_NOFS);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index eff008ba194f..4904229ee73a 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -121,6 +121,12 @@ struct btrfs_read_repair_ctrl {
 	 */
 	unsigned long *prev_bad_bitmap;
 
+	struct bio_list read_bios;
+
+	wait_queue_head_t io_wait;
+
+	atomic_t io_bytes;
+
 	/* The logical bytenr of the original bio. */
 	u64 logical;
 
-- 
2.36.0



* [PATCH RFC v2 06/12] btrfs: introduce a helper to repair from one mirror
       [not found] <cover.1651043617.git.wqu@suse.com>
                   ` (4 preceding siblings ...)
  2022-04-27  7:18 ` [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-27  7:18 ` [PATCH RFC v2 07/12] btrfs: allow btrfs read repair to submit all writes in one go Qu Wenruo
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

The new helper, read_repair_from_one_mirror(), will repair the data by
the following steps (a small decision-table sketch follows the list):

- Assemble a bio list for all corrupted sectors
  During the procedure, we will try to merge as many sectors as possible
  into one bio.
  This is different from the old behavior, which submits each sector
  using a new bio.

  This will reduce unnecessary calls into the bio submission hooks.

- Submit the bios in the read_bios list and wait for them to finish
  Here we don't want to waste time re-searching the csums, so we
  introduce a new flag, EXTENT_BIO_SKIPCSUM, for
  btrfs_submit_data_bio() to skip the csum lookup.

- Each successful read will clear the bit in ctrl::cur_bad_bitmap

- Verify each sector of the newly read data
  We have several different combinations:

  * The read failed for one sector
    We just keep the bit in @cur_bad_bitmap and leave it for the next
    mirror.

  * The read succeeded, and the original bio has no data checksum
    We consider this a win, clear the error bit and update the page
    status.

  * The read succeeded, but the csum still mismatches
    Leave the error bit for the next mirror.

  * The read succeeded, and the csum matches
    Clear the error bit and update the page status.
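
A small C sketch of the per-sector decision table above, applied after
the repair reads for one mirror have completed; read_ok reflects what
the endio handler saw, and the csum flags stand in for the
btrfs_check_data_sector() result. Names are illustrative only:

#include <stdbool.h>

enum sector_verdict {
	SECTOR_STILL_BAD,	/* keep the bit set for the next mirror */
	SECTOR_REPAIRED,	/* clear the bit, write back to the bad mirror */
};

static enum sector_verdict verdict_for_sector(bool read_ok, bool has_csum,
					      bool csum_matches)
{
	if (!read_ok)
		return SECTOR_STILL_BAD;
	if (has_csum && !csum_matches)
		return SECTOR_STILL_BAD;
	/* Read succeeded and either no csum exists or it matches. */
	return SECTOR_REPAIRED;
}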

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 117 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h |   1 +
 fs/btrfs/inode.c     |   2 +-
 3 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fbed78ffe8e1..7db6800cba31 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2732,6 +2732,11 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 		btrfs_subpage_end_reader(fs_info, page, start, len);
 }
 
+static int get_prev_mirror(int cur_mirror, int num_copy)
+{
+	return (cur_mirror - 1 <= 0) ? (num_copy) : cur_mirror - 1;
+}
+
 static struct page *read_repair_get_sector(struct btrfs_read_repair_ctrl *ctrl,
 					   int sector_nr, unsigned int *pgoff)
 {
@@ -2836,6 +2841,118 @@ static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 	atomic_add(fs_info->sectorsize, &ctrl->io_bytes);
 }
 
+static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
+					struct inode *inode, int mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	const int nbits = ctrl->bio_size >> fs_info->sectorsize_bits;
+	const u64 failed_logical = ctrl->failed_bio->bi_iter.bi_sector <<
+				   SECTOR_SHIFT;
+	const u32 sectorsize = fs_info->sectorsize;
+	struct bio *bio;
+	int bit;
+
+	/* We shouldn't have any pending read */
+	ASSERT(bio_list_size(&ctrl->read_bios) == 0 &&
+	       atomic_read(&ctrl->io_bytes) == 0);
+
+	/*
+	 * @cur_bad_bitmap contains the corrupted sectors, save it to
+	 * @prev_bad_bitmap.
+	 * Now @cur_bad_bitmap is our workspace bitmap.
+	 */
+	bitmap_copy(ctrl->prev_bad_bitmap, ctrl->cur_bad_bitmap, nbits);
+
+	/* Queue all bad sectors into our read_bios list */
+	for_each_set_bit(bit, ctrl->prev_bad_bitmap, nbits)
+		read_repair_bio_add_sector(ctrl, bit);
+
+	/* Submit all bios in read_bios and wait for them to finish */
+	for (bio = bio_list_pop(&ctrl->read_bios); bio;
+	     bio = bio_list_pop(&ctrl->read_bios)) {
+		blk_status_t ret;
+
+		btrfs_bio(bio)->iter = bio->bi_iter;
+
+		ASSERT(bio_op(bio) == REQ_OP_READ);
+		ASSERT(bio->bi_private == ctrl);
+		ASSERT(bio->bi_end_io == read_repair_end_bio);
+
+		/*
+		 * Our endio is super atomic, and we don't want to waste time on
+		 * lookup data csum. So here we just call btrfs_map_bio()
+		 * directly.
+		 */
+		ret = btrfs_map_bio(fs_info, bio, mirror);
+		if (ret) {
+			bio->bi_status = ret;
+			bio_endio(bio);
+		}
+	}
+	wait_event(ctrl->io_wait, atomic_read(&ctrl->io_bytes) == 0);
+
+	/* Now re-verify the newly read out data */
+	for_each_set_bit(bit, ctrl->prev_bad_bitmap, nbits) {
+		struct extent_state *cached = NULL;
+		const u64 logical = failed_logical +
+				    (bit << fs_info->sectorsize_bits);
+		const u64 file_offset = ctrl->file_offset +
+					(bit << fs_info->sectorsize_bits);
+		struct page *page;
+		u8 *csum = NULL;
+		int pgoff;
+		int ret;
+
+		/*
+		 * We didn't get a successful read for this sector, keep the
+		 * bad sector for next mirror.
+		 */
+		if (test_bit(bit, ctrl->cur_bad_bitmap))
+			continue;
+
+		if (btrfs_bio(ctrl->failed_bio)->csum)
+			csum = btrfs_bio(ctrl->failed_bio)->csum +
+				bit * fs_info->csum_size;
+
+		page = read_repair_get_sector(ctrl, bit, &pgoff);
+		/*
+		 * No csum, and endio function has cleared the error bit, the data
+		 * is good now.
+		 */
+		if (!csum)
+			goto uptodate;
+
+		ret = btrfs_check_data_sector(fs_info, page, pgoff, csum);
+		/*
+		 * We got a good read, but contents still mismatch, keep the
+		 * bad sector for next mirror.
+		 */
+		if (ret) {
+			set_bit(bit, ctrl->cur_bad_bitmap);
+			continue;
+		}
+uptodate:
+		clear_bit(bit, ctrl->cur_bad_bitmap);
+		/*
+		 * We repaired one sector, write the correct data back
+		 * to the bad mirror. Note that this function do the
+		 * write synchronously, and can be optimized later.
+		 */
+		repair_io_failure(fs_info, btrfs_ino(BTRFS_I(inode)),
+			file_offset, sectorsize, logical, page, pgoff,
+			get_prev_mirror(mirror, ctrl->num_copies));
+
+		/* Also update the page status */
+		end_page_read(page, true, file_offset, sectorsize);
+		set_extent_uptodate(&BTRFS_I(inode)->io_tree,
+				file_offset, file_offset + sectorsize - 1,
+				&cached, GFP_ATOMIC);
+		unlock_extent_cached_atomic(&BTRFS_I(inode)->io_tree,
+				file_offset, file_offset + sectorsize - 1,
+				&cached);
+	}
+}
+
 static int read_repair_add_sector(struct inode *inode,
 				  struct btrfs_read_repair_ctrl *ctrl,
 				  struct bio *failed_bio, u32 bio_offset)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 4904229ee73a..8b2ccbb2813e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -15,6 +15,7 @@
  * type for this bio
  */
 #define EXTENT_BIO_COMPRESSED 1
+#define EXTENT_BIO_SKIPCSUM   2
 #define EXTENT_BIO_FLAG_SHIFT 16
 
 enum {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1b596de0c4e9..355e559358a3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2609,7 +2609,7 @@ void btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 			btrfs_submit_compressed_read(inode, bio, mirror_num,
 						     bio_flags);
 			return;
-		} else {
+		} else if (!(bio_flags & EXTENT_BIO_SKIPCSUM)) {
 			/*
 			 * Lookup bio sums does extra checks around whether we
 			 * need to csum or not, which is why we ignore skip_sum
-- 
2.36.0



* [PATCH RFC v2 07/12] btrfs: allow btrfs read repair to submit all writes in one go
       [not found] <cover.1651043617.git.wqu@suse.com>
                   ` (5 preceding siblings ...)
  2022-04-27  7:18 ` [PATCH RFC v2 06/12] btrfs: introduce a helper to repair from one mirror Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-27  7:18 ` [PATCH RFC v2 08/12] btrfs: switch buffered read to the new btrfs_read_repair_* based repair routine Qu Wenruo
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

Currently if we want to submit a write for read-time repair, we call
repair_io_failure(), which submits the bio and waits for it.

But for the new btrfs_read_repair infrastructure, we want to submit all
write bios in one go and then wait for all of them to finish, just like
how we handle the read bios.

This patch gets rid of the repair_io_failure() call, replacing it with
the same bio handling: assemble all bios into a bio_list, then submit
them all and wait for them.
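
A userspace C11 sketch of the batched "queue, submit all, wait on a
byte counter" pattern described above; submit_one() and wait_for_zero()
are illustrative stand-ins for btrfs_map_bio() and the wait queue:

#include <stdatomic.h>
#include <stddef.h>

struct sketch_bio {
	struct sketch_bio *next;
	size_t len;
};

static void submit_all_and_wait(struct sketch_bio **list,
				atomic_size_t *io_bytes,
				void (*submit_one)(struct sketch_bio *),
				void (*wait_for_zero)(atomic_size_t *))
{
	while (*list) {
		struct sketch_bio *bio = *list;

		*list = bio->next;
		/* Track outstanding bytes; completion subtracts them again. */
		atomic_fetch_add(io_bytes, bio->len);
		submit_one(bio);
	}
	/* One wait for the whole batch instead of one per write. */
	wait_for_zero(io_bytes);
}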

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 116 +++++++++++++++++++++++++++----------------
 fs/btrfs/extent_io.h |   2 +-
 2 files changed, 75 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7db6800cba31..bbdd8d1a966a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2795,20 +2795,28 @@ static void read_repair_end_bio(struct bio *bio)
 
 /* Add a sector into the read repair bios list for later submission */
 static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
-				       int sector_nr)
+				       int sector_nr, unsigned int opf)
 {
-	const struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
 	struct page *page;
 	int pgoff;
 	struct bio *bio;
 	int ret;
 
+	ASSERT(opf == REQ_OP_WRITE || opf == REQ_OP_READ);
+
+	/* For write, we need to handle zoned case first */
+	if (opf == REQ_OP_WRITE) {
+		if (btrfs_repair_one_zone(fs_info, ctrl->logical))
+			return;
+	}
+
 	page = read_repair_get_sector(ctrl, sector_nr, &pgoff);
 	ASSERT(page);
 
 	/* Check if the sector can be added to the last bio */
-	if (!bio_list_empty(&ctrl->read_bios)) {
-		bio = ctrl->read_bios.tail;
+	if (!bio_list_empty(&ctrl->bios)) {
+		bio = ctrl->bios.tail;
 		if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) + bio->bi_iter.bi_size ==
 		    ctrl->logical + (sector_nr << fs_info->sectorsize_bits))
 			goto add;
@@ -2824,12 +2832,12 @@ static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 	/* It's backed by mempool, thus should not fail */
 	ASSERT(bio);
 
-	bio->bi_opf = REQ_OP_READ;
+	bio->bi_opf = opf;
 	bio->bi_iter.bi_sector = ((sector_nr << fs_info->sectorsize_bits) +
 				  ctrl->logical) >> SECTOR_SHIFT;
 	bio->bi_private = ctrl;
 	bio->bi_end_io = read_repair_end_bio;
-	bio_list_add(&ctrl->read_bios, bio);
+	bio_list_add(&ctrl->bios, bio);
 
 add:
 	ret = bio_add_page(bio, page, fs_info->sectorsize, pgoff);
@@ -2841,19 +2849,57 @@ static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 	atomic_add(fs_info->sectorsize, &ctrl->io_bytes);
 }
 
+static void read_repair_submit_bio(struct btrfs_read_repair_ctrl *ctrl,
+				   struct bio *bio, int mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	blk_status_t ret;
+
+	btrfs_bio(bio)->iter = bio->bi_iter;
+
+	/*
+	 * Zoned repair will be handled separately, thus we would only have
+	 * regular write or read bios here.
+	 */
+	ASSERT(bio_op(bio) == REQ_OP_WRITE || bio_op(bio) == REQ_OP_READ);
+	ASSERT(bio->bi_private == ctrl);
+	ASSERT(bio->bi_end_io == read_repair_end_bio);
+
+	/*
+	 * Avoid races with device replace and make sure our bioc has devices
+	 * associated to its stripes that don't go away while we are doing the
+	 * write operation.
+	 */
+	if (bio_op(bio) == REQ_OP_WRITE)
+		btrfs_bio_counter_inc_blocked(fs_info);
+
+	/*
+	 * Our endio is super atomic, and we don't want to waste time on
+	 * lookup data csum. So here we just call btrfs_map_bio()
+	 * directly.
+	 */
+	ret = btrfs_map_bio(fs_info, bio, mirror);
+
+	if (bio_op(bio) == REQ_OP_WRITE)
+		btrfs_bio_counter_dec(fs_info);
+
+	if (ret) {
+		bio->bi_status = ret;
+		bio_endio(bio);
+	}
+}
+
 static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 					struct inode *inode, int mirror)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	const int nbits = ctrl->bio_size >> fs_info->sectorsize_bits;
-	const u64 failed_logical = ctrl->failed_bio->bi_iter.bi_sector <<
-				   SECTOR_SHIFT;
 	const u32 sectorsize = fs_info->sectorsize;
 	struct bio *bio;
 	int bit;
 
 	/* We shouldn't have any pending read */
-	ASSERT(bio_list_size(&ctrl->read_bios) == 0 &&
+	ASSERT(bio_list_size(&ctrl->bios) == 0 &&
 	       atomic_read(&ctrl->io_bytes) == 0);
 
 	/*
@@ -2863,39 +2909,19 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 	 */
 	bitmap_copy(ctrl->prev_bad_bitmap, ctrl->cur_bad_bitmap, nbits);
 
-	/* Queue all bad sectors into our read_bios list */
+	/* Queue all bad sectors into our bios list */
 	for_each_set_bit(bit, ctrl->prev_bad_bitmap, nbits)
-		read_repair_bio_add_sector(ctrl, bit);
-
-	/* Submit all bios in read_bios and wait for them to finish */
-	for (bio = bio_list_pop(&ctrl->read_bios); bio;
-	     bio = bio_list_pop(&ctrl->read_bios)) {
-		blk_status_t ret;
+		read_repair_bio_add_sector(ctrl, bit, REQ_OP_READ);
 
-		btrfs_bio(bio)->iter = bio->bi_iter;
-
-		ASSERT(bio_op(bio) == REQ_OP_READ);
-		ASSERT(bio->bi_private == ctrl);
-		ASSERT(bio->bi_end_io == read_repair_end_bio);
-
-		/*
-		 * Our endio is super atomic, and we don't want to waste time on
-		 * lookup data csum. So here we just call btrfs_map_bio()
-		 * directly.
-		 */
-		ret = btrfs_map_bio(fs_info, bio, mirror);
-		if (ret) {
-			bio->bi_status = ret;
-			bio_endio(bio);
-		}
-	}
+	/* Submit all bios in bios list and wait for them to finish */
+	for (bio = bio_list_pop(&ctrl->bios); bio;
+	     bio = bio_list_pop(&ctrl->bios))
+		read_repair_submit_bio(ctrl, bio, mirror);
 	wait_event(ctrl->io_wait, atomic_read(&ctrl->io_bytes) == 0);
 
 	/* Now re-verify the newly read out data */
 	for_each_set_bit(bit, ctrl->prev_bad_bitmap, nbits) {
 		struct extent_state *cached = NULL;
-		const u64 logical = failed_logical +
-				    (bit << fs_info->sectorsize_bits);
 		const u64 file_offset = ctrl->file_offset +
 					(bit << fs_info->sectorsize_bits);
 		struct page *page;
@@ -2934,13 +2960,12 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 uptodate:
 		clear_bit(bit, ctrl->cur_bad_bitmap);
 		/*
-		 * We repaired one sector, write the correct data back
-		 * to the bad mirror. Note that this function do the
-		 * write synchronously, and can be optimized later.
+		 * We repaired one sector, queue this sector for later
+		 * writeback to recover the bad mirror.
+		 * Don't worry about zoned yet, it will be handled
+		 * in that function.
 		 */
-		repair_io_failure(fs_info, btrfs_ino(BTRFS_I(inode)),
-			file_offset, sectorsize, logical, page, pgoff,
-			get_prev_mirror(mirror, ctrl->num_copies));
+		read_repair_bio_add_sector(ctrl, bit, REQ_OP_WRITE);
 
 		/* Also update the page status */
 		end_page_read(page, true, file_offset, sectorsize);
@@ -2951,6 +2976,13 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 				file_offset, file_offset + sectorsize - 1,
 				&cached);
 	}
+
+	/* Submit and wait for all bios in the bios list */
+	for (bio = bio_list_pop(&ctrl->bios); bio;
+	     bio = bio_list_pop(&ctrl->bios))
+		read_repair_submit_bio(ctrl, bio,
+				get_prev_mirror(mirror, ctrl->num_copies));
+	wait_event(ctrl->io_wait, atomic_read(&ctrl->io_bytes) == 0);
 }
 
 static int read_repair_add_sector(struct inode *inode,
@@ -2984,7 +3016,7 @@ static int read_repair_add_sector(struct inode *inode,
 		ctrl->num_copies = btrfs_num_copies(fs_info, ctrl->logical,
 						    sectorsize);
 		init_waitqueue_head(&ctrl->io_wait);
-		bio_list_init(&ctrl->read_bios);
+		bio_list_init(&ctrl->bios);
 		atomic_set(&ctrl->io_bytes, 0);
 
 		ctrl->cur_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 8b2ccbb2813e..e056ebc7ae01 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -122,7 +122,7 @@ struct btrfs_read_repair_ctrl {
 	 */
 	unsigned long *prev_bad_bitmap;
 
-	struct bio_list read_bios;
+	struct bio_list bios;
 
 	wait_queue_head_t io_wait;
 
-- 
2.36.0



* [PATCH RFC v2 08/12] btrfs: switch buffered read to the new btrfs_read_repair_* based repair routine
       [not found] <cover.1651043617.git.wqu@suse.com>
                   ` (6 preceding siblings ...)
  2022-04-27  7:18 ` [PATCH RFC v2 07/12] btrfs: allow btrfs read repair to submit all writes in one go Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-27 13:59   ` kernel test robot
  2022-04-28 10:08   ` kernel test robot
  2022-04-27  7:18 ` [PATCH RFC v2 09/12] btrfs: switch direct IO routine to use btrfs_read_repair_ctrl Qu Wenruo
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

Instead of relying on failrec and failure_tree, we use the new
btrfs_read_repair_ctrl to do the repair.

The read-time repair now has a different timing:

- Function end_bio_extent_readpage() verifies the csum
  This is the same as usual.

- Function submit_data_read_repair() will call read_repair_add_sector()
  This function will only record the corrupted sectors.

- When all the bvecs are iterated, call read_repair_finish() to do the
  repair
  That will assemble the bios, submit them, wait for them to finish,
  and re-check the csums against each remaining mirror.

This means end_bio_extent_readpage() will handle all the read repair
in-house, without re-entering the same endio calls.

Now switch the data read path to the btrfs_read_repair_* based
solution.
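
A small C sketch of the mirror rotation the finish step performs: start
from the mirror after the initially failed one, wrap around, and stop
early once no bad sectors remain. next_mirror() mirrors the
get_next_mirror() arithmetic from the patch; repair_from() and
all_repaired() are illustrative stand-ins:

#include <stdbool.h>

static int next_mirror(int cur, int num_copies)
{
	return (cur + 1 > num_copies) ? cur + 1 - num_copies : cur + 1;
}

static void finish_repair(int init_mirror, int num_copies,
			  void (*repair_from)(int mirror),
			  bool (*all_repaired)(void))
{
	for (int m = next_mirror(init_mirror, num_copies);
	     m != init_mirror; m = next_mirror(m, num_copies)) {
		repair_from(m);		/* read + verify + write back */
		if (all_repaired())
			break;		/* bad bitmap is empty, done early */
	}
}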

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 82 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 66 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bbdd8d1a966a..d747a2e1fe0e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2732,11 +2732,22 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 		btrfs_subpage_end_reader(fs_info, page, start, len);
 }
 
+static int get_next_mirror(int cur_mirror, int num_copy)
+{
+	return (cur_mirror + 1 > num_copy) ? (cur_mirror + 1 - num_copy) :
+		cur_mirror + 1;
+}
+
 static int get_prev_mirror(int cur_mirror, int num_copy)
 {
 	return (cur_mirror - 1 <= 0) ? (num_copy) : cur_mirror - 1;
 }
 
+static bool bitmap_all_zero(unsigned long *bitmap, int nbits)
+{
+	return (find_first_bit(bitmap, nbits) == nbits);
+}
+
 static struct page *read_repair_get_sector(struct btrfs_read_repair_ctrl *ctrl,
 					   int sector_nr, unsigned int *pgoff)
 {
@@ -2987,7 +2998,8 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 
 static int read_repair_add_sector(struct inode *inode,
 				  struct btrfs_read_repair_ctrl *ctrl,
-				  struct bio *failed_bio, u32 bio_offset)
+				  struct bio *failed_bio, u32 bio_offset,
+				  u64 file_offset)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 
@@ -3010,6 +3022,7 @@ static int read_repair_add_sector(struct inode *inode,
 		ctrl->bio_size = btrfs_bio(failed_bio)->iter.bi_size;
 		ASSERT(ctrl->bio_size);
 		ASSERT(IS_ALIGNED(ctrl->bio_size, fs_info->sectorsize));
+		ctrl->file_offset = file_offset - bio_offset;
 		ctrl->error = false;
 		ctrl->inode = inode;
 		ctrl->init_mirror = btrfs_bio(failed_bio)->mirror_num;
@@ -3047,9 +3060,53 @@ static int read_repair_add_sector(struct inode *inode,
 
 static void read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 {
+	struct btrfs_fs_info *fs_info;
+	unsigned int nbits;
+	u32 sectorsize;
+	int bit;
+	int i;
+
 	if (!ctrl->initialized)
 		return;
 
+	/*
+	 * Got a critical -ENOMEM error previously, no repair should have been
+	 * attempted.
+	 */
+	if (ctrl->error) {
+		ASSERT(bio_list_empty(&ctrl->bios));
+		ASSERT(atomic_read(&ctrl->io_bytes) == 0);
+		goto mark_error;
+	}
+
+	ASSERT(ctrl->inode);
+	fs_info = btrfs_sb(ctrl->inode->i_sb);
+	nbits = ctrl->bio_size >> fs_info->sectorsize_bits;
+	sectorsize = fs_info->sectorsize;
+
+	/* Go through each remaining mirror to do the repair */
+	for (i = get_next_mirror(ctrl->init_mirror, ctrl->num_copies);
+	     i != ctrl->init_mirror; i = get_next_mirror(i, ctrl->num_copies)) {
+		read_repair_from_one_mirror(ctrl, ctrl->inode, i);
+
+		/* Check the error bitmap to see if no more corrupted sectors */
+		if (bitmap_all_zero(ctrl->cur_bad_bitmap, nbits))
+			break;
+	}
+mark_error:
+	/* Finish the unrecovered bad sectors */
+	for_each_set_bit(bit, ctrl->cur_bad_bitmap, nbits) {
+		struct page *page;
+		unsigned int pgoff;
+		u64 file_offset = (bit << fs_info->sectorsize_bits) +
+				  ctrl->file_offset;
+
+		page = read_repair_get_sector(ctrl, bit, &pgoff);
+
+		end_page_read(page, false, file_offset, sectorsize);
+		unlock_extent_cached_atomic(&BTRFS_I(ctrl->inode)->io_tree,
+				file_offset, file_offset + sectorsize - 1, NULL);
+	}
 	kfree(ctrl->cur_bad_bitmap);
 	kfree(ctrl->prev_bad_bitmap);
 	ctrl->cur_bad_bitmap = NULL;
@@ -3057,6 +3114,8 @@ static void read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 	ctrl->initialized = false;
 	ctrl->error = false;
 	ctrl->failed_bio = NULL;
+	ASSERT(bio_list_empty(&ctrl->bios));
+	ASSERT(atomic_read(&ctrl->io_bytes) == 0);
 }
 
 static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
@@ -3067,7 +3126,6 @@ static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
 					    int failed_mirror,
 					    unsigned int error_bitmap)
 {
-	const unsigned int pgoff = bvec->bv_offset;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct page *page = bvec->bv_page;
 	const u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
@@ -3107,22 +3165,14 @@ static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
 			goto next;
 		}
 
-		/*
-		 * We currently only set the bitmap, no real repair, thus can
-		 * ignore the error for now.
-		 */
-		read_repair_add_sector(inode, ctrl, failed_bio,
-				       bio_offset + offset);
-		ret = btrfs_repair_one_sector(inode, failed_bio,
-				bio_offset + offset,
-				page, pgoff + offset, start + offset,
-				failed_mirror, btrfs_submit_data_bio);
+		ret = read_repair_add_sector(inode, ctrl, failed_bio,
+					     bio_offset + offset,
+					     start + offset);
 		if (!ret) {
 			/*
-			 * We have submitted the read repair, the page release
-			 * will be handled by the endio function of the
-			 * submitted repair bio.
-			 * Thus we don't need to do any thing here.
+			 * We have recorded the corrupted sector, will submit
+			 * read bios and handle the page status update later
+			 * in read_repair_finish().
 			 */
 			continue;
 		}
-- 
2.36.0



* [PATCH RFC v2 09/12] btrfs: switch direct IO routine to use btrfs_read_repair_ctrl
       [not found] <cover.1651043617.git.wqu@suse.com>
                   ` (7 preceding siblings ...)
  2022-04-27  7:18 ` [PATCH RFC v2 08/12] btrfs: switch buffered read to the new btrfs_read_repair_* based repair routine Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-27  7:18 ` [PATCH RFC v2 10/12] btrfs: cleanup btrfs_repair_one_sector() Qu Wenruo
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

The btrfs_read_repair_ctrl framework has already taken dio into
consideration; we just need to skip all the page status/extent state
updates for DIO.

This patch introduces a new bool member, btrfs_read_repair_ctrl::dio,
to indicate if we're doing read repair for dio.
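
A tiny C sketch of the dio gating: the shared finish path skips the
page/extent state updates when the dio flag is set and reports any
sector left unrepaired via -EIO so the direct IO caller can fail the
read. Names are illustrative, not the kernel API:

#include <stdbool.h>
#include <errno.h>

struct ctrl_sketch {
	bool dio;
};

static int finish_sketch(struct ctrl_sketch *ctrl, int bad_sectors_left,
			 void (*mark_pages_error)(void))
{
	if (bad_sectors_left && !ctrl->dio)
		mark_pages_error();	/* buffered read: flag the pages */
	/* dio has no pages to flag and relies on the return value. */
	return bad_sectors_left ? -EIO : 0;
}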

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 68 +++++++++++++++++++++++++++++---------------
 fs/btrfs/extent_io.h | 11 +++++++
 fs/btrfs/inode.c     | 32 ++++++---------------
 3 files changed, 65 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d747a2e1fe0e..515de39e70f7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2790,7 +2790,7 @@ static void read_repair_end_bio(struct bio *bio)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		/*
 		 * If we have a successful read, clear the error bit.
-		 * In read_repair_finish(), we will re-check the csum
+		 * In btrfs_read_repair_finish(), we will re-check the csum
 		 * (if exists) later.
 		 */
 		if (uptodate)
@@ -2978,14 +2978,16 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 		 */
 		read_repair_bio_add_sector(ctrl, bit, REQ_OP_WRITE);
 
-		/* Also update the page status */
-		end_page_read(page, true, file_offset, sectorsize);
-		set_extent_uptodate(&BTRFS_I(inode)->io_tree,
-				file_offset, file_offset + sectorsize - 1,
-				&cached, GFP_ATOMIC);
-		unlock_extent_cached_atomic(&BTRFS_I(inode)->io_tree,
-				file_offset, file_offset + sectorsize - 1,
-				&cached);
+		if (!ctrl->dio) {
+			/* Also update the page status */
+			end_page_read(page, true, file_offset, sectorsize);
+			set_extent_uptodate(&BTRFS_I(inode)->io_tree,
+					file_offset, file_offset + sectorsize - 1,
+					&cached, GFP_ATOMIC);
+			unlock_extent_cached_atomic(&BTRFS_I(inode)->io_tree,
+					file_offset, file_offset + sectorsize - 1,
+					&cached);
+		}
 	}
 
 	/* Submit and wait for all bios in the bios list */
@@ -2996,10 +2998,10 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 	wait_event(ctrl->io_wait, atomic_read(&ctrl->io_bytes) == 0);
 }
 
-static int read_repair_add_sector(struct inode *inode,
-				  struct btrfs_read_repair_ctrl *ctrl,
-				  struct bio *failed_bio, u32 bio_offset,
-				  u64 file_offset)
+int btrfs_read_repair_add_sector(struct inode *inode,
+				 struct btrfs_read_repair_ctrl *ctrl,
+				 struct bio *failed_bio, u32 bio_offset,
+				 u64 file_offset, bool dio)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 
@@ -3025,6 +3027,7 @@ static int read_repair_add_sector(struct inode *inode,
 		ctrl->file_offset = file_offset - bio_offset;
 		ctrl->error = false;
 		ctrl->inode = inode;
+		ctrl->dio = dio;
 		ctrl->init_mirror = btrfs_bio(failed_bio)->mirror_num;
 		ctrl->num_copies = btrfs_num_copies(fs_info, ctrl->logical,
 						    sectorsize);
@@ -3054,20 +3057,29 @@ static int read_repair_add_sector(struct inode *inode,
 
 	ASSERT(ctrl->inode);
 	ASSERT(ctrl->inode == inode);
+	ASSERT(ctrl->dio == dio);
 	set_bit(bio_offset >> fs_info->sectorsize_bits, ctrl->cur_bad_bitmap);
 	return 0;
 }
 
-static void read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
+/*
+ * Handle all the recorded corrupted sectors.
+ *
+ * Return 0 if we repaired all corrupted sectors.
+ * Return -EIO if we can not repair all corrupted sectors, this is mostly for
+ * dio routine, as we can not just mark page error for dio.
+ */
+int btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 {
 	struct btrfs_fs_info *fs_info;
 	unsigned int nbits;
 	u32 sectorsize;
+	bool has_error = false;
 	int bit;
 	int i;
 
 	if (!ctrl->initialized)
-		return;
+		return 0;
 
 	/*
 	 * Got a critical -ENOMEM error previously, no repair should have been
@@ -3101,11 +3113,14 @@ static void read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 		u64 file_offset = (bit << fs_info->sectorsize_bits) +
 				  ctrl->file_offset;
 
-		page = read_repair_get_sector(ctrl, bit, &pgoff);
+		if (!ctrl->dio) {
+			page = read_repair_get_sector(ctrl, bit, &pgoff);
 
-		end_page_read(page, false, file_offset, sectorsize);
-		unlock_extent_cached_atomic(&BTRFS_I(ctrl->inode)->io_tree,
+			end_page_read(page, false, file_offset, sectorsize);
+			unlock_extent_cached_atomic(&BTRFS_I(ctrl->inode)->io_tree,
 				file_offset, file_offset + sectorsize - 1, NULL);
+		}
+		has_error = true;
 	}
 	kfree(ctrl->cur_bad_bitmap);
 	kfree(ctrl->prev_bad_bitmap);
@@ -3116,6 +3131,9 @@ static void read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 	ctrl->failed_bio = NULL;
 	ASSERT(bio_list_empty(&ctrl->bios));
 	ASSERT(atomic_read(&ctrl->io_bytes) == 0);
+	if (has_error)
+		return -EIO;
+	return 0;
 }
 
 static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
@@ -3165,14 +3183,14 @@ static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
 			goto next;
 		}
 
-		ret = read_repair_add_sector(inode, ctrl, failed_bio,
-					     bio_offset + offset,
-					     start + offset);
+		ret = btrfs_read_repair_add_sector(inode, ctrl, failed_bio,
+						   bio_offset + offset,
+						   start + offset, false);
 		if (!ret) {
 			/*
 			 * We have recorded the corrupted sector, will submit
 			 * read bios and handle the page status update later
-			 * in read_repair_finish().
+			 * in btrfs_read_repair_finish().
 			 */
 			continue;
 		}
@@ -3534,7 +3552,11 @@ static void end_bio_extent_readpage(struct bio *bio)
 	}
 	/* Release the last extent */
 	endio_readpage_release_extent(&processed, NULL, 0, 0, false);
-	read_repair_finish(&ctrl);
+	/*
+	 * For buffered read, we don't care about the return value, as we just
+	 * mark all the corrupted sectors errors if repair failed.
+	 */
+	btrfs_read_repair_finish(&ctrl);
 	btrfs_bio_free_csum(bbio);
 	bio_put(bio);
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index e056ebc7ae01..f829e6899fad 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -153,8 +153,19 @@ struct btrfs_read_repair_ctrl {
 	 */
 	bool error;
 	bool initialized;
+
+	/*
+	 * For dio case, we just skip all page status and extent status
+	 * update.
+	 */
+	bool dio;
 };
 
+int btrfs_read_repair_add_sector(struct inode *inode,
+				 struct btrfs_read_repair_ctrl *ctrl,
+				 struct bio *failed_bio, u32 bio_offset,
+				 u64 file_offset, bool dio);
+int btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl);
 /*
  * Structure to record how many bytes and which ranges are set/cleared
  */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 355e559358a3..b7cd11e403ab 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7809,22 +7809,6 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 	kfree(dip);
 }
 
-static void submit_dio_repair_bio(struct inode *inode, struct bio *bio,
-				  int mirror_num, unsigned long bio_flags)
-{
-	struct btrfs_dio_private *dip = bio->bi_private;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-
-	BUG_ON(bio_op(bio) == REQ_OP_WRITE);
-
-	if (btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA))
-		return;
-
-	refcount_inc(&dip->refs);
-	if (btrfs_map_bio(fs_info, bio, mirror_num))
-		refcount_dec(&dip->refs);
-}
-
 static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 					     struct btrfs_bio *bbio,
 					     const bool uptodate)
@@ -7835,10 +7819,12 @@ static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
+	struct btrfs_read_repair_ctrl ctrl = { 0 };
 	struct bio_vec bvec;
 	struct bvec_iter iter;
 	u32 bio_offset = 0;
 	blk_status_t err = BLK_STS_OK;
+	int ret;
 
 	__bio_for_each_segment(bvec, &bbio->bio, iter, bbio->iter) {
 		unsigned int i, nr_sectors, pgoff;
@@ -7858,13 +7844,10 @@ static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 						 btrfs_ino(BTRFS_I(inode)),
 						 pgoff);
 			} else {
-				int ret;
-
-				ret = btrfs_repair_one_sector(inode, &bbio->bio,
-						bio_offset, bvec.bv_page, pgoff,
-						start, bbio->mirror_num,
-						submit_dio_repair_bio);
-				if (ret)
+				ret = btrfs_read_repair_add_sector(inode, &ctrl,
+						&bbio->bio, bio_offset, start,
+						true);
+				if (ret && !err)
 					err = errno_to_blk_status(ret);
 			}
 			ASSERT(bio_offset + sectorsize > bio_offset);
@@ -7872,6 +7855,9 @@ static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 			pgoff += sectorsize;
 		}
 	}
+	ret = btrfs_read_repair_finish(&ctrl);
+	if (ret && !err)
+		err = errno_to_blk_status(ret);
 	return err;
 }
 
-- 
2.36.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH RFC v2 10/12] btrfs: cleanup btrfs_repair_one_sector()
       [not found] <cover.1651043617.git.wqu@suse.com>
                   ` (8 preceding siblings ...)
  2022-04-27  7:18 ` [PATCH RFC v2 09/12] btrfs: switch direct IO routine to use btrfs_read_repair_ctrl Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-28 13:45   ` Christoph Hellwig
  2022-04-27  7:18 ` [PATCH RFC v2 11/12] btrfs: remove io_failure_record infrastructure completely Qu Wenruo
  2022-04-27  7:18 ` [PATCH RFC v2 12/12] btrfs: remove btrfs_inode::io_failure_tree Qu Wenruo
  11 siblings, 1 reply; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

The function and its dependencies are no longer used, so we can clean
them up.

Please note that btrfs_repair_one_sector() is the only caller of
set_state_failrec() with a non-NULL pointer.

Although this patch leaves the other failure record related
infrastructure untouched, that is mostly to keep the patch split clean.

This is the first step towards removing btrfs_inode::io_failure_tree.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 195 -------------------------------------------
 fs/btrfs/extent_io.h |   6 --
 2 files changed, 201 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 515de39e70f7..6f2faa77523d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2509,201 +2509,6 @@ void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
 	spin_unlock(&failure_tree->lock);
 }
 
-static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
-							     u64 start)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct io_failure_record *failrec;
-	struct extent_map *em;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
-	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	const u32 sectorsize = fs_info->sectorsize;
-	int ret;
-	u64 logical;
-
-	failrec = get_state_failrec(failure_tree, start);
-	if (!IS_ERR(failrec)) {
-		btrfs_debug(fs_info,
-	"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu",
-			failrec->logical, failrec->start, failrec->len);
-		/*
-		 * when data can be on disk more than twice, add to failrec here
-		 * (e.g. with a list for failed_mirror) to make
-		 * clean_io_failure() clean all those errors at once.
-		 */
-
-		return failrec;
-	}
-
-	failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
-	if (!failrec)
-		return ERR_PTR(-ENOMEM);
-
-	failrec->start = start;
-	failrec->len = sectorsize;
-	failrec->this_mirror = 0;
-	failrec->bio_flags = 0;
-
-	read_lock(&em_tree->lock);
-	em = lookup_extent_mapping(em_tree, start, failrec->len);
-	if (!em) {
-		read_unlock(&em_tree->lock);
-		kfree(failrec);
-		return ERR_PTR(-EIO);
-	}
-
-	if (em->start > start || em->start + em->len <= start) {
-		free_extent_map(em);
-		em = NULL;
-	}
-	read_unlock(&em_tree->lock);
-	if (!em) {
-		kfree(failrec);
-		return ERR_PTR(-EIO);
-	}
-
-	logical = start - em->start;
-	logical = em->block_start + logical;
-	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
-		logical = em->block_start;
-		failrec->bio_flags = EXTENT_BIO_COMPRESSED;
-		extent_set_compress_type(&failrec->bio_flags, em->compress_type);
-	}
-
-	btrfs_debug(fs_info,
-		    "Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu",
-		    logical, start, failrec->len);
-
-	failrec->logical = logical;
-	free_extent_map(em);
-
-	/* Set the bits in the private failure tree */
-	ret = set_extent_bits(failure_tree, start, start + sectorsize - 1,
-			      EXTENT_LOCKED | EXTENT_DIRTY);
-	if (ret >= 0) {
-		ret = set_state_failrec(failure_tree, start, failrec);
-		/* Set the bits in the inode's tree */
-		ret = set_extent_bits(tree, start, start + sectorsize - 1,
-				      EXTENT_DAMAGED);
-	} else if (ret < 0) {
-		kfree(failrec);
-		return ERR_PTR(ret);
-	}
-
-	return failrec;
-}
-
-static bool btrfs_check_repairable(struct inode *inode,
-				   struct io_failure_record *failrec,
-				   int failed_mirror)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	int num_copies;
-
-	num_copies = btrfs_num_copies(fs_info, failrec->logical, failrec->len);
-	if (num_copies == 1) {
-		/*
-		 * we only have a single copy of the data, so don't bother with
-		 * all the retry and error correction code that follows. no
-		 * matter what the error is, it is very likely to persist.
-		 */
-		btrfs_debug(fs_info,
-			"Check Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return false;
-	}
-
-	/* The failure record should only contain one sector */
-	ASSERT(failrec->len == fs_info->sectorsize);
-
-	/*
-	 * There are two premises:
-	 * a) deliver good data to the caller
-	 * b) correct the bad sectors on disk
-	 *
-	 * Since we're only doing repair for one sector, we only need to get
-	 * a good copy of the failed sector and if we succeed, we have setup
-	 * everything for repair_io_failure to do the rest for us.
-	 */
-	ASSERT(failed_mirror);
-	failrec->failed_mirror = failed_mirror;
-	failrec->this_mirror++;
-	if (failrec->this_mirror == failed_mirror)
-		failrec->this_mirror++;
-
-	if (failrec->this_mirror > num_copies) {
-		btrfs_debug(fs_info,
-			"Check Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return false;
-	}
-
-	return true;
-}
-
-int btrfs_repair_one_sector(struct inode *inode,
-			    struct bio *failed_bio, u32 bio_offset,
-			    struct page *page, unsigned int pgoff,
-			    u64 start, int failed_mirror,
-			    submit_bio_hook_t *submit_bio_hook)
-{
-	struct io_failure_record *failrec;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct btrfs_bio *failed_bbio = btrfs_bio(failed_bio);
-	const int icsum = bio_offset >> fs_info->sectorsize_bits;
-	struct bio *repair_bio;
-	struct btrfs_bio *repair_bbio;
-
-	btrfs_debug(fs_info,
-		   "repair read error: read error at %llu", start);
-
-	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
-
-	failrec = btrfs_get_io_failure_record(inode, start);
-	if (IS_ERR(failrec))
-		return PTR_ERR(failrec);
-
-
-	if (!btrfs_check_repairable(inode, failrec, failed_mirror)) {
-		free_io_failure(failure_tree, tree, failrec);
-		return -EIO;
-	}
-
-	repair_bio = btrfs_bio_alloc(1);
-	repair_bbio = btrfs_bio(repair_bio);
-	repair_bbio->file_offset = start;
-	repair_bio->bi_opf = REQ_OP_READ;
-	repair_bio->bi_end_io = failed_bio->bi_end_io;
-	repair_bio->bi_iter.bi_sector = failrec->logical >> 9;
-	repair_bio->bi_private = failed_bio->bi_private;
-
-	if (failed_bbio->csum) {
-		const u32 csum_size = fs_info->csum_size;
-
-		repair_bbio->csum = repair_bbio->csum_inline;
-		memcpy(repair_bbio->csum,
-		       failed_bbio->csum + csum_size * icsum, csum_size);
-	}
-
-	bio_add_page(repair_bio, page, failrec->len, pgoff);
-	repair_bbio->iter = repair_bio->bi_iter;
-
-	btrfs_debug(btrfs_sb(inode->i_sb),
-		    "repair read error: submitting new read to mirror %d",
-		    failrec->this_mirror);
-
-	/*
-	 * At this point we have a bio, so any errors from submit_bio_hook()
-	 * will be handled by the endio on the repair_bio, so we can't return an
-	 * error here.
-	 */
-	submit_bio_hook(inode, repair_bio, failrec->this_mirror, failrec->bio_flags);
-	return BLK_STS_OK;
-}
-
 static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index f829e6899fad..5e76df4a832b 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -353,12 +353,6 @@ struct io_failure_record {
 	int failed_mirror;
 };
 
-int btrfs_repair_one_sector(struct inode *inode,
-			    struct bio *failed_bio, u32 bio_offset,
-			    struct page *page, unsigned int pgoff,
-			    u64 start, int failed_mirror,
-			    submit_bio_hook_t *submit_bio_hook);
-
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 bool find_lock_delalloc_range(struct inode *inode,
 			     struct page *locked_page, u64 *start,
-- 
2.36.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH RFC v2 11/12] btrfs: remove io_failure_record infrastructure completely
       [not found] <cover.1651043617.git.wqu@suse.com>
                   ` (9 preceding siblings ...)
  2022-04-27  7:18 ` [PATCH RFC v2 10/12] btrfs: cleanup btrfs_repair_one_sector() Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  2022-04-27  7:18 ` [PATCH RFC v2 12/12] btrfs: remove btrfs_inode::io_failure_tree Qu Wenruo
  11 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

Read repair is now always handled by btrfs_read_repair_ctrl, whose
lifespan is limited to the endio function.

This means we no longer need to record which range failed and on which
mirror.

Now if we fail to read some data page, we have already tried every
mirror we have, thus there is no need to record the failed range.

Thus this patch removes the whole io_failure_record structure and its
related functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-io-tree.h |  14 ---
 fs/btrfs/extent_io.c      | 179 --------------------------------------
 fs/btrfs/extent_io.h      |  19 ----
 fs/btrfs/inode.c          |  19 +---
 4 files changed, 4 insertions(+), 227 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index c3eb52dbe61c..d46c064e5dad 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -250,18 +250,4 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 			       u64 *end, u64 max_bytes,
 			       struct extent_state **cached_state);
 
-/* This should be reworked in the future and put elsewhere. */
-struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start);
-int set_state_failrec(struct extent_io_tree *tree, u64 start,
-		      struct io_failure_record *failrec);
-void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start,
-		u64 end);
-int free_io_failure(struct extent_io_tree *failure_tree,
-		    struct extent_io_tree *io_tree,
-		    struct io_failure_record *rec);
-int clean_io_failure(struct btrfs_fs_info *fs_info,
-		     struct extent_io_tree *failure_tree,
-		     struct extent_io_tree *io_tree, u64 start,
-		     struct page *page, u64 ino, unsigned int pg_offset);
-
 #endif /* BTRFS_EXTENT_IO_TREE_H */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6f2faa77523d..92cd0c0851cd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2168,66 +2168,6 @@ u64 count_range_bits(struct extent_io_tree *tree,
 	return total_bytes;
 }
 
-/*
- * set the private field for a given byte offset in the tree.  If there isn't
- * an extent_state there already, this does nothing.
- */
-int set_state_failrec(struct extent_io_tree *tree, u64 start,
-		      struct io_failure_record *failrec)
-{
-	struct rb_node *node;
-	struct extent_state *state;
-	int ret = 0;
-
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state->failrec = failrec;
-out:
-	spin_unlock(&tree->lock);
-	return ret;
-}
-
-struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start)
-{
-	struct rb_node *node;
-	struct extent_state *state;
-	struct io_failure_record *failrec;
-
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		failrec = ERR_PTR(-ENOENT);
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		failrec = ERR_PTR(-ENOENT);
-		goto out;
-	}
-
-	failrec = state->failrec;
-out:
-	spin_unlock(&tree->lock);
-	return failrec;
-}
-
 /*
  * searches a range in the state tree for a given mask.
  * If 'filled' == 1, this returns 1 only if every extent in the tree
@@ -2284,30 +2224,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	return bitset;
 }
 
-int free_io_failure(struct extent_io_tree *failure_tree,
-		    struct extent_io_tree *io_tree,
-		    struct io_failure_record *rec)
-{
-	int ret;
-	int err = 0;
-
-	set_state_failrec(failure_tree, rec->start, NULL);
-	ret = clear_extent_bits(failure_tree, rec->start,
-				rec->start + rec->len - 1,
-				EXTENT_LOCKED | EXTENT_DIRTY);
-	if (ret)
-		err = ret;
-
-	ret = clear_extent_bits(io_tree, rec->start,
-				rec->start + rec->len - 1,
-				EXTENT_DAMAGED);
-	if (ret && !err)
-		err = ret;
-
-	kfree(rec);
-	return err;
-}
-
 /*
  * this bypasses the standard btrfs submit functions deliberately, as
  * the standard behavior is to write all copies in a raid setup. here we only
@@ -2422,93 +2338,6 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
 	return ret;
 }
 
-/*
- * each time an IO finishes, we do a fast check in the IO failure tree
- * to see if we need to process or clean up an io_failure_record
- */
-int clean_io_failure(struct btrfs_fs_info *fs_info,
-		     struct extent_io_tree *failure_tree,
-		     struct extent_io_tree *io_tree, u64 start,
-		     struct page *page, u64 ino, unsigned int pg_offset)
-{
-	u64 private;
-	struct io_failure_record *failrec;
-	struct extent_state *state;
-	int num_copies;
-	int ret;
-
-	private = 0;
-	ret = count_range_bits(failure_tree, &private, (u64)-1, 1,
-			       EXTENT_DIRTY, 0);
-	if (!ret)
-		return 0;
-
-	failrec = get_state_failrec(failure_tree, start);
-	if (IS_ERR(failrec))
-		return 0;
-
-	BUG_ON(!failrec->this_mirror);
-
-	if (sb_rdonly(fs_info->sb))
-		goto out;
-
-	spin_lock(&io_tree->lock);
-	state = find_first_extent_bit_state(io_tree,
-					    failrec->start,
-					    EXTENT_LOCKED);
-	spin_unlock(&io_tree->lock);
-
-	if (state && state->start <= failrec->start &&
-	    state->end >= failrec->start + failrec->len - 1) {
-		num_copies = btrfs_num_copies(fs_info, failrec->logical,
-					      failrec->len);
-		if (num_copies > 1)  {
-			repair_io_failure(fs_info, ino, start, failrec->len,
-					  failrec->logical, page, pg_offset,
-					  failrec->failed_mirror);
-		}
-	}
-
-out:
-	free_io_failure(failure_tree, io_tree, failrec);
-
-	return 0;
-}
-
-/*
- * Can be called when
- * - hold extent lock
- * - under ordered extent
- * - the inode is freeing
- */
-void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
-{
-	struct extent_io_tree *failure_tree = &inode->io_failure_tree;
-	struct io_failure_record *failrec;
-	struct extent_state *state, *next;
-
-	if (RB_EMPTY_ROOT(&failure_tree->state))
-		return;
-
-	spin_lock(&failure_tree->lock);
-	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
-	while (state) {
-		if (state->start > end)
-			break;
-
-		ASSERT(state->end <= end);
-
-		next = next_state(state);
-
-		failrec = state->failrec;
-		free_extent_state(state);
-		kfree(failrec);
-
-		state = next;
-	}
-	spin_unlock(&failure_tree->lock);
-}
-
 static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
@@ -3225,7 +3054,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 	struct btrfs_read_repair_ctrl ctrl = { 0 };
 	struct bio_vec *bvec;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
-	struct extent_io_tree *tree, *failure_tree;
 	struct processed_extent processed = { 0 };
 	/*
 	 * The offset to the beginning of a bio, since one bio can never be
@@ -3252,8 +3080,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 			"end_bio_extent_readpage: bi_sector=%llu, err=%d, mirror=%u",
 			bio->bi_iter.bi_sector, bio->bi_status,
 			bbio->mirror_num);
-		tree = &BTRFS_I(inode)->io_tree;
-		failure_tree = &BTRFS_I(inode)->io_failure_tree;
 
 		/*
 		 * We always issue full-sector reads, but if some block in a
@@ -3288,11 +3114,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 			}
 			if (ret)
 				uptodate = false;
-			else
-				clean_io_failure(BTRFS_I(inode)->root->fs_info,
-						 failure_tree, tree, start,
-						 page,
-						 btrfs_ino(BTRFS_I(inode)), 0);
 		}
 
 		if (likely(uptodate))
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e76df4a832b..77f4acd6ec30 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -70,7 +70,6 @@ struct btrfs_root;
 struct btrfs_inode;
 struct btrfs_io_bio;
 struct btrfs_fs_info;
-struct io_failure_record;
 struct extent_io_tree;
 
 typedef void (submit_bio_hook_t)(struct inode *inode, struct bio *bio,
@@ -335,24 +334,6 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size);
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num);
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the sector is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-	struct page *page;
-	u64 start;
-	u64 len;
-	u64 logical;
-	unsigned long bio_flags;
-	int this_mirror;
-	int failed_mirror;
-};
-
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 bool find_lock_delalloc_range(struct inode *inode,
 			     struct page *locked_page, u64 *start,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b7cd11e403ab..99408618af1c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3115,8 +3115,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 					ordered_extent->disk_num_bytes);
 	}
 
-	btrfs_free_io_failure_record(inode, start, end);
-
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
 		logical_len = ordered_extent->truncated_len;
@@ -5335,8 +5333,6 @@ void btrfs_evict_inode(struct inode *inode)
 	if (is_bad_inode(inode))
 		goto no_delete;
 
-	btrfs_free_io_failure_record(BTRFS_I(inode), 0, (u64)-1);
-
 	if (test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags))
 		goto no_delete;
 
@@ -7816,8 +7812,6 @@ static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 	struct inode *inode = dip->inode;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	const u32 sectorsize = fs_info->sectorsize;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
 	struct btrfs_read_repair_ctrl ctrl = { 0 };
 	struct bio_vec bvec;
@@ -7835,15 +7829,10 @@ static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 			u64 start = bbio->file_offset + bio_offset;
 
 			ASSERT(pgoff < PAGE_SIZE);
-			if (uptodate &&
-			    (!csum || !check_data_csum(inode, bbio,
-						       bio_offset, bvec.bv_page,
-						       pgoff, start))) {
-				clean_io_failure(fs_info, failure_tree, io_tree,
-						 start, bvec.bv_page,
-						 btrfs_ino(BTRFS_I(inode)),
-						 pgoff);
-			} else {
+			/* Either we hit a read failure or a csum mismatch */
+			if (!uptodate || (csum && check_data_csum(inode, bbio,
+							bio_offset, bvec.bv_page,
+							pgoff, start))) {
 				ret = btrfs_read_repair_add_sector(inode, &ctrl,
 						&bbio->bio, bio_offset, start,
 						true);
-- 
2.36.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH RFC v2 12/12] btrfs: remove btrfs_inode::io_failure_tree
       [not found] <cover.1651043617.git.wqu@suse.com>
                   ` (10 preceding siblings ...)
  2022-04-27  7:18 ` [PATCH RFC v2 11/12] btrfs: remove io_failure_record infrastructure completely Qu Wenruo
@ 2022-04-27  7:18 ` Qu Wenruo
  11 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-27  7:18 UTC (permalink / raw)
  To: linux-btrfs

Since we're handling all data read errors inside the endio function,
there is no need to record any error in io_failure_tree.

Remove btrfs_inode::io_failure_tree completely.

Although we have extra memory usage for btrfs_read_repair_ctrl, its
lifespan is limited to endio, while io_failure_tree lives as long as
the inode itself.
Thus this should bring an overall reduction in memory usage.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/btrfs_inode.h       | 5 -----
 fs/btrfs/extent-io-tree.h    | 1 -
 fs/btrfs/inode.c             | 3 ---
 include/trace/events/btrfs.h | 1 -
 4 files changed, 10 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 32131a5d321b..2ff4da4ed2a1 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -91,11 +91,6 @@ struct btrfs_inode {
 	/* the io_tree does range state (DIRTY, LOCKED etc) */
 	struct extent_io_tree io_tree;
 
-	/* special utility tree used to record which mirrors have already been
-	 * tried when checksums fail for a given block
-	 */
-	struct extent_io_tree io_failure_tree;
-
 	/*
 	 * Keep track of where the inode has extent items mapped in order to
 	 * make sure the i_size adjustments are accurate
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index d46c064e5dad..8ab9b6cd53ed 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -56,7 +56,6 @@ enum {
 	IO_TREE_FS_EXCLUDED_EXTENTS,
 	IO_TREE_BTREE_INODE_IO,
 	IO_TREE_INODE_IO,
-	IO_TREE_INODE_IO_FAILURE,
 	IO_TREE_RELOC_BLOCKS,
 	IO_TREE_TRANS_DIRTY_PAGES,
 	IO_TREE_ROOT_DIRTY_LOG_PAGES,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 99408618af1c..4351ee995a52 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8822,12 +8822,9 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	inode = &ei->vfs_inode;
 	extent_map_tree_init(&ei->extent_tree);
 	extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO, inode);
-	extent_io_tree_init(fs_info, &ei->io_failure_tree,
-			    IO_TREE_INODE_IO_FAILURE, inode);
 	extent_io_tree_init(fs_info, &ei->file_extent_tree,
 			    IO_TREE_INODE_FILE_EXTENT, inode);
 	ei->io_tree.track_uptodate = true;
-	ei->io_failure_tree.track_uptodate = true;
 	atomic_set(&ei->sync_writers, 0);
 	mutex_init(&ei->log_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index f068ff30d654..020ca1f7687a 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -82,7 +82,6 @@ struct btrfs_space_info;
 	EM( IO_TREE_FS_EXCLUDED_EXTENTS,  "EXCLUDED_EXTENTS")	    \
 	EM( IO_TREE_BTREE_INODE_IO,	  "BTREE_INODE_IO")	    \
 	EM( IO_TREE_INODE_IO,		  "INODE_IO")		    \
-	EM( IO_TREE_INODE_IO_FAILURE,	  "INODE_IO_FAILURE")	    \
 	EM( IO_TREE_RELOC_BLOCKS,	  "RELOC_BLOCKS")	    \
 	EM( IO_TREE_TRANS_DIRTY_PAGES,	  "TRANS_DIRTY_PAGES")      \
 	EM( IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES")   \
-- 
2.36.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 08/12] btrfs: switch buffered read to the new btrfs_read_repair_* based repair routine
  2022-04-27  7:18 ` [PATCH RFC v2 08/12] btrfs: switch buffered read to the new btrfs_read_repair_* based repair routine Qu Wenruo
@ 2022-04-27 13:59   ` kernel test robot
  2022-04-28 10:08   ` kernel test robot
  1 sibling, 0 replies; 38+ messages in thread
From: kernel test robot @ 2022-04-27 13:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: llvm, kbuild-all

Hi Qu,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on kdave/for-next]
[also build test WARNING on next-20220427]
[cannot apply to rostedt-trace/for-next v5.18-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-introduce-a-pure-data-checksum-checking-helper/20220427-161943
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: hexagon-randconfig-r041-20220427 (https://download.01.org/0day-ci/archive/20220427/202204272153.rxJ2vPvg-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 1cddcfdc3c683b393df1a5c9063252eb60e52818)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/3f389ea1be2d5c290c4b523743ca200983f45765
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qu-Wenruo/btrfs-introduce-a-pure-data-checksum-checking-helper/20220427-161943
        git checkout 3f389ea1be2d5c290c4b523743ca200983f45765
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash fs/btrfs/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/btrfs/extent_io.c:3078:3: warning: variable 'nbits' is used uninitialized whenever '?:' condition is true [-Wsometimes-uninitialized]
                   ASSERT(atomic_read(&ctrl->io_bytes) == 0);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/ctree.h:3620:3: note: expanded from macro 'ASSERT'
           (likely(expr) ? (void)0 : assertfail(#expr, __FILE__, __LINE__))
            ^~~~~~~~~~~~
   include/linux/compiler.h:77:20: note: expanded from macro 'likely'
   # define likely(x)      __builtin_expect(!!(x), 1)
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/extent_io.c:3098:46: note: uninitialized use occurs here
           for_each_set_bit(bit, ctrl->cur_bad_bitmap, nbits) {
                                                       ^~~~~
   include/linux/find.h:283:38: note: expanded from macro 'for_each_set_bit'
           for ((bit) = find_next_bit((addr), (size), 0);          \
                                               ^~~~
   fs/btrfs/extent_io.c:3078:3: note: remove the '?:' if its condition is always false
                   ASSERT(atomic_read(&ctrl->io_bytes) == 0);
                   ^
   fs/btrfs/ctree.h:3620:3: note: expanded from macro 'ASSERT'
           (likely(expr) ? (void)0 : assertfail(#expr, __FILE__, __LINE__))
            ^
   include/linux/compiler.h:77:20: note: expanded from macro 'likely'
   # define likely(x)      __builtin_expect(!!(x), 1)
                           ^
   fs/btrfs/extent_io.c:3064:20: note: initialize the variable 'nbits' to silence this warning
           unsigned int nbits;
                             ^
                              = 0
   1 warning generated.


vim +3078 fs/btrfs/extent_io.c

  3060	
  3061	static void read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
  3062	{
  3063		struct btrfs_fs_info *fs_info;
  3064		unsigned int nbits;
  3065		u32 sectorsize;
  3066		int bit;
  3067		int i;
  3068	
  3069		if (!ctrl->initialized)
  3070			return;
  3071	
  3072		/*
  3073		 * Got a critical -ENOMEM error preivously, no repair should have been
  3074		 * attempted.
  3075		 */
  3076		if (ctrl->error) {
  3077			ASSERT(bio_list_empty(&ctrl->bios));
> 3078			ASSERT(atomic_read(&ctrl->io_bytes) == 0);
  3079			goto mark_error;
  3080		}
  3081	
  3082		ASSERT(ctrl->inode);
  3083		fs_info = btrfs_sb(ctrl->inode->i_sb);
  3084		nbits = ctrl->bio_size >> fs_info->sectorsize_bits;
  3085		sectorsize = fs_info->sectorsize;
  3086	
  3087		/* Go through each remaining mirrors to do the repair */
  3088		for (i = get_next_mirror(ctrl->init_mirror, ctrl->num_copies);
  3089		     i != ctrl->init_mirror; i = get_next_mirror(i, ctrl->num_copies)) {
  3090			read_repair_from_one_mirror(ctrl, ctrl->inode, i);
  3091	
  3092			/* Check the error bitmap to see if no more corrupted sectors */
  3093			if (bitmap_all_zero(ctrl->cur_bad_bitmap, nbits))
  3094				break;
  3095		}
  3096	mark_error:
  3097		/* Finish the unrecovered bad sectors */
  3098		for_each_set_bit(bit, ctrl->cur_bad_bitmap, nbits) {
  3099			struct page *page;
  3100			unsigned int pgoff;
  3101			u64 file_offset = (bit << fs_info->sectorsize_bits) +
  3102					  ctrl->file_offset;
  3103	
  3104			page = read_repair_get_sector(ctrl, bit, &pgoff);
  3105	
  3106			end_page_read(page, false, file_offset, sectorsize);
  3107			unlock_extent_cached_atomic(&BTRFS_I(ctrl->inode)->io_tree,
  3108					file_offset, file_offset + sectorsize - 1, NULL);
  3109		}
  3110		kfree(ctrl->cur_bad_bitmap);
  3111		kfree(ctrl->prev_bad_bitmap);
  3112		ctrl->cur_bad_bitmap = NULL;
  3113		ctrl->prev_bad_bitmap = NULL;
  3114		ctrl->initialized = false;
  3115		ctrl->error = false;
  3116		ctrl->failed_bio = NULL;
  3117		ASSERT(bio_list_empty(&ctrl->bios));
  3118		ASSERT(atomic_read(&ctrl->io_bytes) == 0);
  3119	}
  3120	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting
  2022-04-27  7:18 ` [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting Qu Wenruo
@ 2022-04-28  5:16   ` Qu Wenruo
  2022-04-28 13:32   ` Christoph Hellwig
  1 sibling, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-28  5:16 UTC (permalink / raw)
  To: linux-btrfs



On 2022/4/27 15:18, Qu Wenruo wrote:
> Lower level bio mapping, from driver to dm, even btrfs chunk mapping
> can modify bio::bi_iter.
> 
> This prevents us from doing two things:
> 
> - Iterate the bio range of a cloned bio
>    This is only utilized by direct IO, thus it's already using
>    btrfs_bio::iter, which is populated in btrfs_bio_clone_partial().
> 
> - Grab the original logical bytenr of a bio
>    This will be utilized by incoming read repair patches.
> 
> So to make sure all btrfs_bio submitted to own a proper iter, this patch
> will assigned btrfs_bio::iter in the following call sites:
> 
> - btrfs_submit_dio_bio()
> - submit_one_bio()
> - submit_compressed_bio()
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Cc Christoph, this is definitely going to conflict with his further
cleanups.

Maybe I can update this patch to only set the iter for read bios?
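
Something like this at the submission call sites would do it (just a
rough sketch, not even compile tested):

	/* Only read bios need the saved iter for read repair. */
	if (btrfs_op(bio) == BTRFS_MAP_READ)
		btrfs_bio(bio)->iter = bio->bi_iter;

That would leave the write path as-is and hopefully reduce the overlap
with his cleanups.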

Thanks,
Qu

> ---
>   fs/btrfs/compression.c | 2 ++
>   fs/btrfs/extent_io.c   | 6 ++++++
>   fs/btrfs/inode.c       | 2 ++
>   3 files changed, 10 insertions(+)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 69c060dc024c..559bf53beaed 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -428,6 +428,8 @@ static blk_status_t submit_compressed_bio(struct btrfs_fs_info *fs_info,
>   	blk_status_t ret;
>   
>   	ASSERT(bio->bi_iter.bi_size);
> +	/* Check submit_one_bio() for the reason */
> +	btrfs_bio(bio)->iter = bio->bi_iter;
>   	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
>   	if (ret)
>   		return ret;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 07888cce3bce..a3962df30603 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -186,6 +186,12 @@ static void submit_one_bio(struct bio *bio, int mirror_num, unsigned long bio_fl
>   	/* Caller should ensure the bio has at least some range added */
>   	ASSERT(bio->bi_iter.bi_size);
>   
> +	/*
> +	 * Save current bi_iter into btrfs_bio::iter, as lower level (including
> +	 * btrfs chunk mapping) can modify bi_iter, preventing us from using
> +	 * bi_iter to iterate cloned bio or grab the original logical bytenr.
> +	 */
> +	btrfs_bio(bio)->iter = bio->bi_iter;
>   	if (is_data_inode(tree->private_data))
>   		btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
>   					    bio_flags);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b4cfd78f50e8..1b596de0c4e9 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7923,6 +7923,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
>   	bool write = btrfs_op(bio) == BTRFS_MAP_WRITE;
>   	blk_status_t ret;
>   
> +	btrfs_bio(bio)->iter = bio->bi_iter;
> +
>   	/* Check btrfs_submit_bio_hook() for rules about async submit. */
>   	if (async_submit)
>   		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers);


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair
  2022-04-27  7:18 ` [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair Qu Wenruo
@ 2022-04-28  5:20   ` Qu Wenruo
  2022-04-28 13:44     ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Qu Wenruo @ 2022-04-28  5:20 UTC (permalink / raw)
  To: linux-btrfs, Christoph Hellwig



On 2022/4/27 15:18, Qu Wenruo wrote:
> The new helper, read_repair_bio_add_sector(), will grab the page and
> page_offset, and queue the sector into
> btrfs_read_repair_ctrl::read_bios for later usage.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 107 +++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/extent_io.h |   6 +++
>   2 files changed, 113 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 6304f694c8d6..fbed78ffe8e1 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2732,6 +2732,110 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
>   		btrfs_subpage_end_reader(fs_info, page, start, len);
>   }
>   
> +static struct page *read_repair_get_sector(struct btrfs_read_repair_ctrl *ctrl,
> +					   int sector_nr, unsigned int *pgoff)
> +{
> +	const struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
> +	const u32 target_offset = sector_nr << fs_info->sectorsize_bits;
> +	struct bvec_iter iter;
> +	struct bio_vec bvec;
> +	u32 offset = 0;
> +
> +	ASSERT(pgoff);
> +	ASSERT((sector_nr << fs_info->sectorsize_bits) < ctrl->bio_size);
> +
> +	/*
> +	 * This is definitely not effecient, but I don't have better way
> +	 * to grab a specified bvec from a bio directly.
> +	 */

Also Cc to Christoph.

This function will get called very frequently, and I really want to
avoid re-searching from the beginning of the original bio every time.

Maybe we can cache a bvec_iter and use bi_size to check whether the
target offset is still ahead of the cached position (then just advance),
or rewind and re-search from the beginning otherwise.


I guess there is no existing helper to do the same work, right?
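
Roughly what I have in mind (purely a sketch; cached_iter and
cached_offset don't exist yet, they would be two new members of
btrfs_read_repair_ctrl):

	struct bio *failed_bio = ctrl->failed_bio;
	struct bio_vec bvec;

	/* Target is behind the cached position, rewind to the start. */
	if (target_offset < ctrl->cached_offset) {
		ctrl->cached_iter = btrfs_bio(failed_bio)->iter;
		ctrl->cached_offset = 0;
	}
	/* Advance to the target sector, cheap for the common forward walk. */
	bio_advance_iter(failed_bio, &ctrl->cached_iter,
			 target_offset - ctrl->cached_offset);
	ctrl->cached_offset = target_offset;

	bvec = bio_iter_iovec(failed_bio, ctrl->cached_iter);
	*pgoff = bvec.bv_offset;
	return bvec.bv_page;

This would replace the __bio_for_each_segment() loop at the end of
read_repair_get_sector().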

Thanks,
Qu

> +	__bio_for_each_segment(bvec, ctrl->failed_bio, iter,
> +			       btrfs_bio(ctrl->failed_bio)->iter) {
> +		if (target_offset - offset < bvec.bv_len) {
> +			*pgoff = bvec.bv_offset + (target_offset - offset);
> +			return bvec.bv_page;
> +		}
> +		offset += bvec.bv_len;
> +	}
> +	return NULL;
> +}
> +
> +static void read_repair_end_bio(struct bio *bio)
> +{
> +	struct btrfs_read_repair_ctrl *ctrl = bio->bi_private;
> +	const struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
> +	struct bvec_iter_all iter_all;
> +	struct bio_vec *bvec;
> +	u64 logical = btrfs_bio(bio)->iter.bi_sector << SECTOR_SHIFT;
> +	u32 offset = 0;
> +	bool uptodate = (bio->bi_status == BLK_STS_OK);
> +
> +	/* We should not have csum in bbio */
> +	ASSERT(!btrfs_bio(bio)->csum);
> +	bio_for_each_segment_all(bvec, bio, iter_all) {
> +		/*
> +		 * If we have a successful read, clear the error bit.
> +		 * In read_repair_finish(), we will re-check the csum
> +		 * (if exists) later.
> +		 */
> +		if (uptodate)
> +			clear_bit((logical + offset - ctrl->logical) >>
> +				  fs_info->sectorsize_bits,
> +				  ctrl->cur_bad_bitmap);
> +		atomic_sub(bvec->bv_len, &ctrl->io_bytes);
> +		wake_up(&ctrl->io_wait);
> +		offset += bvec->bv_len;
> +	}
> +	bio_put(bio);
> +}
> +
> +/* Add a sector into the read repair bios list for later submission */
> +static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
> +				       int sector_nr)
> +{
> +	const struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
> +	struct page *page;
> +	int pgoff;
> +	struct bio *bio;
> +	int ret;
> +
> +	page = read_repair_get_sector(ctrl, sector_nr, &pgoff);
> +	ASSERT(page);
> +
> +	/* Check if the sector can be added to the last bio */
> +	if (!bio_list_empty(&ctrl->read_bios)) {
> +		bio = ctrl->read_bios.tail;
> +		if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) + bio->bi_iter.bi_size ==
> +		    ctrl->logical + (sector_nr << fs_info->sectorsize_bits))
> +			goto add;
> +	}
> +	/*
> +	 * Here we want to know the logical bytenr at endio time, so we can
> +	 * update the bitmap.
> +	 * Unfortunately our bi_private will be used, and bi_iter is not
> +	 * reliable, thus we have to alloc btrfs_bio, even we just want
> +	 * logical bytenr.
> +	 */
> +	bio = btrfs_bio_alloc(BIO_MAX_VECS);
> +	/* It's backed by mempool, thus should not fail */
> +	ASSERT(bio);
> +
> +	bio->bi_opf = REQ_OP_READ;
> +	bio->bi_iter.bi_sector = ((sector_nr << fs_info->sectorsize_bits) +
> +				  ctrl->logical) >> SECTOR_SHIFT;
> +	bio->bi_private = ctrl;
> +	bio->bi_end_io = read_repair_end_bio;
> +	bio_list_add(&ctrl->read_bios, bio);
> +
> +add:
> +	ret = bio_add_page(bio, page, fs_info->sectorsize, pgoff);
> +	/*
> +	 * We allocated the read bio with enough bvecs to contain
> +	 * the original bio, thus it should not fail to add a sector.
> +	 */
> +	ASSERT(ret == fs_info->sectorsize);
> +	atomic_add(fs_info->sectorsize, &ctrl->io_bytes);
> +}
> +
>   static int read_repair_add_sector(struct inode *inode,
>   				  struct btrfs_read_repair_ctrl *ctrl,
>   				  struct bio *failed_bio, u32 bio_offset)
> @@ -2762,6 +2866,9 @@ static int read_repair_add_sector(struct inode *inode,
>   		ctrl->init_mirror = btrfs_bio(failed_bio)->mirror_num;
>   		ctrl->num_copies = btrfs_num_copies(fs_info, ctrl->logical,
>   						    sectorsize);
> +		init_waitqueue_head(&ctrl->io_wait);
> +		bio_list_init(&ctrl->read_bios);
> +		atomic_set(&ctrl->io_bytes, 0);
>   
>   		ctrl->cur_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
>   					fs_info->sectorsize_bits, GFP_NOFS);
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index eff008ba194f..4904229ee73a 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -121,6 +121,12 @@ struct btrfs_read_repair_ctrl {
>   	 */
>   	unsigned long *prev_bad_bitmap;
>   
> +	struct bio_list read_bios;
> +
> +	wait_queue_head_t io_wait;
> +
> +	atomic_t io_bytes;
> +
>   	/* The logical bytenr of the original bio. */
>   	u64 logical;
>   


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 08/12] btrfs: switch buffered read to the new btrfs_read_repair_* based repair routine
  2022-04-27  7:18 ` [PATCH RFC v2 08/12] btrfs: switch buffered read to the new btrfs_read_repair_* based repair routine Qu Wenruo
  2022-04-27 13:59   ` kernel test robot
@ 2022-04-28 10:08   ` kernel test robot
  2022-04-28 10:51       ` Qu Wenruo
  1 sibling, 1 reply; 38+ messages in thread
From: kernel test robot @ 2022-04-28 10:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: llvm, kbuild-all

Hi Qu,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on kdave/for-next]
[also build test WARNING on next-20220428]
[cannot apply to rostedt-trace/for-next v5.18-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-introduce-a-pure-data-checksum-checking-helper/20220427-161943
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: arm-randconfig-c002-20220428 (https://download.01.org/0day-ci/archive/20220428/202204281717.jzlTSpQj-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project c59473aacce38cd7dd77eebceaf3c98c5707ab3b)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/intel-lab-lkp/linux/commit/3f389ea1be2d5c290c4b523743ca200983f45765
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qu-Wenruo/btrfs-introduce-a-pure-data-checksum-checking-helper/20220427-161943
        git checkout 3f389ea1be2d5c290c4b523743ca200983f45765
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash fs/btrfs/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/btrfs/extent_io.c:3076:6: warning: variable 'nbits' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
           if (ctrl->error) {
               ^~~~~~~~~~~
   fs/btrfs/extent_io.c:3098:46: note: uninitialized use occurs here
           for_each_set_bit(bit, ctrl->cur_bad_bitmap, nbits) {
                                                       ^~~~~
   include/linux/find.h:284:16: note: expanded from macro 'for_each_set_bit'
                (bit) < (size);                                    \
                         ^~~~
   fs/btrfs/extent_io.c:3076:2: note: remove the 'if' if its condition is always false
           if (ctrl->error) {
           ^~~~~~~~~~~~~~~~~~
   fs/btrfs/extent_io.c:3064:20: note: initialize the variable 'nbits' to silence this warning
           unsigned int nbits;
                             ^
                              = 0
   1 warning generated.


vim +3076 fs/btrfs/extent_io.c

  3060	
  3061	static void read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
  3062	{
  3063		struct btrfs_fs_info *fs_info;
  3064		unsigned int nbits;
  3065		u32 sectorsize;
  3066		int bit;
  3067		int i;
  3068	
  3069		if (!ctrl->initialized)
  3070			return;
  3071	
  3072		/*
  3073		 * Got a critical -ENOMEM error preivously, no repair should have been
  3074		 * attempted.
  3075		 */
> 3076		if (ctrl->error) {
  3077			ASSERT(bio_list_empty(&ctrl->bios));
  3078			ASSERT(atomic_read(&ctrl->io_bytes) == 0);
  3079			goto mark_error;
  3080		}
  3081	
  3082		ASSERT(ctrl->inode);
  3083		fs_info = btrfs_sb(ctrl->inode->i_sb);
  3084		nbits = ctrl->bio_size >> fs_info->sectorsize_bits;
  3085		sectorsize = fs_info->sectorsize;
  3086	
  3087		/* Go through each remaining mirrors to do the repair */
  3088		for (i = get_next_mirror(ctrl->init_mirror, ctrl->num_copies);
  3089		     i != ctrl->init_mirror; i = get_next_mirror(i, ctrl->num_copies)) {
  3090			read_repair_from_one_mirror(ctrl, ctrl->inode, i);
  3091	
  3092			/* Check the error bitmap to see if no more corrupted sectors */
  3093			if (bitmap_all_zero(ctrl->cur_bad_bitmap, nbits))
  3094				break;
  3095		}
  3096	mark_error:
  3097		/* Finish the unrecovered bad sectors */
  3098		for_each_set_bit(bit, ctrl->cur_bad_bitmap, nbits) {
  3099			struct page *page;
  3100			unsigned int pgoff;
  3101			u64 file_offset = (bit << fs_info->sectorsize_bits) +
  3102					  ctrl->file_offset;
  3103	
  3104			page = read_repair_get_sector(ctrl, bit, &pgoff);
  3105	
  3106			end_page_read(page, false, file_offset, sectorsize);
  3107			unlock_extent_cached_atomic(&BTRFS_I(ctrl->inode)->io_tree,
  3108					file_offset, file_offset + sectorsize - 1, NULL);
  3109		}
  3110		kfree(ctrl->cur_bad_bitmap);
  3111		kfree(ctrl->prev_bad_bitmap);
  3112		ctrl->cur_bad_bitmap = NULL;
  3113		ctrl->prev_bad_bitmap = NULL;
  3114		ctrl->initialized = false;
  3115		ctrl->error = false;
  3116		ctrl->failed_bio = NULL;
  3117		ASSERT(bio_list_empty(&ctrl->bios));
  3118		ASSERT(atomic_read(&ctrl->io_bytes) == 0);
  3119	}
  3120	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 08/12] btrfs: switch buffered read to the new btrfs_read_repair_* based repair routine
  2022-04-28 10:08   ` kernel test robot
@ 2022-04-28 10:51       ` Qu Wenruo
  0 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-28 10:51 UTC (permalink / raw)
  To: kernel test robot; +Cc: llvm, kbuild-all

#syz test https://github.com/adam900710/linux read_repair

On 2022/4/28 18:08, kernel test robot wrote:
> Hi Qu,
> 
> [FYI, it's a private test report for your RFC patch.]
> [auto build test WARNING on kdave/for-next]
> [also build test WARNING on next-20220428]
> [cannot apply to rostedt-trace/for-next v5.18-rc4]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-introduce-a-pure-data-checksum-checking-helper/20220427-161943
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
> config: arm-randconfig-c002-20220428 (https://download.01.org/0day-ci/archive/20220428/202204281717.jzlTSpQj-lkp@intel.com/config)
> compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project c59473aacce38cd7dd77eebceaf3c98c5707ab3b)
> reproduce (this is a W=1 build):
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # install arm cross compiling tool for clang build
>          # apt-get install binutils-arm-linux-gnueabi
>          # https://github.com/intel-lab-lkp/linux/commit/3f389ea1be2d5c290c4b523743ca200983f45765
>          git remote add linux-review https://github.com/intel-lab-lkp/linux
>          git fetch --no-tags linux-review Qu-Wenruo/btrfs-introduce-a-pure-data-checksum-checking-helper/20220427-161943
>          git checkout 3f389ea1be2d5c290c4b523743ca200983f45765
>          # save the config file
>          mkdir build_dir && cp config build_dir/.config
>          COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash fs/btrfs/
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All warnings (new ones prefixed by >>):
> 
>>> fs/btrfs/extent_io.c:3076:6: warning: variable 'nbits' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>             if (ctrl->error) {
>                 ^~~~~~~~~~~
>     fs/btrfs/extent_io.c:3098:46: note: uninitialized use occurs here
>             for_each_set_bit(bit, ctrl->cur_bad_bitmap, nbits) {
>                                                         ^~~~~
>     include/linux/find.h:284:16: note: expanded from macro 'for_each_set_bit'
>                  (bit) < (size);                                    \
>                           ^~~~
>     fs/btrfs/extent_io.c:3076:2: note: remove the 'if' if its condition is always false
>             if (ctrl->error) {
>             ^~~~~~~~~~~~~~~~~~
>     fs/btrfs/extent_io.c:3064:20: note: initialize the variable 'nbits' to silence this warning
>             unsigned int nbits;
>                               ^
>                                = 0
>     1 warning generated.
> 
> 
> vim +3076 fs/btrfs/extent_io.c
> 
>    3060	
>    3061	static void read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
>    3062	{
>    3063		struct btrfs_fs_info *fs_info;
>    3064		unsigned int nbits;
>    3065		u32 sectorsize;
>    3066		int bit;
>    3067		int i;
>    3068	
>    3069		if (!ctrl->initialized)
>    3070			return;
>    3071	
>    3072		/*
>    3073		 * Got a critical -ENOMEM error preivously, no repair should have been
>    3074		 * attempted.
>    3075		 */
>> 3076		if (ctrl->error) {
>    3077			ASSERT(bio_list_empty(&ctrl->bios));
>    3078			ASSERT(atomic_read(&ctrl->io_bytes) == 0);
>    3079			goto mark_error;
>    3080		}
>    3081	
>    3082		ASSERT(ctrl->inode);
>    3083		fs_info = btrfs_sb(ctrl->inode->i_sb);
>    3084		nbits = ctrl->bio_size >> fs_info->sectorsize_bits;
>    3085		sectorsize = fs_info->sectorsize;
>    3086	
>    3087		/* Go through each remaining mirrors to do the repair */
>    3088		for (i = get_next_mirror(ctrl->init_mirror, ctrl->num_copies);
>    3089		     i != ctrl->init_mirror; i = get_next_mirror(i, ctrl->num_copies)) {
>    3090			read_repair_from_one_mirror(ctrl, ctrl->inode, i);
>    3091	
>    3092			/* Check the error bitmap to see if no more corrupted sectors */
>    3093			if (bitmap_all_zero(ctrl->cur_bad_bitmap, nbits))
>    3094				break;
>    3095		}
>    3096	mark_error:
>    3097		/* Finish the unrecovered bad sectors */
>    3098		for_each_set_bit(bit, ctrl->cur_bad_bitmap, nbits) {
>    3099			struct page *page;
>    3100			unsigned int pgoff;
>    3101			u64 file_offset = (bit << fs_info->sectorsize_bits) +
>    3102					  ctrl->file_offset;
>    3103	
>    3104			page = read_repair_get_sector(ctrl, bit, &pgoff);
>    3105	
>    3106			end_page_read(page, false, file_offset, sectorsize);
>    3107			unlock_extent_cached_atomic(&BTRFS_I(ctrl->inode)->io_tree,
>    3108					file_offset, file_offset + sectorsize - 1, NULL);
>    3109		}
>    3110		kfree(ctrl->cur_bad_bitmap);
>    3111		kfree(ctrl->prev_bad_bitmap);
>    3112		ctrl->cur_bad_bitmap = NULL;
>    3113		ctrl->prev_bad_bitmap = NULL;
>    3114		ctrl->initialized = false;
>    3115		ctrl->error = false;
>    3116		ctrl->failed_bio = NULL;
>    3117		ASSERT(bio_list_empty(&ctrl->bios));
>    3118		ASSERT(atomic_read(&ctrl->io_bytes) == 0);
>    3119	}
>    3120	
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 08/12] btrfs: switch buffered read to the new btrfs_read_repair_* based repair routine
@ 2022-04-28 10:51       ` Qu Wenruo
  0 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-28 10:51 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 5464 bytes --]

#syz test https://github.com/adam900710/linux read_repair

On 2022/4/28 18:08, kernel test robot wrote:
> Hi Qu,
> 
> [FYI, it's a private test report for your RFC patch.]
> [auto build test WARNING on kdave/for-next]
> [also build test WARNING on next-20220428]
> [cannot apply to rostedt-trace/for-next v5.18-rc4]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-introduce-a-pure-data-checksum-checking-helper/20220427-161943
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
> config: arm-randconfig-c002-20220428 (https://download.01.org/0day-ci/archive/20220428/202204281717.jzlTSpQj-lkp(a)intel.com/config)
> compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project c59473aacce38cd7dd77eebceaf3c98c5707ab3b)
> reproduce (this is a W=1 build):
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # install arm cross compiling tool for clang build
>          # apt-get install binutils-arm-linux-gnueabi
>          # https://github.com/intel-lab-lkp/linux/commit/3f389ea1be2d5c290c4b523743ca200983f45765
>          git remote add linux-review https://github.com/intel-lab-lkp/linux
>          git fetch --no-tags linux-review Qu-Wenruo/btrfs-introduce-a-pure-data-checksum-checking-helper/20220427-161943
>          git checkout 3f389ea1be2d5c290c4b523743ca200983f45765
>          # save the config file
>          mkdir build_dir && cp config build_dir/.config
>          COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash fs/btrfs/
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All warnings (new ones prefixed by >>):
> 
>>> fs/btrfs/extent_io.c:3076:6: warning: variable 'nbits' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>             if (ctrl->error) {
>                 ^~~~~~~~~~~
>     fs/btrfs/extent_io.c:3098:46: note: uninitialized use occurs here
>             for_each_set_bit(bit, ctrl->cur_bad_bitmap, nbits) {
>                                                         ^~~~~
>     include/linux/find.h:284:16: note: expanded from macro 'for_each_set_bit'
>                  (bit) < (size);                                    \
>                           ^~~~
>     fs/btrfs/extent_io.c:3076:2: note: remove the 'if' if its condition is always false
>             if (ctrl->error) {
>             ^~~~~~~~~~~~~~~~~~
>     fs/btrfs/extent_io.c:3064:20: note: initialize the variable 'nbits' to silence this warning
>             unsigned int nbits;
>                               ^
>                                = 0
>     1 warning generated.
> 
> 
> vim +3076 fs/btrfs/extent_io.c
> 
>    3060	
>    3061	static void read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
>    3062	{
>    3063		struct btrfs_fs_info *fs_info;
>    3064		unsigned int nbits;
>    3065		u32 sectorsize;
>    3066		int bit;
>    3067		int i;
>    3068	
>    3069		if (!ctrl->initialized)
>    3070			return;
>    3071	
>    3072		/*
>    3073		 * Got a critical -ENOMEM error preivously, no repair should have been
>    3074		 * attempted.
>    3075		 */
>> 3076		if (ctrl->error) {
>    3077			ASSERT(bio_list_empty(&ctrl->bios));
>    3078			ASSERT(atomic_read(&ctrl->io_bytes) == 0);
>    3079			goto mark_error;
>    3080		}
>    3081	
>    3082		ASSERT(ctrl->inode);
>    3083		fs_info = btrfs_sb(ctrl->inode->i_sb);
>    3084		nbits = ctrl->bio_size >> fs_info->sectorsize_bits;
>    3085		sectorsize = fs_info->sectorsize;
>    3086	
>    3087		/* Go through each remaining mirrors to do the repair */
>    3088		for (i = get_next_mirror(ctrl->init_mirror, ctrl->num_copies);
>    3089		     i != ctrl->init_mirror; i = get_next_mirror(i, ctrl->num_copies)) {
>    3090			read_repair_from_one_mirror(ctrl, ctrl->inode, i);
>    3091	
>    3092			/* Check the error bitmap to see if no more corrupted sectors */
>    3093			if (bitmap_all_zero(ctrl->cur_bad_bitmap, nbits))
>    3094				break;
>    3095		}
>    3096	mark_error:
>    3097		/* Finish the unrecovered bad sectors */
>    3098		for_each_set_bit(bit, ctrl->cur_bad_bitmap, nbits) {
>    3099			struct page *page;
>    3100			unsigned int pgoff;
>    3101			u64 file_offset = (bit << fs_info->sectorsize_bits) +
>    3102					  ctrl->file_offset;
>    3103	
>    3104			page = read_repair_get_sector(ctrl, bit, &pgoff);
>    3105	
>    3106			end_page_read(page, false, file_offset, sectorsize);
>    3107			unlock_extent_cached_atomic(&BTRFS_I(ctrl->inode)->io_tree,
>    3108					file_offset, file_offset + sectorsize - 1, NULL);
>    3109		}
>    3110		kfree(ctrl->cur_bad_bitmap);
>    3111		kfree(ctrl->prev_bad_bitmap);
>    3112		ctrl->cur_bad_bitmap = NULL;
>    3113		ctrl->prev_bad_bitmap = NULL;
>    3114		ctrl->initialized = false;
>    3115		ctrl->error = false;
>    3116		ctrl->failed_bio = NULL;
>    3117		ASSERT(bio_list_empty(&ctrl->bios));
>    3118		ASSERT(atomic_read(&ctrl->io_bytes) == 0);
>    3119	}
>    3120	
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 01/12] btrfs: introduce a pure data checksum checking helper
  2022-04-27  7:18 ` [PATCH RFC v2 01/12] btrfs: introduce a pure data checksum checking helper Qu Wenruo
@ 2022-04-28 13:26   ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2022-04-28 13:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

> +/*
> + * Just verify one sector of csum, without any extra handling.
> + *
> + * Functions like btrfs_verify_data_csum() have extra handling like setting
> + * page flags which is not suitable for call sites like direct IO.
> + *
> + * This pure csum checking allows us to utilize it for call sites which need
> + * to handle page for both buffered and direct IO.
> + */

I think this is way too much historic context that's probably soon
rather irrelevant.

I'd simplify it down to something like:

/*
 * Verify the checksum for a single sector without any extra action that
 * depend on the type of I/O.
 */

> +int btrfs_check_data_sector(struct btrfs_fs_info *fs_info, struct page *page,
> +			    u32 pgoff, u8 *csum_expected)
> +{
> +	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
> +	char *kaddr;
> +	const u32 len = fs_info->sectorsize;
> +	const u32 csum_size = fs_info->csum_size;
> +	u8 csum[BTRFS_CSUM_SIZE];
> +
> +	ASSERT(pgoff + len <= PAGE_SIZE);
> +
> +	kaddr = kmap_atomic(page);

Please switch to kmap_local_page while you touch this.
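For reference, a rough and untested sketch of the helper using
kmap_local_page(); the part below the mapping (digest plus memcmp) is
reconstructed here since the quote above is truncated:

int btrfs_check_data_sector(struct btrfs_fs_info *fs_info, struct page *page,
			    u32 pgoff, u8 *csum_expected)
{
	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
	char *kaddr;
	const u32 len = fs_info->sectorsize;
	const u32 csum_size = fs_info->csum_size;
	u8 csum[BTRFS_CSUM_SIZE];

	ASSERT(pgoff + len <= PAGE_SIZE);

	shash->tfm = fs_info->csum_shash;

	/* kmap_local_page()/kunmap_local() instead of the atomic variants */
	kaddr = kmap_local_page(page);
	crypto_shash_digest(shash, kaddr + pgoff, len, csum);
	kunmap_local(kaddr);

	if (memcmp(csum, csum_expected, csum_size) != 0)
		return -EIO;
	return 0;
}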

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting
  2022-04-27  7:18 ` [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting Qu Wenruo
  2022-04-28  5:16   ` Qu Wenruo
@ 2022-04-28 13:32   ` Christoph Hellwig
  2022-04-28 22:41     ` Qu Wenruo
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2022-04-28 13:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Apr 27, 2022 at 03:18:48PM +0800, Qu Wenruo wrote:
> Lower level bio mapping, from driver to dm, even btrfs chunk mapping
> can modify bio::bi_iter.
> 
> This prevents us from doing two things:
> 
> - Iterate the bio range of a cloned bio
>   This is only utilized by direct IO, thus it's already using
>   btrfs_bio::iter, which is populated in btrfs_bio_clone_partial().
> 
> - Grab the original logical bytenr of a bio
>   This will be utilized by incoming read repair patches.
> 
> So to make sure all submitted btrfs_bios own a proper iter, this patch
> will assign btrfs_bio::iter in the following call sites:

Independent of what we want to do with the saved iter in the future
I don't think this is an improvement.  The place where we can start
advancing the iter is after the bio is submit by btrfs_map_bio, so that
is the one central place where it should be saved.  I actually had
a patch like that in one my my wip branches:

---
From: Christoph Hellwig <hch@lst.de>
Subject: btrfs: centralize stashing away ->bi_iter in btrfs_map_bio

Once a bio is submitted to a driver, the driver can advance bi_iter.
Save to the copy in the btrfs_bio just before that can happen rather
than in random places high up in the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 7 +------
 fs/btrfs/volumes.c   | 3 +++
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 07888cce3bce6..002b1ea92e398 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2683,7 +2683,6 @@ int btrfs_repair_one_sector(struct inode *inode,
 	}
 
 	bio_add_page(repair_bio, page, failrec->len, pgoff);
-	repair_bbio->iter = repair_bio->bi_iter;
 
 	btrfs_debug(btrfs_sb(inode->i_sb),
 		    "repair read error: submitting new read to mirror %d",
@@ -3209,14 +3208,11 @@ struct bio *btrfs_bio_alloc(unsigned int nr_iovecs)
 
 struct bio *btrfs_bio_clone(struct block_device *bdev, struct bio *bio)
 {
-	struct btrfs_bio *bbio;
 	struct bio *new;
 
 	/* Bio allocation backed by a bioset does not fail */
 	new = bio_alloc_clone(bdev, bio, GFP_NOFS, &btrfs_bioset);
-	bbio = btrfs_bio(new);
-	btrfs_bio_init(bbio);
-	bbio->iter = bio->bi_iter;
+	btrfs_bio_init(btrfs_bio(new));
 	return new;
 }
 
@@ -3235,7 +3231,6 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size)
 	btrfs_bio_init(bbio);
 
 	bio_trim(bio, offset >> 9, size >> 9);
-	bbio->iter = bio->bi_iter;
 	return bio;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 748614b00ffa2..f282a58a12344 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6750,6 +6750,9 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	int total_devs;
 	struct btrfs_io_context *bioc = NULL;
 
+	/* stash away the iter as the low-level processing can advance it */
+	btrfs_bio(bio)->iter = bio->bi_iter;
+
 	length = bio->bi_iter.bi_size;
 	map_length = length;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 03/12] btrfs: remove duplicated parameters from submit_data_read_repair()
  2022-04-27  7:18 ` [PATCH RFC v2 03/12] btrfs: remove duplicated parameters from submit_data_read_repair() Qu Wenruo
@ 2022-04-28 13:32   ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2022-04-28 13:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 04/12] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-04-27  7:18 ` [PATCH RFC v2 04/12] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors Qu Wenruo
@ 2022-04-28 13:37   ` Christoph Hellwig
  2022-04-28 22:51     ` Qu Wenruo
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2022-04-28 13:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Apr 27, 2022 at 03:18:50PM +0800, Qu Wenruo wrote:
> Currently we only allocate two bitmaps and initialize various members,
> no real work done yet.

I'm rather worried about these allocations.  These are called from
the I/O completion work queues, which can be rather deadlock heavy.
Never mind that just failing an I/O repair/recovery when we are out
of memory seems like a rather bad idea.

> +	if (!ctrl->initialized) {

I don't think you need the initialize field.  Just check for
failed_bio being non-NULL to simplify this.

> +		const u32 sectorsize = fs_info->sectorsize;
> +
> +		ASSERT(ctrl->cur_bad_bitmap == NULL &&
> +		       ctrl->prev_bad_bitmap == NULL);
> +		/*
> +		 * failed_bio->bi_iter is not reliable at endio time, thus we
> +		 * must rely on btrfs_bio::iter to grab the original logical
> +		 * bytenr.
> +		 */
> +		ASSERT(btrfs_bio(failed_bio)->iter.bi_size);

Also things would be lot more readable if the code inside this branch
just moved into a helper that you call if ->failed_bio is not set.

> +		ctrl->cur_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
> +					fs_info->sectorsize_bits, GFP_NOFS);
> +		ctrl->prev_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
> +					fs_info->sectorsize_bits, GFP_NOFS);
> +		/* Just set the error bit, so we will never try repair */
> +		if (!ctrl->cur_bad_bitmap || !ctrl->prev_bad_bitmap) {
> +			kfree(ctrl->cur_bad_bitmap);
> +			kfree(ctrl->prev_bad_bitmap);
> +			ctrl->cur_bad_bitmap = NULL;
> +			ctrl->prev_bad_bitmap = NULL;
> +			ctrl->error = true;
> +		}

I don't think we need the extra error value either, you can just check
one of the bitmap pointers for NULL.  That being said, as mentioned
above I'm really worried about these huge allocations that can fail.
I think we need a mempool of some fixed size here and use that, and just
change the algorithm to work in chunks based on this upper bound.
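For illustration, a rough sketch of that direction; the constant, the pool
size and the member layout below are all made up for the example:

/* Arbitrary fixed upper bound of sectors tracked per repair pass. */
#define BTRFS_READ_REPAIR_MAX_SECTORS	256

struct btrfs_read_repair_ctrl {
	/* ... existing members ... */
	unsigned long cur_bad_bitmap[BITS_TO_LONGS(BTRFS_READ_REPAIR_MAX_SECTORS)];
	unsigned long prev_bad_bitmap[BITS_TO_LONGS(BTRFS_READ_REPAIR_MAX_SECTORS)];
};

static mempool_t *read_repair_pool;

int __init btrfs_read_repair_init(void)
{
	/* A handful of preallocated objects guarantees forward progress. */
	read_repair_pool = mempool_create_kmalloc_pool(4,
				sizeof(struct btrfs_read_repair_ctrl));
	return read_repair_pool ? 0 : -ENOMEM;
}

/* In the endio worker: may sleep, but never fails outright. */
ctrl = mempool_alloc(read_repair_pool, GFP_NOFS);
/* ... repair one chunk of at most BTRFS_READ_REPAIR_MAX_SECTORS ... */
mempool_free(ctrl, read_repair_pool);

A larger bio would then be processed in several such fixed-size chunks
instead of one allocation sized to the whole bio.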

> +/* Structure for data read time repair. */
> +struct btrfs_read_repair_ctrl {

Can we keep that structure private?  Based on the rest of the series
there actually is a fair amount of code using it, what about isolating
it in a new read_repair.c instead of in the giant extent_io.c and
inode.c files?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair
  2022-04-28  5:20   ` Qu Wenruo
@ 2022-04-28 13:44     ` Christoph Hellwig
  2022-04-28 22:55       ` Qu Wenruo
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2022-04-28 13:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Christoph Hellwig

On Thu, Apr 28, 2022 at 01:20:37PM +0800, Qu Wenruo wrote:
> This function will get called very frequently, and I really want to avoid
> doing such re-search every time from the beginning of the original bio.
> 
> Maybe we can cache a bvec_iter and using the bi_size to check if the target
> offset is still beyond us (then advance), or re-wind and re-search from the
> beginning.
> 
> I guess there is no existing helper to do the same work, right?

It is basically impossible to review this because you just add a
standalone static helper without the callers.  Please split the
work into logical chunks that actually are reviewable.  Yes, that will
be a pretty large patch, but that's still much better than having to
jump around half a dozen ones.

No, there is no way to efficiently look up what bvec in a bio some
offset falls into, because the bvecs are variable length.

It seems like the caller (at least the one added a little later, not
sure if there are more) is iterating over the bitmap and then calls
this for every bit set.  So for that you're much better off making
the __bio_for_each_segment the main loop, and then at the beginning of
the loop checking the bitmap if we need to handle this sector.
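Roughly like this, using the saved iter as the starting point;
read_repair_add_sector() is a made-up name for whatever queues one sector:

struct bio_vec bvec;
struct bvec_iter iter;
u32 offset = 0;

/* Walk the failed bio once and only handle sectors marked bad. */
__bio_for_each_segment(bvec, failed_bio, iter, btrfs_bio(failed_bio)->iter) {
	u32 pgoff;

	for (pgoff = bvec.bv_offset;
	     pgoff < bvec.bv_offset + bvec.bv_len;
	     pgoff += sectorsize, offset += sectorsize) {
		int bit = offset >> fs_info->sectorsize_bits;

		if (!test_bit(bit, ctrl->cur_bad_bitmap))
			continue;
		read_repair_add_sector(ctrl, bvec.bv_page, pgoff, bit);
	}
}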


> > +	struct bio_list read_bios;

I'd just call this bios.  Obviously they are used for reading here.

Also please be very careful about deadlocks.  The mempool for the
btrfs_bios is small right now, you need to size it up by the
largest number of bios that can be on this list.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 10/12] btrfs: cleanup btrfs_repair_one_sector()
  2022-04-27  7:18 ` [PATCH RFC v2 10/12] btrfs: cleanup btrfs_repair_one_sector() Qu Wenruo
@ 2022-04-28 13:45   ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2022-04-28 13:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

The next two patches should really be merged with this one as all that
infrastructure is related.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting
  2022-04-28 13:32   ` Christoph Hellwig
@ 2022-04-28 22:41     ` Qu Wenruo
  2022-04-29 15:09       ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Qu Wenruo @ 2022-04-28 22:41 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/4/28 21:32, Christoph Hellwig wrote:
> On Wed, Apr 27, 2022 at 03:18:48PM +0800, Qu Wenruo wrote:
>> Lower level bio mapping, from driver to dm, even btrfs chunk mapping
>> can modify bio::bi_iter.
>>
>> This prevents us from doing two things:
>>
>> - Iterate the bio range of a cloned bio
>>    This is only utilized by direct IO, thus it's already using
>>    btrfs_bio::iter, which is populated in btrfs_bio_clone_partial().
>>
>> - Grab the original logical bytenr of a bio
>>    This will be utilized by incoming read repair patches.
>>
>> So to make sure all submitted btrfs_bios own a proper iter, this patch
>> will assign btrfs_bio::iter in the following call sites:
>
> Independent of what we want to do with the saved iter in the future
> I don't think this is an improvement.  The place where we can start
> advancing the iter is after the bio is submit by btrfs_map_bio, so that
> is the one central place where it should be saved.  I actually had
> a patch like that in one my my wip branches:
>
> ---
> From: Christoph Hellwig <hch@lst.de>
> Subject: btrfs: centralize stashing away ->bi_iter in btrfs_map_bio
>
> Once a bio is submitted to a driver, the driver can advance bi_iter.
> Save to the copy in the btrfs_bio just before that can happen rather
> than in random places high up in the stack.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/btrfs/extent_io.c | 7 +------
>   fs/btrfs/volumes.c   | 3 +++
>   2 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 07888cce3bce6..002b1ea92e398 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2683,7 +2683,6 @@ int btrfs_repair_one_sector(struct inode *inode,
>   	}
>
>   	bio_add_page(repair_bio, page, failrec->len, pgoff);
> -	repair_bbio->iter = repair_bio->bi_iter;
>
>   	btrfs_debug(btrfs_sb(inode->i_sb),
>   		    "repair read error: submitting new read to mirror %d",
> @@ -3209,14 +3208,11 @@ struct bio *btrfs_bio_alloc(unsigned int nr_iovecs)
>
>   struct bio *btrfs_bio_clone(struct block_device *bdev, struct bio *bio)
>   {
> -	struct btrfs_bio *bbio;
>   	struct bio *new;
>
>   	/* Bio allocation backed by a bioset does not fail */
>   	new = bio_alloc_clone(bdev, bio, GFP_NOFS, &btrfs_bioset);
> -	bbio = btrfs_bio(new);
> -	btrfs_bio_init(bbio);
> -	bbio->iter = bio->bi_iter;
> +	btrfs_bio_init(btrfs_bio(new));
>   	return new;
>   }
>
> @@ -3235,7 +3231,6 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size)
>   	btrfs_bio_init(bbio);
>
>   	bio_trim(bio, offset >> 9, size >> 9);
> -	bbio->iter = bio->bi_iter;
>   	return bio;
>   }
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 748614b00ffa2..f282a58a12344 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6750,6 +6750,9 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
>   	int total_devs;
>   	struct btrfs_io_context *bioc = NULL;
>
> +	/* stash away the iter as the low-level processing can advance it */
> +	btrfs_bio(bio)->iter = bio->bi_iter;

The problem here is, if any endio function needs to grab the original
bio, and btrfs submit bio hooks failed before btrfs_map_bio(), like some
failure from btrfs_bio_wq_end_io(), then we will call bio_endio(), and
the endio just got an empty iter.

Thanks,
Qu

> +
>   	length = bio->bi_iter.bi_size;
>   	map_length = length;
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 04/12] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-04-28 13:37   ` Christoph Hellwig
@ 2022-04-28 22:51     ` Qu Wenruo
  2022-04-29  0:09       ` Qu Wenruo
  2022-04-29 15:11       ` Christoph Hellwig
  0 siblings, 2 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-28 22:51 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/4/28 21:37, Christoph Hellwig wrote:
> On Wed, Apr 27, 2022 at 03:18:50PM +0800, Qu Wenruo wrote:
>> Currently we only allocate two bitmaps and initialize various members,
>> no real work done yet.
>
> I'm rather worried about these allocations.  These are called from
> the I/O completion work queues, which can be rather deadlock heavy.
> Never mind that just failing an I/O repair/recovery when we are out
> of memory seems like a rather bad idea.

That's why there is a btrfs_read_repair_ctrl::error member.

If we failed the memory allocation, then we will not do any repair.

To me, memory allocation is a much bigger problem.


Although we can put the bitmap into btrfs_bio structure, and
pre-allocate it for every bio we're going to submit.

But I'm not sure if the pre-allocation way is a good idea, considering
read-repair should be a relatively code path.

>
>> +	if (!ctrl->initialized) {
>
> I don't think you need the initialize field.  Just check for
> failed_bio being non-NULL to simplify this.

That's indeed simpler.

>
>> +		const u32 sectorsize = fs_info->sectorsize;
>> +
>> +		ASSERT(ctrl->cur_bad_bitmap == NULL &&
>> +		       ctrl->prev_bad_bitmap == NULL);
>> +		/*
>> +		 * failed_bio->bi_iter is not reliable at endio time, thus we
>> +		 * must rely on btrfs_bio::iter to grab the original logical
>> +		 * bytenr.
>> +		 */
>> +		ASSERT(btrfs_bio(failed_bio)->iter.bi_size);
>
> Also things would be lot more readable if the code inside this branch
> just moved into a helper that you call if ->failed_bio is not set.

Indeed.

>
>> +		ctrl->cur_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
>> +					fs_info->sectorsize_bits, GFP_NOFS);
>> +		ctrl->prev_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
>> +					fs_info->sectorsize_bits, GFP_NOFS);
>> +		/* Just set the error bit, so we will never try repair */
>> +		if (!ctrl->cur_bad_bitmap || !ctrl->prev_bad_bitmap) {
>> +			kfree(ctrl->cur_bad_bitmap);
>> +			kfree(ctrl->prev_bad_bitmap);
>> +			ctrl->cur_bad_bitmap = NULL;
>> +			ctrl->prev_bad_bitmap = NULL;
>> +			ctrl->error = true;
>> +		}
>
> I don't think we need the extra error value either, you can just check
> one of the bitmap pointers for NULL.  That being said, as mentioned
> above I'm really worried about these huge allocations that can fail.
> I think we need a mempool of some fixed size here and use that, and just
> change the algorithm to work in chunks based on this upper bound.
>
>> +/* Structure for data read time repair. */
>> +struct btrfs_read_repair_ctrl {
>
> Can we keep that structure private?  Based on the rest of the series
> there actually is a fair amount of code using it, what about isolating
> it in a new read_repair.c instead of in the giant extent_io.c and
> inode.c files?

I was considering putting it into read_repair.c, and since you're also
mentioning that, I guess it's a good idea now.

And if we're going to make that structure private, I guess we have to
pre-allocate it in btrfs_bio as a pointer then.

Thanks,
Qu


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair
  2022-04-28 13:44     ` Christoph Hellwig
@ 2022-04-28 22:55       ` Qu Wenruo
  2022-04-29  7:11         ` Qu Wenruo
  2022-04-29 15:14         ` Christoph Hellwig
  0 siblings, 2 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-28 22:55 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/4/28 21:44, Christoph Hellwig wrote:
> On Thu, Apr 28, 2022 at 01:20:37PM +0800, Qu Wenruo wrote:
>> This function will get called very frequently, and I really want to avoid
>> doing such re-search every time from the beginning of the original bio.
>>
>> Maybe we can cache a bvec_iter and using the bi_size to check if the target
>> offset is still beyond us (then advance), or re-wind and re-search from the
>> beginning.
>>
>> I guess there is no existing helper to do the same work, right?
>
> It is basically impossible to review this because you just add a
> standalone static helper without the callers.  Please split the
> work into logical chunks that actually are reviewable.  Yes, that will
> be a pretty large patch, but that's still much better than having to
> jump around half a dozen ones.
>
> No, there is no way to efficiently look up what bvec in a bio some
> offset falls into, because the bvecs are variable length.
>
> It seems like the caller (at least the one added a little later, not
> sure if there are more) is iterating over the bitmap and then calls
> this for every bit set.  So for that you're much better off making
> the __bio_for_each_segment the main loop, and then at the beginning of
> the loop checking the bitmap if we need to handle this sector.
>
>
>>> +	struct bio_list read_bios;
>
> I'd just call this bios.  Obviously they are used for reading here.
>
> Also please be very careful about deadlocks.  The mempool for the
> btrfs_bios is small right now, you need to size it up by the
> largest number of bios that can be on this list.

In fact I have a version locally that checks the return value from
btrfs_bio_alloc(): if we fail to alloc memory, we just mark the
btrfs_read_repair_ctrl::error bit and mark all remaining bad sectors as
error, no more repairing.

As memory allocation failure is much more of a problem than a failed read repair.

Another consideration is, would it really dead lock?

We're only called for read path, not writeback path.
IIRC it's easier to hit dead lock at writeback path, as writeback can be
triggered by memory pressure.

But would the same problem happen just for read path?

Thanks,
Qu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 04/12] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-04-28 22:51     ` Qu Wenruo
@ 2022-04-29  0:09       ` Qu Wenruo
  2022-04-29 15:12         ` Christoph Hellwig
  2022-04-29 15:11       ` Christoph Hellwig
  1 sibling, 1 reply; 38+ messages in thread
From: Qu Wenruo @ 2022-04-29  0:09 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/4/29 06:51, Qu Wenruo wrote:
>
>
> On 2022/4/28 21:37, Christoph Hellwig wrote:
>> On Wed, Apr 27, 2022 at 03:18:50PM +0800, Qu Wenruo wrote:
>>> Currently we only allocate two bitmaps and initialize various members,
>>> no real work done yet.
>>
>> I'm rather worried about these allocations.  These are called from
>> the I/O completion work queues, which can be rather deadlock heavy.
>> Never mind that just failing an I/O repair/recovery when we are out
>> of memory seems like a rather bad idea.
>
> That's why there is a btrfs_read_repair_ctrl::error member.
>
> If we failed the memory allocation, then we will not do any repair.
>
> To me, memory allocation is a much bigger problem.
>
>
> Although we can put the bitmap into btrfs_bio structure, and
> pre-allocate it for every bio we're going to submit.
>
> But I'm not sure if the pre-allocation way is a good idea, considering
> read-repair should be a relatively code path.

s/code/cold/

But at least for the bio submission path, we have way less strict
context, thus it's indeed a better call site.

On the other hand, we're already allocating memory inside endio
function, for a long long time.

Check endio_readpage_release_extent(), they all need to pre-allocate
memory for extent tree update (although using mostly ATOMIC gfp flags).

I guess it's an abuse and we would like to remove it in the long run?

Thanks,
Qu

>
>>
>>> +    if (!ctrl->initialized) {
>>
>> I don't think you need the initialize field.  Just check for
>> failed_bio being non-NULL to simplify this.
>
> That's indeed simpler.
>
>>
>>> +        const u32 sectorsize = fs_info->sectorsize;
>>> +
>>> +        ASSERT(ctrl->cur_bad_bitmap == NULL &&
>>> +               ctrl->prev_bad_bitmap == NULL);
>>> +        /*
>>> +         * failed_bio->bi_iter is not reliable at endio time, thus we
>>> +         * must rely on btrfs_bio::iter to grab the original logical
>>> +         * bytenr.
>>> +         */
>>> +        ASSERT(btrfs_bio(failed_bio)->iter.bi_size);
>>
>> Also things would be lot more readable if the code inside this branch
>> just moved into a helper that you call if ->failed_bio is not set.
>
> Indeed.
>
>>
>>> +        ctrl->cur_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
>>> +                    fs_info->sectorsize_bits, GFP_NOFS);
>>> +        ctrl->prev_bad_bitmap = bitmap_alloc(ctrl->bio_size >>
>>> +                    fs_info->sectorsize_bits, GFP_NOFS);
>>> +        /* Just set the error bit, so we will never try repair */
>>> +        if (!ctrl->cur_bad_bitmap || !ctrl->prev_bad_bitmap) {
>>> +            kfree(ctrl->cur_bad_bitmap);
>>> +            kfree(ctrl->prev_bad_bitmap);
>>> +            ctrl->cur_bad_bitmap = NULL;
>>> +            ctrl->prev_bad_bitmap = NULL;
>>> +            ctrl->error = true;
>>> +        }
>>
>> I don't think we need the extra error value either, you can just check
>> one of the bitmap pointers for NULL.  That being said, as mentioned
>> above I'm really worried about these huge allocations that can fail.
>> I think we need a mempool of some fixed size here and use that, and just
>> change the algorithm to work in chunks based on this upper bound.
>>
>>> +/* Structure for data read time repair. */
>>> +struct btrfs_read_repair_ctrl {
>>
>> Can we keep that structure private?  Based on the rest of the series
>> there actually is a fair amount of code using it, what about isolating
>> it in a new read_repair.c instead of in the giant extent_io.c and
>> inode.c files?
>
> I was considering putting it into read_repair.c, and since you're also
> mentioning that, I guess it's a good idea now.
>
> And if we're going to make that structure private, I guess we have to
> pre-allocate it in btrfs_bio as a pointer then.
>
> Thanks,
> Qu
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair
  2022-04-28 22:55       ` Qu Wenruo
@ 2022-04-29  7:11         ` Qu Wenruo
  2022-04-29 15:14         ` Christoph Hellwig
  1 sibling, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-29  7:11 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/4/29 06:55, Qu Wenruo wrote:
>
>
> On 2022/4/28 21:44, Christoph Hellwig wrote:
>> On Thu, Apr 28, 2022 at 01:20:37PM +0800, Qu Wenruo wrote:
>>> This function will get called very frequently, and I really want to
>>> avoid
>>> doing such re-search every time from the beginning of the original bio.
>>>
>>> Maybe we can cache a bvec_iter and using the bi_size to check if the
>>> target
>>> offset is still beyond us (then advance), or re-wind and re-search
>>> from the
>>> beginning.
>>>
>>> I guess there is no existing helper to do the same work, right?
>>
>> It is basically impossible to review this because you just add a
>> standalone static helper without the callers.  Please split the
>> work into logical chunks that actually are reviewable.  Yes, that will
>> be a pretty large patch, but that's still much better than having to
>> jump around half a dozen ones.
>>
>> No, there is no way to efficiently look up what bvec in a bio some
>> offset falls into, because the bvecs are variable length.
>>
>> It seems like the caller (at least the one added a little later, not
>> sure if there are more) is iterating over the bitmap and then calls
>> this for every bit set.  So for that you're much better off making
>> the __bio_for_each_segment the main loop, and then at the beginning of
>> the loop checking the bitmap if we need to handle this sector.
>>
>>
>>>> +    struct bio_list read_bios;
>>
>> I'd just call this bios.  Obviously they are used for reading here.
>>
>> Also please be very careful about deadlocks.  The mempool for the
>> btrfs_bios is small right now, you need to size it up by the
>> largest number of bios that can be on this list.

Forgot to mention one thing here.

The old behavior will also allocate new bios for repair, and it's even
worse: check out the old btrfs_repair_one_sector(), which will always
allocate a btrfs_bio for every corrupted *sector*.

While in our new code, that is only the worst case.

If we have adjacent corrupted sectors, we will just append the new
sector to the existing bio.

So if we could deadlock, I believe we would hit it more frequently with
the old code.

Thus the new behavior is no worse than the old one, except for the extra
bitmap allocation (which I'll try to move into btrfs_bio then, maybe
using a union to share the csum pointer).

Thanks,
Qu

>
> In fact I have a version locally that checks the return value from
> btrfs_bio_alloc(): if we fail to alloc memory, we just mark the
> btrfs_read_repair_ctrl::error bit and mark all remaining bad sectors as
> error, no more repairing.
>
> As memory allocation failure is much more of a problem than a failed read
> repair.
>
> Another consideration is, would it really dead lock?
>
> We're only called for read path, not writeback path.
> IIRC it's easier to hit dead lock at writeback path, as writeback can be
> triggered by memory pressure.
>
> But would the same problem happen just for read path?
>
> Thanks,
> Qu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting
  2022-04-28 22:41     ` Qu Wenruo
@ 2022-04-29 15:09       ` Christoph Hellwig
  2022-04-29 23:04         ` Qu Wenruo
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2022-04-29 15:09 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Fri, Apr 29, 2022 at 06:41:15AM +0800, Qu Wenruo wrote:
> The problem here is, if any endio function needs to grab the original
> bio, and btrfs submit bio hooks failed before btrfs_map_bio(), like some
> failure from btrfs_bio_wq_end_io(), then we will call bio_endio(), and
> the endio just got an empty iter.

True.  But the only places that use it are direct I/O read repair (and
with this series, buffered I/O read repair), and those should not run
when we fail to even submit the bio.  We currently do, but maybe we
need to fix that first?  In fact with my pending bio cleanup series
we should be more or less there.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 04/12] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-04-28 22:51     ` Qu Wenruo
  2022-04-29  0:09       ` Qu Wenruo
@ 2022-04-29 15:11       ` Christoph Hellwig
  1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2022-04-29 15:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

> > the I/O completion work queues, which can be rather deadlock heavy.
> > Never mind that just failing an I/O repair/recovery when we are out
> > of memory seems like a rather bad idea.
> 
> That's why there is a btrfs_read_repair_ctrl::error member.
> 
> If we failed the memory allocation, then we will not do any repair.

And that's a little broken, isn't it?

> To me, memory allocation is a much bigger problem.
> 
> 
> Although we can put the bitmap into btrfs_bio structure, and
> pre-allocate it for every bio we're going to submit.
> 
> But I'm not sure if the pre-allocation way is a good idea, considering
> read-repair should be a relatively code path.

And that is what a mempool is for.

> 
> > it in a new read_repair.c instead of in the giant extent_io.c and
> > inode.c files?
> 
> I was considering putting it into read_repair.c, and since you're also
> mentioning that, I guess it's a good idea now.
> 
> And if we're going to make that structure private, I guess we have to
> pre-allocate it in btrfs_bio as a pointer then.

I really think a mempool is the right choice here.  This gives us a
guaranteed number of objects to make progress, even allowing for more
if memory pressure allows for it.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 04/12] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-04-29  0:09       ` Qu Wenruo
@ 2022-04-29 15:12         ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2022-04-29 15:12 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Fri, Apr 29, 2022 at 08:09:38AM +0800, Qu Wenruo wrote:
> On the other hand, we're already allocating memory inside endio
> function, for a long long time.
> 
> Check endio_readpage_release_extent(), they all need to pre-allocate
> memory for extent tree update (although using mostly ATOMIC gfp flags).
> 
> I guess it's an abuse and we would like to remove it in the long run?

Yes, we should handle it.  In practice small GFP_ATOMIC allocations do
not fail, so this probably hasn't been much of an issue.

Allocations that can be rather large, like the bitmaps, on the other
hand do have a real chance of failing under enough memory pressure.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair
  2022-04-28 22:55       ` Qu Wenruo
  2022-04-29  7:11         ` Qu Wenruo
@ 2022-04-29 15:14         ` Christoph Hellwig
  2022-04-29 23:08           ` Qu Wenruo
  2022-05-01 23:59           ` Qu Wenruo
  1 sibling, 2 replies; 38+ messages in thread
From: Christoph Hellwig @ 2022-04-29 15:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Fri, Apr 29, 2022 at 06:55:59AM +0800, Qu Wenruo wrote:
> Another consideration is, would it really dead lock?
> 
> We're only called for read path, not writeback path.
> IIRC it's easier to hit dead lock at writeback path, as writeback can be
> triggered by memory pressure.
> 
> But would the same problem happen just for read path?

A system with severe memory pressure needs to page something in to get
something out.  The readpage uses the last available bio in the btrfs
bioset, but that read now needs a read repair, which needs to allocate
another bio from the bio_set -> deadlock.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting
  2022-04-29 15:09       ` Christoph Hellwig
@ 2022-04-29 23:04         ` Qu Wenruo
  0 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-29 23:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs



On 2022/4/29 23:09, Christoph Hellwig wrote:
> On Fri, Apr 29, 2022 at 06:41:15AM +0800, Qu Wenruo wrote:
>> The problem here is, if any endio function needs to grab the original
>> bio, and btrfs submit bio hooks failed before btrfs_map_bio(), like some
>> failure from btrfs_bio_wq_end_io(), then we will call bio_endio(), and
>> the endio just got an empty iter.
>
> True.  But the only places that use it are direct I/O read repair (and
> with this series, buffered I/O read repair), and those should not run
> when we fail to even submit the bio.  We currently do, but maybe we
> need to fix that first?  In fact with my pending bio cleanup series
> we should be more or less there.

OK, if we decide to make sure we only save the iter after
btrfs_map_bio(), then I'm completely fine with that.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair
  2022-04-29 15:14         ` Christoph Hellwig
@ 2022-04-29 23:08           ` Qu Wenruo
  2022-05-01 23:59           ` Qu Wenruo
  1 sibling, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-04-29 23:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs



On 2022/4/29 23:14, Christoph Hellwig wrote:
> On Fri, Apr 29, 2022 at 06:55:59AM +0800, Qu Wenruo wrote:
>> Another consideration is, would it really dead lock?
>>
>> We're only called for read path, not writeback path.
>> IIRC it's easier to hit dead lock at writeback path, as writeback can be
>> triggered by memory pressure.
>>
>> But would the same problem happen just for read path?
>
> A system with severe memory pressure needs to page something in to get
> something out.  The readpage uses the last available bio in the btrfs
> bioset, but that read now needs a read repair, which needs to allocate
> another bio from the bio_set -> deadlock.

Thanks for the reason behind that.

Then I'm wondering why the original code is not causing problems (or at
least not so obviously), even though the old code is allocating more bios.

Could it be the fact that the old code always submits the bio
immediately after allocation?

If so, I can get rid of the bio list and just hold the last bio for bio
merging, and submit the current one if we get an unmergeable sector.
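Something along these lines; the cur_bio member and the read_repair_*
helpers are made-up names, and logical/page/pgoff/sectorsize are assumed
to come from the surrounding sector loop:

/* Queue one bad sector; keep at most one repair bio open at a time. */
if (ctrl->cur_bio &&
    !read_repair_bio_mergeable(ctrl->cur_bio, logical)) {
	/* Not contiguous with the open bio, send it out first. */
	read_repair_submit_bio(ctrl, ctrl->cur_bio);
	ctrl->cur_bio = NULL;
}
if (!ctrl->cur_bio) {
	ctrl->cur_bio = btrfs_bio_alloc(BIO_MAX_VECS);
	ctrl->cur_bio->bi_iter.bi_sector = logical >> SECTOR_SHIFT;
}
bio_add_page(ctrl->cur_bio, page, sectorsize, pgoff);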

Thanks,
Qu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair
  2022-04-29 15:14         ` Christoph Hellwig
  2022-04-29 23:08           ` Qu Wenruo
@ 2022-05-01 23:59           ` Qu Wenruo
  2022-05-02 16:45             ` Christoph Hellwig
  1 sibling, 1 reply; 38+ messages in thread
From: Qu Wenruo @ 2022-05-01 23:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs



On 2022/4/29 23:14, Christoph Hellwig wrote:
> On Fri, Apr 29, 2022 at 06:55:59AM +0800, Qu Wenruo wrote:
>> Another consideration is, would it really dead lock?
>>
>> We're only called for read path, not writeback path.
>> IIRC it's easier to hit dead lock at writeback path, as writeback can be
>> triggered by memory pressure.
>>
>> But would the same problem happen just for read path?
>
> A system with severe memory pressure needs to page something in to get
> something out.  The readpage uses the last available bio in the btrfs
> bioset, but that read now needs a read repair, which needs to allocate
> another bio from the bio_set -> deadlock.

So with your previous mention on mempool, did you mean that we allocate
another mempool for read repair only?

With the extra read repair mempool, even if we exhaust the last btrfs bio
in the pool, we still have something like a btrfs read repair bio?
(And we can get rid of the extra members unused in btrfs bio, since what
we really need is just a logical bytenr and pointer back to
btrfs_read_repair_ctrl).

This sounds pretty good to me then.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair
  2022-05-01 23:59           ` Qu Wenruo
@ 2022-05-02 16:45             ` Christoph Hellwig
  2022-05-02 23:00               ` Qu Wenruo
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2022-05-02 16:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Mon, May 02, 2022 at 07:59:43AM +0800, Qu Wenruo wrote:
> So with your previous mention on mempool, did you mean that we allocate
> another mempool for read repair only?
> 
> With the extra read repair mempool, even if we exhaust the last btrfs bio
> in the pool, we still have something like a btrfs read repair bio?
> (And we can get rid of the extra members unused in btrfs bio, since what
> we really need is just a logical bytenr and pointer back to
> btrfs_read_repair_ctrl).
> 
> This sounds pretty good to me then.

I primarily meant a separate mempool for the actual read_repair_ctrl
structure with embedded bitmaps.  But yes, bios also do need to be taken
care of, and a different bio_set might work very well there.
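For the bio side, a dedicated bio_set would roughly look like this (the
name and pool size are picked arbitrarily, and the current
bio_alloc_bioset() signature is assumed):

static struct bio_set btrfs_repair_bioset;

/* At module init time, reserve a small pool only for repair bios. */
ret = bioset_init(&btrfs_repair_bioset, BIO_POOL_SIZE,
		  offsetof(struct btrfs_bio, bio), BIOSET_NEED_BVECS);

/* When building a repair bio, allocate from the reserved pool. */
bio = bio_alloc_bioset(NULL, BIO_MAX_VECS, REQ_OP_READ, GFP_NOFS,
		       &btrfs_repair_bioset);

That way repair never has to compete with the regular read path for the
last entry in btrfs_bioset.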

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair
  2022-05-02 16:45             ` Christoph Hellwig
@ 2022-05-02 23:00               ` Qu Wenruo
  0 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2022-05-02 23:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs



On 2022/5/3 00:45, Christoph Hellwig wrote:
> On Mon, May 02, 2022 at 07:59:43AM +0800, Qu Wenruo wrote:
>> So with your previous mention on mempool, did you mean that we allocate
>> another mempool for read repair only?
>>
>> With the extra read repair mempool, even we exhaust the last btrfs bio
>> in the pool, we still have something like btrfs read repair bio?
>> (And we can get rid of the extra members unused in btrfs bio, since what
>> we really need is just a logical bytenr and pointer back to
>> btrfs_read_repair_ctrl).
>>
>> This sounds pretty good to me then.
>
> I primarily meant a separate mempool for the actual read_repair_ctrl
> structure with embedded bitmaps.

For this, I kept the on-stack read_repair_ctrl, but the bitmap I
pre-allocate for every data read bio at btrfs_submit_data_bio().

If we fail to pre-allocate the bitmap, we directly error out the bio.
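Roughly like this in btrfs_submit_data_bio(); the read_repair_bitmap
member is a made-up name for the new field:

/* Pre-allocate the per-bio bad sector bitmap before submission. */
btrfs_bio(bio)->read_repair_bitmap =
	bitmap_zalloc(bio->bi_iter.bi_size >> fs_info->sectorsize_bits,
		      GFP_NOFS);
if (!btrfs_bio(bio)->read_repair_bitmap) {
	bio->bi_status = BLK_STS_RESOURCE;
	bio_endio(bio);
	return;
}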

I'm not confident about allocating a variable-length bitmap from a mempool.

Thanks,
Qu

>  But yes, bios also do need to be taken
> care of, and a different bio_set might work very well there.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2022-05-02 23:00 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1651043617.git.wqu@suse.com>
2022-04-27  7:18 ` [PATCH RFC v2 01/12] btrfs: introduce a pure data checksum checking helper Qu Wenruo
2022-04-28 13:26   ` Christoph Hellwig
2022-04-27  7:18 ` [PATCH RFC v2 02/12] btrfs: always save bio::bi_iter into btrfs_bio::iter before submitting Qu Wenruo
2022-04-28  5:16   ` Qu Wenruo
2022-04-28 13:32   ` Christoph Hellwig
2022-04-28 22:41     ` Qu Wenruo
2022-04-29 15:09       ` Christoph Hellwig
2022-04-29 23:04         ` Qu Wenruo
2022-04-27  7:18 ` [PATCH RFC v2 03/12] btrfs: remove duplicated parameters from submit_data_read_repair() Qu Wenruo
2022-04-28 13:32   ` Christoph Hellwig
2022-04-27  7:18 ` [PATCH RFC v2 04/12] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors Qu Wenruo
2022-04-28 13:37   ` Christoph Hellwig
2022-04-28 22:51     ` Qu Wenruo
2022-04-29  0:09       ` Qu Wenruo
2022-04-29 15:12         ` Christoph Hellwig
2022-04-29 15:11       ` Christoph Hellwig
2022-04-27  7:18 ` [PATCH RFC v2 05/12] btrfs: add a helper to queue a corrupted sector for read repair Qu Wenruo
2022-04-28  5:20   ` Qu Wenruo
2022-04-28 13:44     ` Christoph Hellwig
2022-04-28 22:55       ` Qu Wenruo
2022-04-29  7:11         ` Qu Wenruo
2022-04-29 15:14         ` Christoph Hellwig
2022-04-29 23:08           ` Qu Wenruo
2022-05-01 23:59           ` Qu Wenruo
2022-05-02 16:45             ` Christoph Hellwig
2022-05-02 23:00               ` Qu Wenruo
2022-04-27  7:18 ` [PATCH RFC v2 06/12] btrfs: introduce a helper to repair from one mirror Qu Wenruo
2022-04-27  7:18 ` [PATCH RFC v2 07/12] btrfs: allow btrfs read repair to submit all writes in one go Qu Wenruo
2022-04-27  7:18 ` [PATCH RFC v2 08/12] btrfs: switch buffered read to the new btrfs_read_repair_* based repair routine Qu Wenruo
2022-04-27 13:59   ` kernel test robot
2022-04-28 10:08   ` kernel test robot
2022-04-28 10:51     ` Qu Wenruo
2022-04-28 10:51       ` Qu Wenruo
2022-04-27  7:18 ` [PATCH RFC v2 09/12] btrfs: switch direct IO routine to use btrfs_read_repair_ctrl Qu Wenruo
2022-04-27  7:18 ` [PATCH RFC v2 10/12] btrfs: cleanup btrfs_repair_one_sector() Qu Wenruo
2022-04-28 13:45   ` Christoph Hellwig
2022-04-27  7:18 ` [PATCH RFC v2 11/12] btrfs: remove io_failure_record infrastructure completely Qu Wenruo
2022-04-27  7:18 ` [PATCH RFC v2 12/12] btrfs: remove btrfs_inode::io_failure_tree Qu Wenruo
