* simple synchronous read repair v2
From: Christoph Hellwig @ 2022-05-27  8:43 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Qu Wenruo, linux-btrfs

Hi all,

this is my take on the read repair code.  It borrows a lot of concepts
and patches from Qu's attempt.  The big difference is that it does away
with multiple in-flight repair bios and instead issues just one at a
time, while trying to make each repair bio as big as possible.

My aim here is mostly to fix up the messy I/O completions for the
direct I/O path, that make further bio optimizations rather annoying,
but it also gives better I/O patterns for repair (although I'm not sure
anyone cares) and removes a fair chunk of code.
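
To make the model concrete, here is a minimal sketch of the caller side
as an end I/O handler would use it, modeled on the direct I/O conversion
in patch 6 (sector_is_good() is a stand-in for the actual checksum
check; everything else matches the helpers added in patch 5):

	struct btrfs_read_repair rr = { };
	blk_status_t err = BLK_STS_OK;
	struct bvec_iter iter;
	struct bio_vec bv;
	u32 offset;

	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
		if (sector_is_good(bbio, &bv, offset)) {
			/* flush any repair batch ending at this sector */
			if (!btrfs_read_repair_finish(&rr, bbio, inode,
					offset, NULL))
				err = BLK_STS_IOERR;
		} else {
			/* start a new repair batch or extend the current one */
			if (!btrfs_read_repair_add(&rr, bbio, inode, offset))
				err = BLK_STS_IOERR;
		}
	}
	/* repair whatever is still batched at the end of the bio */
	if (!btrfs_read_repair_finish(&rr, bbio, inode, offset, NULL))
		err = BLK_STS_IOERR;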

[this is on top of a not-yet-committed series.  I kinda hate doing that, but
 with all the hot repair discussion going on right now we might as well
 have an up-to-date version of this series out on the list]

Git tree:

   git://git.infradead.org/users/hch/misc.git btrfs-read_repair

Gitweb:

   http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-read_repair

Changes since v1:
 - rebased on top of the "misc btrfs cleanups" series that contains
   various patches previously in this series and the
   "cleanup btrfs bio handling, part 2 v4" series
 - handle partial reads due to checksum failures
 - handle large I/O for parity RAID repair as well
 - add a lot more comments
 - rename btrfs_read_repair_end to btrfs_read_repair_finish

Diffstat:
 fs/btrfs/Makefile            |    2 
 fs/btrfs/btrfs_inode.h       |    5 
 fs/btrfs/extent-io-tree.h    |   15 -
 fs/btrfs/extent_io.c         |  570 ++++---------------------------------------
 fs/btrfs/extent_io.h         |   25 -
 fs/btrfs/inode.c             |   55 +---
 fs/btrfs/read-repair.c       |  248 ++++++++++++++++++
 fs/btrfs/read-repair.h       |   33 ++
 fs/btrfs/super.c             |    9 
 fs/btrfs/volumes.c           |   97 +++++++
 fs/btrfs/volumes.h           |    2 
 include/trace/events/btrfs.h |    1 
 12 files changed, 467 insertions(+), 595 deletions(-)


* [PATCH 1/9] btrfs: save the original bi_iter into btrfs_bio for buffered read
From: Christoph Hellwig @ 2022-05-27  8:43 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Qu Wenruo, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

Although we have btrfs_bio::iter, it currently has very limited usage:

- RAID56
  Where it is not needed at all

- btrfs_bio_clone()
  This is used mostly for direct IO.

For the incoming read repair patches, we want to grab the original
logical bytenr, and be able to iterate the range of the bio (no matter
if it's cloned).

So this patch also saves btrfs_bio::iter for buffered read bios at
submit_one_bio().
And for the sake of consistency, it also saves btrfs_bio::iter for
direct IO at btrfs_submit_dio_bio().

The reason we don't save the iter in btrfs_map_bio() is that
btrfs_map_bio() has to handle various bios, with or without the
btrfs_bio bioset, and we want btrfs_map_bio() to handle, and only
handle, plain bios without bothering with the bioset.
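
To illustrate what the saved iter enables: even after the low-level
driver has consumed bio->bi_iter, a completion handler can still grab
the original logical bytenr and walk the full range using the standard
block layer iterator, starting from the saved copy (process_sector() is
a hypothetical placeholder):

	struct btrfs_bio *bbio = btrfs_bio(bio);
	/* the original logical bytenr of the whole read */
	u64 logical = bbio->iter.bi_sector << SECTOR_SHIFT;
	struct bvec_iter iter;
	struct bio_vec bv;

	/* bio->bi_iter may be empty here; bbio->iter still covers the I/O */
	__bio_for_each_segment(bv, bio, iter, bbio->iter)
		process_sector(bv.bv_page, bv.bv_offset, bv.bv_len);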

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eaeef64ea3486..025444aba2847 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2633,6 +2633,10 @@ void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	blk_status_t ret;
 
+	/* save the original iter for read repair */
+	if (bio_op(bio) == REQ_OP_READ)
+		btrfs_bio(bio)->iter = bio->bi_iter;
+
 	if (compress_type != BTRFS_COMPRESS_NONE) {
 		/*
 		 * btrfs_submit_compressed_read will handle completing the bio
@@ -7947,6 +7951,10 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_dio_private *dip = bio->bi_private;
 	blk_status_t ret;
+
+	/* save the original iter for read repair */
+	if (btrfs_op(bio) == BTRFS_MAP_READ)
+		btrfs_bio(bio)->iter = bio->bi_iter;
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
 		goto map;
-- 
2.30.2



* [PATCH 2/9] btrfs: set ->file_offset in end_bio_extent_readpage
From: Christoph Hellwig @ 2022-05-27  8:43 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Qu Wenruo, linux-btrfs

The new repair code expects ->file_offset to be set for all bios.  Set
it just after entering end_bio_extent_readpage.  As that requires looking
at the first vector before the first loop iteration, also use that
opportunity to set various file-wide variables just once instead of once
per loop iteration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 48463705ef0e6..54a0ec62c7b0d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3009,25 +3009,30 @@ static struct extent_buffer *find_extent_buffer_readpage(
  */
 static void end_bio_extent_readpage(struct bio *bio)
 {
+	struct bio_vec *first_vec = bio_first_bvec_all(bio);
+	struct inode *inode = first_vec->bv_page->mapping->host;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	const u32 sectorsize = fs_info->sectorsize;
 	struct bio_vec *bvec;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
-	struct extent_io_tree *tree, *failure_tree;
+	int mirror = bbio->mirror_num;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+	bool uptodate = !bio->bi_status;
 	struct processed_extent processed = { 0 };
 	/*
 	 * The offset to the beginning of a bio, since one bio can never be
 	 * larger than UINT_MAX, u32 here is enough.
 	 */
 	u32 bio_offset = 0;
-	int mirror;
 	struct bvec_iter_all iter_all;
 
+	btrfs_bio(bio)->file_offset =
+		page_offset(first_vec->bv_page) + first_vec->bv_offset;
+
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		bool uptodate = !bio->bi_status;
 		struct page *page = bvec->bv_page;
-		struct inode *inode = page->mapping->host;
-		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-		const u32 sectorsize = fs_info->sectorsize;
 		unsigned int error_bitmap = (unsigned int)-1;
 		bool repair = false;
 		u64 start;
@@ -3037,9 +3042,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 		btrfs_debug(fs_info,
 			"end_bio_extent_readpage: bi_sector=%llu, err=%d, mirror=%u",
 			bio->bi_iter.bi_sector, bio->bi_status,
-			bbio->mirror_num);
-		tree = &BTRFS_I(inode)->io_tree;
-		failure_tree = &BTRFS_I(inode)->io_failure_tree;
+			mirror);
 
 		/*
 		 * We always issue full-sector reads, but if some block in a
@@ -3062,7 +3065,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 		end = start + bvec->bv_len - 1;
 		len = bvec->bv_len;
 
-		mirror = bbio->mirror_num;
 		if (likely(uptodate)) {
 			if (is_data_inode(inode)) {
 				error_bitmap = btrfs_verify_data_csum(bbio,
-- 
2.30.2



* [PATCH 3/9] btrfs: factor out a btrfs_map_repair_bio helper
From: Christoph Hellwig @ 2022-05-27  8:43 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Qu Wenruo, linux-btrfs

Factor out the guts of repair_io_failure so that we have a new helper
to submit synchronous I/O for repair.  Unlike repair_io_failure itself,
this helper will also be extended to handle reads in the next patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 82 ++++++--------------------------------------
 fs/btrfs/volumes.c   | 69 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h   |  2 ++
 3 files changed, 81 insertions(+), 72 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 54a0ec62c7b0d..27775031ed2d4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2308,27 +2308,13 @@ int free_io_failure(struct extent_io_tree *failure_tree,
 	return err;
 }
 
-/*
- * this bypasses the standard btrfs submit functions deliberately, as
- * the standard behavior is to write all copies in a raid setup. here we only
- * want to write the one bad copy. so we do the mapping for ourselves and issue
- * submit_bio directly.
- * to avoid any synchronization issues, wait for the data after writing, which
- * actually prevents the read that triggered the error from finishing.
- * currently, there can be no more than two copies of every data bit. thus,
- * exactly one rewrite is required.
- */
 static int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 			     u64 length, u64 logical, struct page *page,
 			     unsigned int pg_offset, int mirror_num)
 {
-	struct btrfs_device *dev;
 	struct bio_vec bvec;
 	struct bio bio;
-	u64 map_length = 0;
-	u64 sector;
-	struct btrfs_io_context *bioc = NULL;
-	int ret = 0;
+	int ret;
 
 	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
 	BUG_ON(!mirror_num);
@@ -2336,67 +2322,19 @@ static int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 	if (btrfs_repair_one_zone(fs_info, logical))
 		return 0;
 
-	map_length = length;
-
-	/*
-	 * Avoid races with device replace and make sure our bioc has devices
-	 * associated to its stripes that don't go away while we are doing the
-	 * read repair operation.
-	 */
-	btrfs_bio_counter_inc_blocked(fs_info);
-	if (btrfs_is_parity_mirror(fs_info, logical, length)) {
-		/*
-		 * Note that we don't use BTRFS_MAP_WRITE because it's supposed
-		 * to update all raid stripes, but here we just want to correct
-		 * bad stripe, thus BTRFS_MAP_READ is abused to only get the bad
-		 * stripe's dev and sector.
-		 */
-		ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
-				      &map_length, &bioc, 0);
-		if (ret)
-			goto out_counter_dec;
-		ASSERT(bioc->mirror_num == 1);
-	} else {
-		ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical,
-				      &map_length, &bioc, mirror_num);
-		if (ret)
-			goto out_counter_dec;
-		BUG_ON(mirror_num != bioc->mirror_num);
-	}
-
-	sector = bioc->stripes[bioc->mirror_num - 1].physical >> 9;
-	dev = bioc->stripes[bioc->mirror_num - 1].dev;
-	btrfs_put_bioc(bioc);
-
-	if (!dev || !dev->bdev ||
-	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
-		ret = -EIO;
-		goto out_counter_dec;
-	}
-
-	bio_init(&bio, dev->bdev, &bvec, 1, REQ_OP_WRITE | REQ_SYNC);
-	bio.bi_iter.bi_sector = sector;
+	bio_init(&bio, NULL, &bvec, 1, REQ_OP_WRITE | REQ_SYNC);
+	bio.bi_iter.bi_sector = logical >> 9;
 	__bio_add_page(&bio, page, length, pg_offset);
+	ret = btrfs_map_repair_bio(fs_info, &bio, mirror_num);
+	bio_uninit(&bio);
 
-	btrfsic_check_bio(&bio);
-	ret = submit_bio_wait(&bio);
-	if (ret) {
-		/* try to remap that extent elsewhere? */
-		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS);
-		goto out_bio_uninit;
-	}
+	if (ret)
+		return ret;
 
 	btrfs_info_rl_in_rcu(fs_info,
-		"read error corrected: ino %llu off %llu (dev %s sector %llu)",
-				  ino, start,
-				  rcu_str_deref(dev->name), sector);
-	ret = 0;
-
-out_bio_uninit:
-	bio_uninit(&bio);
-out_counter_dec:
-	btrfs_bio_counter_dec(fs_info);
-	return ret;
+		"read error corrected: ino %llu off %llu (logical %llu)",
+			  ino, start, logical);
+	return 0;
 }
 
 int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1954a3e1c93a9..515f5fccf3d17 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6805,6 +6805,75 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	return errno_to_blk_status(ret);
 }
 
+/*
+ * This bypasses the standard btrfs submit functions deliberately, as the
+ * standard behavior is to write all copies in a raid setup. Here we only want
+ * to write the one bad copy.  So do the mapping ourselves and submit directly
+ * and synchronously.
+ */
+int btrfs_map_repair_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
+		int mirror_num)
+{
+	u64 logical = bio->bi_iter.bi_sector << 9;
+	u64 map_length = bio->bi_iter.bi_size;
+	struct btrfs_io_context *bioc = NULL;
+	struct btrfs_device *dev;
+	u64 sector;
+	int ret;
+
+	ASSERT(mirror_num);
+	ASSERT(bio_op(bio) == REQ_OP_WRITE);
+
+	/*
+	 * Avoid races with device replace and make sure our bioc has devices
+	 * associated to its stripes that don't go away while we are doing the
+	 * read repair operation.
+	 */
+	btrfs_bio_counter_inc_blocked(fs_info);
+	if (btrfs_is_parity_mirror(fs_info, logical, map_length)) {
+		/*
+		 * Note that we don't use BTRFS_MAP_WRITE because it's supposed
+		 * to update all raid stripes, but here we just want to correct
+		 * bad stripe, thus BTRFS_MAP_READ is abused to only get the bad
+		 * stripe's dev and sector.
+		 */
+		ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
+				&map_length, &bioc, 0);
+		if (ret)
+			goto out_counter_dec;
+		ASSERT(bioc->mirror_num == 1);
+	} else {
+		ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical,
+				&map_length, &bioc, mirror_num);
+		if (ret)
+			goto out_counter_dec;
+		BUG_ON(mirror_num != bioc->mirror_num);
+	}
+
+	sector = bioc->stripes[bioc->mirror_num - 1].physical >> 9;
+	dev = bioc->stripes[bioc->mirror_num - 1].dev;
+	btrfs_put_bioc(bioc);
+
+	if (!dev || !dev->bdev ||
+	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
+		ret = -EIO;
+		goto out_counter_dec;
+	}
+
+	bio_set_dev(bio, dev->bdev);
+	bio->bi_iter.bi_sector = sector;
+
+	btrfsic_check_bio(bio);
+	submit_bio_wait(bio);
+
+	ret = blk_status_to_errno(bio->bi_status);
+	if (ret)
+		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS);
+out_counter_dec:
+	btrfs_bio_counter_dec(fs_info);
+	return ret;
+}
+
 static bool dev_args_match_fs_devices(const struct btrfs_dev_lookup_args *args,
 				      const struct btrfs_fs_devices *fs_devices)
 {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 517288d46ecf5..00c87833ce841 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -565,6 +565,8 @@ struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
 void btrfs_mapping_tree_free(struct extent_map_tree *tree);
 blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			   int mirror_num);
+int btrfs_map_repair_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
+		int mirror_num);
 int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 		       fmode_t flags, void *holder);
 struct btrfs_device *btrfs_scan_one_device(const char *path,
-- 
2.30.2



* [PATCH 4/9] btrfs: support read bios in btrfs_map_repair_bio
From: Christoph Hellwig @ 2022-05-27  8:43 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Qu Wenruo, linux-btrfs

Enhance btrfs_map_repair_bio to also support reading so that we have a
single function dealing with all synchronous bio I/O for the repair code.
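
For example, synchronously re-reading one sector from a specific mirror
now looks much like the write case in the previous patch (a sketch with
error handling trimmed; logical, page, pg_offset and mirror_num are
assumed to describe the range to re-read):

	struct bio_vec bvec;
	struct bio bio;
	int ret;

	bio_init(&bio, NULL, &bvec, 1, REQ_OP_READ | REQ_SYNC);
	bio.bi_iter.bi_sector = logical >> 9;
	__bio_add_page(&bio, page, fs_info->sectorsize, pg_offset);
	ret = btrfs_map_repair_bio(fs_info, &bio, mirror_num);
	bio_uninit(&bio);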

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/volumes.c | 48 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 515f5fccf3d17..9053b62af3607 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6805,6 +6805,11 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	return errno_to_blk_status(ret);
 }
 
+static void btrfs_bio_end_io_sync(struct bio *bio)
+{
+	complete(bio->bi_private);
+}
+
 /*
  * This bypasses the standard btrfs submit functions deliberately, as the
  * standard behavior is to write all copies in a raid setup. Here we only want
@@ -6814,15 +6819,17 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 int btrfs_map_repair_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		int mirror_num)
 {
+	enum btrfs_map_op op = btrfs_op(bio);
 	u64 logical = bio->bi_iter.bi_sector << 9;
 	u64 map_length = bio->bi_iter.bi_size;
+	bool is_raid56 = btrfs_is_parity_mirror(fs_info, logical, map_length);
 	struct btrfs_io_context *bioc = NULL;
+	unsigned int stripe_idx = 0;
 	struct btrfs_device *dev;
 	u64 sector;
 	int ret;
 
 	ASSERT(mirror_num);
-	ASSERT(bio_op(bio) == REQ_OP_WRITE);
 
 	/*
 	 * Avoid races with device replace and make sure our bioc has devices
@@ -6830,7 +6837,23 @@ int btrfs_map_repair_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	 * read repair operation.
 	 */
 	btrfs_bio_counter_inc_blocked(fs_info);
-	if (btrfs_is_parity_mirror(fs_info, logical, map_length)) {
+	if (is_raid56) {
+		if (op == BTRFS_MAP_READ) {
+			DECLARE_COMPLETION_ONSTACK(done);
+
+			ret = __btrfs_map_block(fs_info, op, logical,
+					&map_length, &bioc, mirror_num, 1);
+			if (ret)
+				goto out_counter_dec;
+
+			bio->bi_private = &done;
+			bio->bi_end_io = btrfs_bio_end_io_sync;
+			ret = raid56_parity_recover(bio, bioc,
+					map_length, mirror_num, 1);
+			wait_for_completion_io(&done);
+			goto out_bio_status;
+		}
+
 		/*
 		 * Note that we don't use BTRFS_MAP_WRITE because it's supposed
 		 * to update all raid stripes, but here we just want to correct
@@ -6843,19 +6866,24 @@ int btrfs_map_repair_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			goto out_counter_dec;
 		ASSERT(bioc->mirror_num == 1);
 	} else {
-		ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical,
-				&map_length, &bioc, mirror_num);
+		ret = btrfs_map_block(fs_info, op, logical, &map_length, &bioc,
+				mirror_num);
 		if (ret)
 			goto out_counter_dec;
-		BUG_ON(mirror_num != bioc->mirror_num);
+
+		if (op == BTRFS_MAP_WRITE) {
+			ASSERT(mirror_num == bioc->mirror_num);
+			stripe_idx = bioc->mirror_num - 1;
+		}
 	}
 
-	sector = bioc->stripes[bioc->mirror_num - 1].physical >> 9;
-	dev = bioc->stripes[bioc->mirror_num - 1].dev;
+	sector = bioc->stripes[stripe_idx].physical >> 9;
+	dev = bioc->stripes[stripe_idx].dev;
 	btrfs_put_bioc(bioc);
 
 	if (!dev || !dev->bdev ||
-	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
+	    (op == BTRFS_MAP_WRITE &&
+	     !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))) {
 		ret = -EIO;
 		goto out_counter_dec;
 	}
@@ -6865,9 +6893,9 @@ int btrfs_map_repair_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 
 	btrfsic_check_bio(bio);
 	submit_bio_wait(bio);
-
+out_bio_status:
 	ret = blk_status_to_errno(bio->bi_status);
-	if (ret)
+	if (ret && op == BTRFS_MAP_WRITE)
 		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS);
 out_counter_dec:
 	btrfs_bio_counter_dec(fs_info);
-- 
2.30.2



* [PATCH 5/9] btrfs: add new read repair infrastructure
From: Christoph Hellwig @ 2022-05-27  8:43 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Qu Wenruo, linux-btrfs

This adds a new read repair implementation for btrfs.  It is synchronous
in that the end I/O handlers call it and get the results back, instead
of potentially getting multiple concurrent calls back into the
original end I/O handler.  The synchronous nature has the following
advantages:

 - there is no need for a per-I/O tree of I/O failures, as everything
   related to the I/O failure can be handled locally
 - not having separate repair end I/O helpers will in the future help
   to reuse the direct I/O bio from iomap for the actual submission and
   thus remove the btrfs_dio_private infrastructure

Because submitting many sector-sized synchronous I/Os would be very
slow when multiple sectors (or a whole read) fail, this new code instead
submits a single read and repair write bio for each contiguous section.
It uses a clone of the bio to do that and thus does not need to allocate
any extra bio_vecs.  Note that this cloning is open coded instead of
using the block layer clone helpers, as the clone is based on the saved
iter in the btrfs_bio rather than on bio.bi_iter, which has already been
advanced by the low-level driver by the time the repair code is called.
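
In outline the open-coded clone amounts to the following (simplified
from btrfs_repair_bio_clone below):

	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
			       &read_repair_bioset);
	bio_set_flag(bio, BIO_CLONED);
	bio->bi_io_vec = src_bbio->bio.bi_io_vec;	/* share the pages */
	bio->bi_iter = src_bbio->iter;		/* saved, unadvanced iter */
	bio_advance(bio, offset);		/* skip already handled sectors */
	bio->bi_iter.bi_size = size;		/* trim to the range to repair */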

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/Makefile      |   2 +-
 fs/btrfs/read-repair.c | 248 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/read-repair.h |  33 ++++++
 fs/btrfs/super.c       |   9 +-
 4 files changed, 289 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/read-repair.c
 create mode 100644 fs/btrfs/read-repair.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 99f9995670ea3..0b2605c750cab 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o tree-mod-log.o
+	   subpage.o tree-mod-log.o read-repair.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
new file mode 100644
index 0000000000000..b7010a4589953
--- /dev/null
+++ b/fs/btrfs/read-repair.c
@@ -0,0 +1,248 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Christoph Hellwig.
+ */
+#include "ctree.h"
+#include "volumes.h"
+#include "read-repair.h"
+#include "btrfs_inode.h"
+
+static struct bio_set read_repair_bioset;
+
+static int next_mirror(struct btrfs_read_repair *rr, int cur_mirror)
+{
+	if (cur_mirror == rr->num_copies)
+		return cur_mirror + 1 - rr->num_copies;
+	return cur_mirror + 1;
+}
+
+static int prev_mirror(struct btrfs_read_repair *rr, int cur_mirror)
+{
+	if (cur_mirror == 1)
+		return rr->num_copies;
+	return cur_mirror - 1;
+}
+
+/*
+ * Clone a new bio from the src_bbio, using the saved iter in the btrfs_bio
+ * instead of using bio->bi_iter like the block layer cloning helpers.
+ */
+static struct btrfs_bio *btrfs_repair_bio_clone(struct btrfs_bio *src_bbio,
+		u64 offset, u32 size, unsigned int op)
+{
+	struct btrfs_bio *bbio;
+	struct bio *bio;
+
+	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
+			       &read_repair_bioset);
+	bio_set_flag(bio, BIO_CLONED);
+
+	bio->bi_io_vec = src_bbio->bio.bi_io_vec;
+	bio->bi_iter = src_bbio->iter;
+	bio_advance(bio, offset);
+	bio->bi_iter.bi_size = size;
+
+	bbio = btrfs_bio(bio);
+	memset(bbio, 0, offsetof(struct btrfs_bio, bio));
+	bbio->iter = bbio->bio.bi_iter;
+	bbio->file_offset = src_bbio->file_offset + offset;
+
+	return bbio;
+}
+
+static void btrfs_repair_one_mirror(struct btrfs_bio *read_bbio,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u32 good_size, int bad_mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_inode *bi = BTRFS_I(inode);
+	u64 logical = read_bbio->iter.bi_sector << SECTOR_SHIFT;
+	u64 file_offset = read_bbio->file_offset;
+	struct btrfs_bio *write_bbio;
+	int ret;
+
+	/*
+	 * For zoned file systems repair has to relocate the whole zone.
+	 */
+	if (btrfs_repair_one_zone(fs_info, logical))
+		return;
+
+	/*
+	 * Otherwise just clone good part of the read bio and write it back to
+	 * the previously bad mirror.
+	 */
+	write_bbio = btrfs_repair_bio_clone(read_bbio, 0, good_size,
+				REQ_OP_WRITE);
+	ret = btrfs_map_repair_bio(fs_info, &write_bbio->bio, bad_mirror);
+	bio_put(&write_bbio->bio);
+
+	btrfs_info_rl(fs_info,
+		"%s: root %lld ino %llu off %llu logical %llu/%u from good mirror %d",
+		ret ? "failed to correct read error" : "read error corrected",
+		bi->root->root_key.objectid, btrfs_ino(bi),
+		file_offset, logical, read_bbio->iter.bi_size, bad_mirror);
+}
+
+static bool btrfs_repair_read_bio(struct btrfs_bio *bbio,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u32 *good_size, int read_mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u32 start_offset = bbio->file_offset - failed_bbio->file_offset;
+	u8 csum[BTRFS_CSUM_SIZE];
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	u32 offset;
+
+	if (btrfs_map_repair_bio(fs_info, &bbio->bio, read_mirror))
+		return false;
+
+	*good_size = bbio->iter.bi_size;
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
+		return true;
+
+	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
+		u8 *expected_csum =
+			btrfs_csum_ptr(fs_info, failed_bbio->csum,
+					start_offset + offset);
+
+		if (btrfs_check_sector_csum(fs_info, bv.bv_page, bv.bv_offset,
+				csum, expected_csum)) {
+			/*
+			 * Just fail if checksum verification failed for the
+			 * very first sector.  Else return how much good data we
+			 * found so that we can only write back as much to the
+			 * bad mirror(s).
+			 */
+			if (offset == 0)
+				return false;
+			*good_size = offset;
+			break;
+		}
+	}
+
+	return true;
+}
+
+bool __btrfs_read_repair_finish(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 end_offset, repair_endio_t endio)
+{
+	u8 first_mirror, bad_mirror, read_mirror;
+	u64 start_offset = rr->start_offset;
+	struct btrfs_bio *read_bbio = NULL;
+	bool uptodate = false;
+	u32 good_size;
+
+	bad_mirror = first_mirror = failed_bbio->mirror_num;
+	while ((read_mirror = next_mirror(rr, bad_mirror)) != first_mirror) {
+		if (read_bbio)
+			bio_put(&read_bbio->bio);
+
+		/*
+		 * Try to read the entire failed range from a presumably good
+		 * mirror.
+		 */
+		read_bbio = btrfs_repair_bio_clone(failed_bbio,
+				start_offset, end_offset - start_offset,
+				REQ_OP_READ);
+		if (!btrfs_repair_read_bio(read_bbio, failed_bbio, inode,
+				&good_size, read_mirror)) {
+			/*
+			 * If we failed to read any data at all, go straight to
+			 * the next mirror.
+			 */
+			bad_mirror = read_mirror;
+			continue;
+		}
+
+		/*
+		 * If we have some good data write it back to all the previously
+		 * bad mirrors.
+		 */
+		for (;;) {
+			btrfs_repair_one_mirror(read_bbio, failed_bbio, inode,
+					good_size, bad_mirror);
+			if (bad_mirror == first_mirror)
+				break;
+			bad_mirror = prev_mirror(rr, bad_mirror);
+		}
+
+		/*
+		 * If the whole bio was good, we are done now.
+		 */
+		if (good_size == read_bbio->iter.bi_size) {
+			uptodate = true;
+			break;
+		}
+
+		/*
+		 * Only the start of the bio was good. Complete the good bytes
+		 * and fix up the iter to cover bad sectors so that the bad
+		 * range can be passed to the endio handler in case there is no
+		 * good mirror left.
+		 */
+		if (endio)
+			endio(read_bbio, inode, true);
+		start_offset += good_size;
+		read_bbio->file_offset += good_size;
+		bio_advance_iter(&read_bbio->bio, &read_bbio->iter, good_size);
+
+		/*
+		 * Restart the loop now that we've made some progress.
+		 *
+		 * This ensures we go back to mirrors that returned bad data
+		 * earlier, as they might have good data for subsequent sectors.
+		 */
+		first_mirror = bad_mirror = read_mirror;
+	}
+
+	if (endio)
+		endio(read_bbio, inode, uptodate);
+	bio_put(&read_bbio->bio);
+
+	rr->in_use = false;
+	return uptodate;
+}
+
+bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 start_offset)
+{
+	if (rr->in_use)
+		return true;
+
+	/*
+	 * Only set ->num_copies once as it must be the same for the whole
+	 * I/O that the repair code iterates over.
+	 */
+	if (!rr->num_copies) {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		rr->num_copies = btrfs_num_copies(fs_info,
+				failed_bbio->iter.bi_sector << SECTOR_SHIFT,
+				failed_bbio->iter.bi_size);
+	}
+
+	/*
+	 * If there is no other copy of the data to recover from, give up now
+	 * and don't even try to build up a larger batch.
+	 */
+	if (rr->num_copies < 2)
+		return false;
+
+	rr->in_use = true;
+	rr->start_offset = start_offset;
+	return true;
+}
+
+int __init btrfs_read_repair_init(void)
+{
+	return bioset_init(&read_repair_bioset, BIO_POOL_SIZE,
+			offsetof(struct btrfs_bio, bio), 0);
+}
+
+void btrfs_read_repair_exit(void)
+{
+	bioset_exit(&read_repair_bioset);
+}
diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
new file mode 100644
index 0000000000000..20366f5f0a239
--- /dev/null
+++ b/fs/btrfs/read-repair.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BTRFS_READ_REPAIR_H
+#define BTRFS_READ_REPAIR_H
+
+struct btrfs_read_repair {
+	u64 start_offset;
+	bool in_use;
+	int num_copies;
+};
+
+typedef void (*repair_endio_t)(struct btrfs_bio *repair_bbio,
+		struct inode *inode, bool uptodate);
+
+bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 bio_offset);
+bool __btrfs_read_repair_finish(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 end_offset, repair_endio_t end_io);
+static inline bool btrfs_read_repair_finish(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 end_offset, repair_endio_t endio)
+{
+	if (!rr->in_use)
+		return true;
+	return __btrfs_read_repair_finish(rr, failed_bbio, inode, end_offset,
+			endio);
+}
+
+int __init btrfs_read_repair_init(void);
+void btrfs_read_repair_exit(void);
+
+#endif /* BTRFS_READ_REPAIR_H */
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 938f42e53a8f8..11270bd78cd71 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -48,6 +48,7 @@
 #include "block-group.h"
 #include "discard.h"
 #include "qgroup.h"
+#include "read-repair.h"
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
 
@@ -2645,10 +2646,12 @@ static int __init init_btrfs_fs(void)
 	err = extent_io_init();
 	if (err)
 		goto free_cachep;
-
-	err = extent_state_cache_init();
+	err = btrfs_read_repair_init();
 	if (err)
 		goto free_extent_io;
+	err = extent_state_cache_init();
+	if (err)
+		goto free_read_repair;
 
 	err = extent_map_init();
 	if (err)
@@ -2706,6 +2709,8 @@ static int __init init_btrfs_fs(void)
 	extent_map_exit();
 free_extent_state_cache:
 	extent_state_cache_exit();
+free_read_repair:
+	btrfs_read_repair_exit();
 free_extent_io:
 	extent_io_exit();
 free_cachep:
-- 
2.30.2



* [PATCH 6/9] btrfs: use the new read repair code for direct I/O
From: Christoph Hellwig @ 2022-05-27  8:43 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Qu Wenruo, linux-btrfs

Rewrite btrfs_check_read_dio_bio to use btrfs_bio_for_each_sector and
start/end a repair session as needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 40 ++++++++++------------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6195b9490b6b..76575b1bf30ad 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -55,6 +55,7 @@
 #include "zoned.h"
 #include "subpage.h"
 #include "inode-item.h"
+#include "read-repair.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -7853,55 +7854,34 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 	bio_endio(&dip->bio);
 }
 
-static void submit_dio_repair_bio(struct inode *inode, struct bio *bio,
-				  int mirror_num,
-				  enum btrfs_compression_type compress_type)
-{
-	struct btrfs_dio_private *dip = bio->bi_private;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-
-	BUG_ON(bio_op(bio) == REQ_OP_WRITE);
-
-	refcount_inc(&dip->refs);
-	if (btrfs_map_bio(fs_info, bio, mirror_num))
-		refcount_dec(&dip->refs);
-}
-
 static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 					     struct btrfs_bio *bbio,
 					     const bool uptodate)
 {
 	struct inode *inode = dip->inode;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
+	struct btrfs_read_repair rr = { };
 	blk_status_t err = BLK_STS_OK;
 	struct bvec_iter iter;
 	struct bio_vec bv;
 	u32 offset;
 
 	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
-		u64 start = bbio->file_offset + offset;
-
 		if (uptodate &&
 		    (!csum || !check_data_csum(inode, bbio, offset, bv.bv_page,
-					       bv.bv_offset, start))) {
-			clean_io_failure(fs_info, failure_tree, io_tree, start,
-					 bv.bv_page, btrfs_ino(BTRFS_I(inode)),
-					 bv.bv_offset);
+				bv.bv_offset, bbio->file_offset + offset))) {
+			if (!btrfs_read_repair_finish(&rr, bbio, inode, offset,
+					NULL))
+				err = BLK_STS_IOERR;
 		} else {
-			int ret;
-
-			ret = btrfs_repair_one_sector(inode, &bbio->bio, offset,
-					bv.bv_page, bv.bv_offset, start,
-					bbio->mirror_num,
-					submit_dio_repair_bio);
-			if (ret)
-				err = errno_to_blk_status(ret);
+			if (!btrfs_read_repair_add(&rr, bbio, inode, offset))
+				err = BLK_STS_IOERR;
 		}
 	}
 
+	if (!btrfs_read_repair_finish(&rr, bbio, inode, offset, NULL))
+		err = BLK_STS_IOERR;
 	return err;
 }
 
-- 
2.30.2



* [PATCH 7/9] btrfs: use the new read repair code for buffered reads
From: Christoph Hellwig @ 2022-05-27  8:43 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Qu Wenruo, linux-btrfs

Start/end a repair session as needed in end_bio_extent_readpage and
submit_data_read_repair.  Unlike direct I/O, the buffered I/O handler
completes I/O on a per-sector basis and thus needs to pass an endio
handler to the repair code, which unlocks all pages and marks them
as either uptodate or not.
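
Note that the endio handler can be invoked more than once per failed
bio: once with uptodate set for each good chunk that could be read and
written back, and a final time covering any remaining range for which
no good mirror was found.  A sketch of the shape such a handler takes
(mirroring end_read_repair in the diff below):

static void demo_end_read_repair(struct btrfs_bio *repair_bbio,
		struct inode *inode, bool uptodate)
{
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
	struct bvec_iter iter;
	struct bio_vec bv;
	u32 offset;

	/* unlock each sector and mark it uptodate or not */
	btrfs_bio_for_each_sector(fs_info, bv, repair_bbio, iter, offset)
		end_sector_io(bv.bv_page,
				repair_bbio->file_offset + offset, uptodate);
}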

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 76 ++++++++++++++++++++------------------------
 1 file changed, 35 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 27775031ed2d4..9d7835ba6d396 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -30,6 +30,7 @@
 #include "zoned.h"
 #include "block-group.h"
 #include "compression.h"
+#include "read-repair.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -2683,14 +2684,29 @@ static void end_sector_io(struct page *page, u64 offset, bool uptodate)
 				    offset + sectorsize - 1, &cached);
 }
 
+static void end_read_repair(struct btrfs_bio *repair_bbio, struct inode *inode,
+		bool uptodate)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	u32 offset;
+
+	btrfs_bio_for_each_sector(fs_info, bv, repair_bbio, iter, offset)
+		end_sector_io(bv.bv_page, repair_bbio->file_offset + offset,
+				uptodate);
+}
+
 static void submit_data_read_repair(struct inode *inode, struct bio *failed_bio,
 				    u32 bio_offset, const struct bio_vec *bvec,
-				    int failed_mirror, unsigned int error_bitmap)
+				    int failed_mirror,
+				    unsigned int error_bitmap,
+				    struct btrfs_read_repair *rr)
 {
-	const unsigned int pgoff = bvec->bv_offset;
+	struct btrfs_bio *failed_bbio = btrfs_bio(failed_bio);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
 	struct page *page = bvec->bv_page;
-	const u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
 	const u64 end = start + bvec->bv_len - 1;
 	const u32 sectorsize = fs_info->sectorsize;
 	const int nr_bits = (end + 1 - start) >> fs_info->sectorsize_bits;
@@ -2712,38 +2728,17 @@ static void submit_data_read_repair(struct inode *inode, struct bio *failed_bio,
 
 	/* Iterate through all the sectors in the range */
 	for (i = 0; i < nr_bits; i++) {
-		const unsigned int offset = i * sectorsize;
-		bool uptodate = false;
-		int ret;
-
-		if (!(error_bitmap & (1U << i))) {
-			/*
-			 * This sector has no error, just end the page read
-			 * and unlock the range.
-			 */
-			uptodate = true;
-			goto next;
+		bool uptodate = !(error_bitmap & (1U << i));
+
+		if (uptodate ||
+		    !btrfs_read_repair_add(rr, failed_bbio, inode,
+				bio_offset)) {
+			btrfs_read_repair_finish(rr, failed_bbio, inode,
+					bio_offset, end_read_repair);
+			end_sector_io(page, start, uptodate);
 		}
-
-		ret = btrfs_repair_one_sector(inode, failed_bio,
-				bio_offset + offset,
-				page, pgoff + offset, start + offset,
-				failed_mirror, btrfs_submit_data_read_bio);
-		if (!ret) {
-			/*
-			 * We have submitted the read repair, the page release
-			 * will be handled by the endio function of the
-			 * submitted repair bio.
-			 * Thus we don't need to do any thing here.
-			 */
-			continue;
-		}
-		/*
-		 * Continue on failed repair, otherwise the remaining sectors
-		 * will not be properly unlocked.
-		 */
-next:
-		end_sector_io(page, start + offset, uptodate);
+		bio_offset += sectorsize;
+		start += sectorsize;
 	}
 }
 
@@ -2954,8 +2949,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 	struct bio_vec *bvec;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	int mirror = bbio->mirror_num;
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	bool uptodate = !bio->bi_status;
 	struct processed_extent processed = { 0 };
 	/*
@@ -2964,6 +2957,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 	 */
 	u32 bio_offset = 0;
 	struct bvec_iter_all iter_all;
+	struct btrfs_read_repair rr = { };
 
 	btrfs_bio(bio)->file_offset =
 		page_offset(first_vec->bv_page) + first_vec->bv_offset;
@@ -3020,10 +3014,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 			loff_t i_size = i_size_read(inode);
 			pgoff_t end_index = i_size >> PAGE_SHIFT;
 
-			clean_io_failure(BTRFS_I(inode)->root->fs_info,
-					 failure_tree, tree, start, page,
-					 btrfs_ino(BTRFS_I(inode)), 0);
-
 			/*
 			 * Zero out the remaining part if this range straddles
 			 * i_size.
@@ -3063,9 +3053,11 @@ static void end_bio_extent_readpage(struct bio *bio)
 			 * and bad sectors, we just continue to the next bvec.
 			 */
 			submit_data_read_repair(inode, bio, bio_offset, bvec,
-						mirror, error_bitmap);
+						mirror, error_bitmap, &rr);
 		} else {
 			/* Update page status and unlock */
+			btrfs_read_repair_finish(&rr, btrfs_bio(bio), inode,
+					bio_offset, end_read_repair);
 			end_page_read(page, uptodate, start, len);
 			endio_readpage_release_extent(&processed, BTRFS_I(inode),
 					start, end, PageUptodate(page));
@@ -3076,6 +3068,8 @@ static void end_bio_extent_readpage(struct bio *bio)
 
 	}
 	/* Release the last extent */
+	btrfs_read_repair_finish(&rr, btrfs_bio(bio), inode, bio_offset,
+			end_read_repair);
 	endio_readpage_release_extent(&processed, NULL, 0, 0, false);
 	btrfs_bio_free_csum(bbio);
 	bio_put(bio);
-- 
2.30.2



* [PATCH 8/9] btrfs: remove io_failure_record infrastructure completely
From: Christoph Hellwig @ 2022-05-27  8:43 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Qu Wenruo, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

Since read repair is now always handled by the btrfs_read_repair
structure, which only lives inside the end I/O handler, we no longer
need to record which range failed and with which mirror number.

If we failed to read some data page, we have already tried every
mirror we have, so there is no need to record the failed range.

Thus this patch removes the whole io_failure_record structure and its
related functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/btrfs_inode.h       |   5 -
 fs/btrfs/extent-io-tree.h    |  15 --
 fs/btrfs/extent_io.c         | 365 -----------------------------------
 fs/btrfs/extent_io.h         |  25 ---
 fs/btrfs/inode.c             |   7 -
 include/trace/events/btrfs.h |   1 -
 6 files changed, 418 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 33811e896623f..3eeba0eb9f16b 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -91,11 +91,6 @@ struct btrfs_inode {
 	/* the io_tree does range state (DIRTY, LOCKED etc) */
 	struct extent_io_tree io_tree;
 
-	/* special utility tree used to record which mirrors have already been
-	 * tried when checksums fail for a given block
-	 */
-	struct extent_io_tree io_failure_tree;
-
 	/*
 	 * Keep track of where the inode has extent items mapped in order to
 	 * make sure the i_size adjustments are accurate
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index c3eb52dbe61cc..8ab9b6cd53ed4 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -56,7 +56,6 @@ enum {
 	IO_TREE_FS_EXCLUDED_EXTENTS,
 	IO_TREE_BTREE_INODE_IO,
 	IO_TREE_INODE_IO,
-	IO_TREE_INODE_IO_FAILURE,
 	IO_TREE_RELOC_BLOCKS,
 	IO_TREE_TRANS_DIRTY_PAGES,
 	IO_TREE_ROOT_DIRTY_LOG_PAGES,
@@ -250,18 +249,4 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 			       u64 *end, u64 max_bytes,
 			       struct extent_state **cached_state);
 
-/* This should be reworked in the future and put elsewhere. */
-struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start);
-int set_state_failrec(struct extent_io_tree *tree, u64 start,
-		      struct io_failure_record *failrec);
-void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start,
-		u64 end);
-int free_io_failure(struct extent_io_tree *failure_tree,
-		    struct extent_io_tree *io_tree,
-		    struct io_failure_record *rec);
-int clean_io_failure(struct btrfs_fs_info *fs_info,
-		     struct extent_io_tree *failure_tree,
-		     struct extent_io_tree *io_tree, u64 start,
-		     struct page *page, u64 ino, unsigned int pg_offset);
-
 #endif /* BTRFS_EXTENT_IO_TREE_H */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9d7835ba6d396..c0cb3d4f5440f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2169,66 +2169,6 @@ u64 count_range_bits(struct extent_io_tree *tree,
 	return total_bytes;
 }
 
-/*
- * set the private field for a given byte offset in the tree.  If there isn't
- * an extent_state there already, this does nothing.
- */
-int set_state_failrec(struct extent_io_tree *tree, u64 start,
-		      struct io_failure_record *failrec)
-{
-	struct rb_node *node;
-	struct extent_state *state;
-	int ret = 0;
-
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state->failrec = failrec;
-out:
-	spin_unlock(&tree->lock);
-	return ret;
-}
-
-struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start)
-{
-	struct rb_node *node;
-	struct extent_state *state;
-	struct io_failure_record *failrec;
-
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		failrec = ERR_PTR(-ENOENT);
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		failrec = ERR_PTR(-ENOENT);
-		goto out;
-	}
-
-	failrec = state->failrec;
-out:
-	spin_unlock(&tree->lock);
-	return failrec;
-}
-
 /*
  * searches a range in the state tree for a given mask.
  * If 'filled' == 1, this returns 1 only if every extent in the tree
@@ -2285,30 +2225,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	return bitset;
 }
 
-int free_io_failure(struct extent_io_tree *failure_tree,
-		    struct extent_io_tree *io_tree,
-		    struct io_failure_record *rec)
-{
-	int ret;
-	int err = 0;
-
-	set_state_failrec(failure_tree, rec->start, NULL);
-	ret = clear_extent_bits(failure_tree, rec->start,
-				rec->start + rec->len - 1,
-				EXTENT_LOCKED | EXTENT_DIRTY);
-	if (ret)
-		err = ret;
-
-	ret = clear_extent_bits(io_tree, rec->start,
-				rec->start + rec->len - 1,
-				EXTENT_DAMAGED);
-	if (ret && !err)
-		err = ret;
-
-	kfree(rec);
-	return err;
-}
-
 static int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 			     u64 length, u64 logical, struct page *page,
 			     unsigned int pg_offset, int mirror_num)
@@ -2361,287 +2277,6 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
 	return ret;
 }
 
-/*
- * each time an IO finishes, we do a fast check in the IO failure tree
- * to see if we need to process or clean up an io_failure_record
- */
-int clean_io_failure(struct btrfs_fs_info *fs_info,
-		     struct extent_io_tree *failure_tree,
-		     struct extent_io_tree *io_tree, u64 start,
-		     struct page *page, u64 ino, unsigned int pg_offset)
-{
-	u64 private;
-	struct io_failure_record *failrec;
-	struct extent_state *state;
-	int num_copies;
-	int ret;
-
-	private = 0;
-	ret = count_range_bits(failure_tree, &private, (u64)-1, 1,
-			       EXTENT_DIRTY, 0);
-	if (!ret)
-		return 0;
-
-	failrec = get_state_failrec(failure_tree, start);
-	if (IS_ERR(failrec))
-		return 0;
-
-	BUG_ON(!failrec->this_mirror);
-
-	if (sb_rdonly(fs_info->sb))
-		goto out;
-
-	spin_lock(&io_tree->lock);
-	state = find_first_extent_bit_state(io_tree,
-					    failrec->start,
-					    EXTENT_LOCKED);
-	spin_unlock(&io_tree->lock);
-
-	if (state && state->start <= failrec->start &&
-	    state->end >= failrec->start + failrec->len - 1) {
-		num_copies = btrfs_num_copies(fs_info, failrec->logical,
-					      failrec->len);
-		if (num_copies > 1)  {
-			repair_io_failure(fs_info, ino, start, failrec->len,
-					  failrec->logical, page, pg_offset,
-					  failrec->failed_mirror);
-		}
-	}
-
-out:
-	free_io_failure(failure_tree, io_tree, failrec);
-
-	return 0;
-}
-
-/*
- * Can be called when
- * - hold extent lock
- * - under ordered extent
- * - the inode is freeing
- */
-void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
-{
-	struct extent_io_tree *failure_tree = &inode->io_failure_tree;
-	struct io_failure_record *failrec;
-	struct extent_state *state, *next;
-
-	if (RB_EMPTY_ROOT(&failure_tree->state))
-		return;
-
-	spin_lock(&failure_tree->lock);
-	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
-	while (state) {
-		if (state->start > end)
-			break;
-
-		ASSERT(state->end <= end);
-
-		next = next_state(state);
-
-		failrec = state->failrec;
-		free_extent_state(state);
-		kfree(failrec);
-
-		state = next;
-	}
-	spin_unlock(&failure_tree->lock);
-}
-
-static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
-							     u64 start)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct io_failure_record *failrec;
-	struct extent_map *em;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
-	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	const u32 sectorsize = fs_info->sectorsize;
-	int ret;
-	u64 logical;
-
-	failrec = get_state_failrec(failure_tree, start);
-	if (!IS_ERR(failrec)) {
-		btrfs_debug(fs_info,
-	"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu",
-			failrec->logical, failrec->start, failrec->len);
-		/*
-		 * when data can be on disk more than twice, add to failrec here
-		 * (e.g. with a list for failed_mirror) to make
-		 * clean_io_failure() clean all those errors at once.
-		 */
-
-		return failrec;
-	}
-
-	failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
-	if (!failrec)
-		return ERR_PTR(-ENOMEM);
-
-	failrec->start = start;
-	failrec->len = sectorsize;
-	failrec->this_mirror = 0;
-	failrec->compress_type = BTRFS_COMPRESS_NONE;
-
-	read_lock(&em_tree->lock);
-	em = lookup_extent_mapping(em_tree, start, failrec->len);
-	if (!em) {
-		read_unlock(&em_tree->lock);
-		kfree(failrec);
-		return ERR_PTR(-EIO);
-	}
-
-	if (em->start > start || em->start + em->len <= start) {
-		free_extent_map(em);
-		em = NULL;
-	}
-	read_unlock(&em_tree->lock);
-	if (!em) {
-		kfree(failrec);
-		return ERR_PTR(-EIO);
-	}
-
-	logical = start - em->start;
-	logical = em->block_start + logical;
-	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
-		logical = em->block_start;
-		failrec->compress_type = em->compress_type;
-	}
-
-	btrfs_debug(fs_info,
-		    "Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu",
-		    logical, start, failrec->len);
-
-	failrec->logical = logical;
-	free_extent_map(em);
-
-	/* Set the bits in the private failure tree */
-	ret = set_extent_bits(failure_tree, start, start + sectorsize - 1,
-			      EXTENT_LOCKED | EXTENT_DIRTY);
-	if (ret >= 0) {
-		ret = set_state_failrec(failure_tree, start, failrec);
-		/* Set the bits in the inode's tree */
-		ret = set_extent_bits(tree, start, start + sectorsize - 1,
-				      EXTENT_DAMAGED);
-	} else if (ret < 0) {
-		kfree(failrec);
-		return ERR_PTR(ret);
-	}
-
-	return failrec;
-}
-
-static bool btrfs_check_repairable(struct inode *inode,
-				   struct io_failure_record *failrec,
-				   int failed_mirror)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	int num_copies;
-
-	num_copies = btrfs_num_copies(fs_info, failrec->logical, failrec->len);
-	if (num_copies == 1) {
-		/*
-		 * we only have a single copy of the data, so don't bother with
-		 * all the retry and error correction code that follows. no
-		 * matter what the error is, it is very likely to persist.
-		 */
-		btrfs_debug(fs_info,
-			"Check Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return false;
-	}
-
-	/* The failure record should only contain one sector */
-	ASSERT(failrec->len == fs_info->sectorsize);
-
-	/*
-	 * There are two premises:
-	 * a) deliver good data to the caller
-	 * b) correct the bad sectors on disk
-	 *
-	 * Since we're only doing repair for one sector, we only need to get
-	 * a good copy of the failed sector and if we succeed, we have setup
-	 * everything for repair_io_failure to do the rest for us.
-	 */
-	ASSERT(failed_mirror);
-	failrec->failed_mirror = failed_mirror;
-	failrec->this_mirror++;
-	if (failrec->this_mirror == failed_mirror)
-		failrec->this_mirror++;
-
-	if (failrec->this_mirror > num_copies) {
-		btrfs_debug(fs_info,
-			"Check Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return false;
-	}
-
-	return true;
-}
-
-int btrfs_repair_one_sector(struct inode *inode,
-			    struct bio *failed_bio, u32 bio_offset,
-			    struct page *page, unsigned int pgoff,
-			    u64 start, int failed_mirror,
-			    submit_bio_hook_t *submit_bio_hook)
-{
-	struct io_failure_record *failrec;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct btrfs_bio *failed_bbio = btrfs_bio(failed_bio);
-	const int icsum = bio_offset >> fs_info->sectorsize_bits;
-	struct bio *repair_bio;
-	struct btrfs_bio *repair_bbio;
-
-	btrfs_debug(fs_info,
-		   "repair read error: read error at %llu", start);
-
-	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
-
-	failrec = btrfs_get_io_failure_record(inode, start);
-	if (IS_ERR(failrec))
-		return PTR_ERR(failrec);
-
-
-	if (!btrfs_check_repairable(inode, failrec, failed_mirror)) {
-		free_io_failure(failure_tree, tree, failrec);
-		return -EIO;
-	}
-
-	repair_bio = btrfs_bio_alloc(1);
-	repair_bbio = btrfs_bio(repair_bio);
-	repair_bbio->file_offset = start;
-	repair_bio->bi_opf = REQ_OP_READ;
-	repair_bio->bi_end_io = failed_bio->bi_end_io;
-	repair_bio->bi_iter.bi_sector = failrec->logical >> 9;
-	repair_bio->bi_private = failed_bio->bi_private;
-
-	if (failed_bbio->csum) {
-		const u32 csum_size = fs_info->csum_size;
-
-		repair_bbio->csum = repair_bbio->csum_inline;
-		memcpy(repair_bbio->csum,
-		       failed_bbio->csum + csum_size * icsum, csum_size);
-	}
-
-	bio_add_page(repair_bio, page, failrec->len, pgoff);
-	repair_bbio->iter = repair_bio->bi_iter;
-
-	btrfs_debug(btrfs_sb(inode->i_sb),
-		    "repair read error: submitting new read to mirror %d",
-		    failrec->this_mirror);
-
-	/*
-	 * At this point we have a bio, so any errors from submit_bio_hook()
-	 * will be handled by the endio on the repair_bio, so we can't return an
-	 * error here.
-	 */
-	submit_bio_hook(inode, repair_bio, failrec->this_mirror, failrec->compress_type);
-	return BLK_STS_OK;
-}
-
 static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 72966cf21961e..901f24cf2de28 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -61,7 +61,6 @@ struct btrfs_root;
 struct btrfs_inode;
 struct btrfs_io_bio;
 struct btrfs_fs_info;
-struct io_failure_record;
 struct extent_io_tree;
 
 typedef void (submit_bio_hook_t)(struct inode *inode, struct bio *bio,
@@ -252,30 +251,6 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size);
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num);
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the sector is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-	struct page *page;
-	u64 start;
-	u64 len;
-	u64 logical;
-	enum btrfs_compression_type compress_type;
-	int this_mirror;
-	int failed_mirror;
-};
-
-int btrfs_repair_one_sector(struct inode *inode,
-			    struct bio *failed_bio, u32 bio_offset,
-			    struct page *page, unsigned int pgoff,
-			    u64 start, int failed_mirror,
-			    submit_bio_hook_t *submit_bio_hook);
-
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 bool find_lock_delalloc_range(struct inode *inode,
 			     struct page *locked_page, u64 *start,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 76575b1bf30ad..b6186fc4466a6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3133,8 +3133,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 					ordered_extent->disk_num_bytes);
 	}
 
-	btrfs_free_io_failure_record(inode, start, end);
-
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
 		logical_len = ordered_extent->truncated_len;
@@ -5345,8 +5343,6 @@ void btrfs_evict_inode(struct inode *inode)
 	if (is_bad_inode(inode))
 		goto no_delete;
 
-	btrfs_free_io_failure_record(BTRFS_I(inode), 0, (u64)-1);
-
 	if (test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags))
 		goto no_delete;
 
@@ -8818,12 +8814,9 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	inode = &ei->vfs_inode;
 	extent_map_tree_init(&ei->extent_tree);
 	extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO, inode);
-	extent_io_tree_init(fs_info, &ei->io_failure_tree,
-			    IO_TREE_INODE_IO_FAILURE, inode);
 	extent_io_tree_init(fs_info, &ei->file_extent_tree,
 			    IO_TREE_INODE_FILE_EXTENT, inode);
 	ei->io_tree.track_uptodate = true;
-	ei->io_failure_tree.track_uptodate = true;
 	atomic_set(&ei->sync_writers, 0);
 	mutex_init(&ei->log_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 290f07eb050af..764e9643c123c 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -82,7 +82,6 @@ struct btrfs_space_info;
 	EM( IO_TREE_FS_EXCLUDED_EXTENTS,  "EXCLUDED_EXTENTS")	    \
 	EM( IO_TREE_BTREE_INODE_IO,	  "BTREE_INODE_IO")	    \
 	EM( IO_TREE_INODE_IO,		  "INODE_IO")		    \
-	EM( IO_TREE_INODE_IO_FAILURE,	  "INODE_IO_FAILURE")	    \
 	EM( IO_TREE_RELOC_BLOCKS,	  "RELOC_BLOCKS")	    \
 	EM( IO_TREE_TRANS_DIRTY_PAGES,	  "TRANS_DIRTY_PAGES")      \
 	EM( IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES")   \
-- 
2.30.2



* [PATCH 9/9] btrfs: fold repair_io_failure into btrfs_repair_eb_io_failure
  2022-05-27  8:43 simple synchronous read repair v2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2022-05-27  8:43 ` [PATCH 8/9] btrfs: remove io_failure_record infrastructure completely Christoph Hellwig
@ 2022-05-27  8:43 ` Christoph Hellwig
  2022-06-06 21:25 ` simple synchronous read repair v2 David Sterba
  9 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2022-05-27  8:43 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: Qu Wenruo, linux-btrfs

Fold repair_io_failure into the only remaining caller.

This is still inefficient for single-page I/Os, but I have some ideas
on how to improve the metadata repair in the future.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 50 +++++++++++++++----------------------
 1 file changed, 17 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c0cb3d4f5440f..57a709262b730 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2225,35 +2225,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	return bitset;
 }
 
-static int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
-			     u64 length, u64 logical, struct page *page,
-			     unsigned int pg_offset, int mirror_num)
-{
-	struct bio_vec bvec;
-	struct bio bio;
-	int ret;
-
-	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
-	BUG_ON(!mirror_num);
-
-	if (btrfs_repair_one_zone(fs_info, logical))
-		return 0;
-
-	bio_init(&bio, NULL, &bvec, 1, REQ_OP_WRITE | REQ_SYNC);
-	bio.bi_iter.bi_sector = logical >> 9;
-	__bio_add_page(&bio, page, length, pg_offset);
-	ret = btrfs_map_repair_bio(fs_info, &bio, mirror_num);
-	bio_uninit(&bio);
-
-	if (ret)
-		return ret;
-
-	btrfs_info_rl_in_rcu(fs_info,
-		"read error corrected: ino %llu off %llu (logical %llu)",
-			  ino, start, logical);
-	return 0;
-}
-
 int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
 {
 	struct btrfs_fs_info *fs_info = eb->fs_info;
@@ -2261,20 +2232,33 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
 	int i, num_pages = num_extent_pages(eb);
 	int ret = 0;
 
+	WARN_ON_ONCE(!mirror_num);
+
 	if (sb_rdonly(fs_info->sb))
 		return -EROFS;
 
+	if (btrfs_repair_one_zone(fs_info, eb->start))
+		return 0;
+
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = eb->pages[i];
+		struct bio_vec bvec;
+		struct bio bio;
+
+		bio_init(&bio, NULL, &bvec, 1, REQ_OP_WRITE | REQ_SYNC);
+		bio.bi_iter.bi_sector = start >> 9;
+		__bio_add_page(&bio, p, PAGE_SIZE, start - page_offset(p));
+		ret = btrfs_map_repair_bio(fs_info, &bio, mirror_num);
+		bio_uninit(&bio);
 
-		ret = repair_io_failure(fs_info, 0, start, PAGE_SIZE, start, p,
-					start - page_offset(p), mirror_num);
 		if (ret)
-			break;
+			return ret;
 		start += PAGE_SIZE;
 	}
 
-	return ret;
+	btrfs_info_rl_in_rcu(fs_info,
+		"metadata read error corrected: logical %llu.", eb->start);
+	return 0;
 }
 
 static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
-- 
2.30.2



* Re: simple synchronous read repair v2
  2022-05-27  8:43 simple synchronous read repair v2 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2022-05-27  8:43 ` [PATCH 9/9] btrfs: fold repair_io_failure into btrfs_repair_eb_io_failure Christoph Hellwig
@ 2022-06-06 21:25 ` David Sterba
  2022-06-06 22:53   ` Qu Wenruo
  9 siblings, 1 reply; 17+ messages in thread
From: David Sterba @ 2022-06-06 21:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Qu Wenruo, linux-btrfs

On Fri, May 27, 2022 at 10:43:11AM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this is my take on the read repair code.  It borrow a lot of concepts
> and patches from Qu's attempt.  The big difference is that it does away
> with multiple in-flight repair bios, but instead just does one at a
> time, but tries to make it as big as possible.
> 
> My aim here is mostly to fix up the messy I/O completions for the
> direct I/O path, that make further bio optimizations rather annoying,
> but it also gives better I/O patterns for repair (although I'm not sure
> anyone cares) and removes a fair chunk of code.
> 
> [this is on top of a not commit yet series.  I kinda hate doing that, but
>  with all the hot repair discussion going on right now we might as well
>  have an uptodate version of this series out on the list]
> 
> Git tree:
> 
>    git://git.infradead.org/users/hch/misc.git btrfs-read_repair
> 
> Gitweb:
> 
>    http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-read_repair
> 
> Changes since v1:
>  - rebased on top of the "misc btrfs cleanups" series that contains
>    various patches previously in this series and the
>    "cleanup btrfs bio handling, part 2 v4" series
>  - handle partial reads due to checksum failures
>  - handle large I/O for parity RAID repair as well
>  - add a lot more comments
>  - rename btrfs_read_repair_end to btrfs_read_repair_finish

The preparatory patches and other bio patches are now in misc-next, so
I'm adding this branch to for-next.  At this point I'm considering this
for merge as it has better performance compared to the simple repair
from Qu, though I still need some time to finish the review.


* Re: simple synchronous read repair v2
  2022-06-06 21:25 ` simple synchronous read repair v2 David Sterba
@ 2022-06-06 22:53   ` Qu Wenruo
  2022-06-07  6:16     ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2022-06-06 22:53 UTC (permalink / raw)
  To: dsterba, Christoph Hellwig, Chris Mason, Josef Bacik,
	David Sterba, Qu Wenruo, linux-btrfs



On 2022/6/7 05:25, David Sterba wrote:
> On Fri, May 27, 2022 at 10:43:11AM +0200, Christoph Hellwig wrote:
>> Hi all,
>>
>> this is my take on the read repair code.  It borrow a lot of concepts
>> and patches from Qu's attempt.  The big difference is that it does away
>> with multiple in-flight repair bios, but instead just does one at a
>> time, but tries to make it as big as possible.
>>
>> My aim here is mostly to fix up the messy I/O completions for the
>> direct I/O path, that make further bio optimizations rather annoying,
>> but it also gives better I/O patterns for repair (although I'm not sure
>> anyone cares) and removes a fair chunk of code.
>>
>> [this is on top of a not commit yet series.  I kinda hate doing that, but
>>   with all the hot repair discussion going on right now we might as well
>>   have an uptodate version of this series out on the list]
>>
>> Git tree:
>>
>>     git://git.infradead.org/users/hch/misc.git btrfs-read_repair
>>
>> Gitweb:
>>
>>     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-read_repair
>>
>> Changes since v1:
>>   - rebased on top of the "misc btrfs cleanups" series that contains
>>     various patches previously in this series and the
>>     "cleanup btrfs bio handling, part 2 v4" series
>>   - handle partial reads due to checksum failures
>>   - handle large I/O for parity RAID repair as well
>>   - add a lot more comments
>>   - rename btrfs_read_repair_end to btrfs_read_repair_finish
>
> The preparatory patches and other bio patches are now in misc-next, so
> I'm adding this branch for-next. At this point I'm cosidering this for
> merge as it has better performance compared to the simple repair from
> Qu, though I still need some time to finish the review.

OK, although I'd say, considering the latest read-repair test, we may
want to simplify the write part, to only write the data back to the
initial mirror.

Thanks,
Qu


* Re: simple synchronous read repair v2
  2022-06-06 22:53   ` Qu Wenruo
@ 2022-06-07  6:16     ` Christoph Hellwig
  2022-06-07  6:34       ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2022-06-07  6:16 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: dsterba, Christoph Hellwig, Chris Mason, Josef Bacik,
	David Sterba, Qu Wenruo, linux-btrfs

On Tue, Jun 07, 2022 at 06:53:35AM +0800, Qu Wenruo wrote:
> OK, although I'd say, considering the latest read-repair test,

What is the latest read repair test?

> we may
> want to simplify the write part, to only write the data back to the
> initial mirror.

Why would we not write back the correct data to all known bad mirrors?


* Re: simple synchronous read repair v2
  2022-06-07  6:16     ` Christoph Hellwig
@ 2022-06-07  6:34       ` Qu Wenruo
  2022-06-07  6:45         ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2022-06-07  6:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dsterba, Chris Mason, Josef Bacik, David Sterba, Qu Wenruo, linux-btrfs



On 2022/6/7 14:16, Christoph Hellwig wrote:
> On Tue, Jun 07, 2022 at 06:53:35AM +0800, Qu Wenruo wrote:
>> OK, although I'd say, considering the latest read-repair test,
>
> What is the latest read repair test?

IIRC some RAID1C3 cases like:

https://patchwork.kernel.org/project/linux-btrfs/patch/20220527081915.2024853-9-hch@lst.de/

Currently we only write the recovered data back to the *previous*
corrupted mirror, which doesn't ensure the initial mirror gets repaired.

>
>> we may
>> want to simplify the write part, to only write the data back to the
>> initial mirror.
>
> Why would we not write back the correct data to all known bad mirrors?

Because then you also need to record which mirrors are bad.

I doubt you want a more complex version that writes back to all the
corrupted copies.

And in fact, only writing back to the initial mirror is also going to
make recovery faster and easier, as now we only need to do one
synchronous writeback.
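
Roughly what I have in mind, with made-up helper names just for
illustration (none of these are the real btrfs functions):

	static int read_repair_initial_mirror_only(int initial_mirror,
						   int num_copies)
	{
		int mirror = initial_mirror;
		int i;

		/* try the other copies in order, mirrors are 1..num_copies */
		for (i = 1; i < num_copies; i++) {
			mirror = (mirror % num_copies) + 1;
			if (read_and_verify(mirror) == 0)
				/* one synchronous writeback, initial mirror only */
				return write_copy_to(initial_mirror, mirror);
		}
		return -EIO;	/* no good copy found */
	}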

Thanks,
Qu


* Re: simple synchronous read repair v2
  2022-06-07  6:34       ` Qu Wenruo
@ 2022-06-07  6:45         ` Christoph Hellwig
  2022-06-07  8:13           ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2022-06-07  6:45 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, dsterba, Chris Mason, Josef Bacik,
	David Sterba, Qu Wenruo, linux-btrfs

On Tue, Jun 07, 2022 at 02:34:02PM +0800, Qu Wenruo wrote:
> Currently we only write the recovered data back to the *previous*
> corrupted mirror, which doesn't ensure the initial mirror gets repaired.

Yes.  So it leaves the file system in a degraded state and reduces
redundancy.

> And in fact, only writing back to the initial mirror is also going to
> make recovery faster and easier, as now we only need to do one
> synchronous writeback.

It does make recovery faster by not doing most of it.  By that logic we
should stop repairing in the I/O path at all.  Which, thinking about it,
might actually not be a bad idea.  Instead of doing synchronous repair
from the I/O path, just serve the actual read I/O and instead
kick off a background scrub process to fix it.  This removes code
duplication between read repair and scrub, speeds up the reads
themselves, including reducing the amount of I/O we might be doing
under memory pressure, and allows the scrub code to actually do the
right thing everywhere, including rebuilding parity in the RAID6
case, something that the read repair (both old and new) does not handle
at all.
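
As a rough sketch of that idea (read_from_another_mirror() and
queue_ranged_scrub() are invented names, nothing like them exists
today):

	static int handle_read_error(u64 logical, u32 len, int failed_mirror)
	{
		int ret = read_from_another_mirror(logical, len, failed_mirror);

		if (ret)
			return ret;	/* no good copy at all */

		/* leave the actual repair to a background scrub of this range */
		queue_ranged_scrub(logical, len);
		return 0;
	}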


* Re: simple synchronous read repair v2
  2022-06-07  6:45         ` Christoph Hellwig
@ 2022-06-07  8:13           ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2022-06-07  8:13 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: dsterba, Chris Mason, Josef Bacik, David Sterba, linux-btrfs



On 2022/6/7 14:45, Christoph Hellwig wrote:
> On Tue, Jun 07, 2022 at 02:34:02PM +0800, Qu Wenruo wrote:
>> Currently we only write the recovered data back to the *previous*
>> corrupted mirror, which doesn't ensure the initial mirror gets repaired.
>
> Yes.  So it leaves the file system in a degraded state and reduces
> redundancy.
>
>> And in fact, only writing back to the initial mirror is also going to
>> make recovery faster and easier, as now we only need to do one
>> synchronous writeback.
>
> It does make recovery faster by not doing most of it.  By that logic we
> should stop repairing in the I/O path at all.  Which, thinking about it,
> might actually not be a bad idea.  Instead of doing synchronous repair
> from the I/O path, just serve the actual read I/O and instead
> kick off a background scrub process to fix it.

That has its own problem, mostly the lack of granularity in scrub.

Currently scrub can only work at the device level, which means we would
need to scrub the whole device just to fix a few corrupted sectors.

I get your point, but unless we greatly enhance the granularity of
scrub it's not feasible for now, so I'm afraid we have to stay on the
current path for the near future.
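
What is missing is, roughly, a ranged entry point, say a hypothetical:

	int btrfs_scrub_logical_range(struct btrfs_fs_info *fs_info,
				      u64 logical, u64 len, int mirror_num);

With something like that, read repair could hand just the corrupted
range to scrub instead of the whole device.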

I totally understand there is a ton of scrub-related code that needs to
be improved, from the damn stupid multiple-device scrubbing method to
how to handle RAID56 scrub better.

And that would be my next main target, as it is the basis for mount-time
scrub (and thus for the write-intent bitmap tree).

>  This removes code
> duplication between read repair and scrub, speeds up the reads
> themselves, including reducing the amount of I/O we might be doing
> under memory pressure, and allows the scrub code to actually do the
> right thing everywhere, including rebuilding parity in the RAID6
> case, something that the read repair (both old and new) does not handle
> at all.
>

Yep, for RAID56 it's always a pain in the backend.

But let's focus on the traditional mirror-based profiles for now, and I
still think just writing back to the initial mirror is not a bad idea
for this particular case.

Another alternative is to write the recovered data back to any
corrupted mirrors asynchronously, along the lines of the sketch below.
(This means more complex code though.)
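
Rough shape only, all names invented: remember which mirrors returned
bad data while searching for a good copy, then rewrite them from a work
item so the reader does not wait for the extra writes:

	struct read_repair_work {
		struct work_struct work;
		u64 logical;
		u32 len;
		unsigned long bad_mirrors;	/* bitmap of corrupted mirrors */
	};

	static void read_repair_worker(struct work_struct *work)
	{
		struct read_repair_work *rw =
			container_of(work, struct read_repair_work, work);
		int mirror;

		/* write the good copy to every mirror recorded as bad */
		for_each_set_bit(mirror, &rw->bad_mirrors, BITS_PER_LONG)
			write_good_copy_to(rw->logical, rw->len, mirror);
		kfree(rw);
	}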

For the existing at-most-two-mirrors profiles (DUP, RAID1, RAID10),
this is no different from the existing behavior anyway.

RAID1C3/C4 are new profiles (for which we didn't really test
read-repair that well).
Defining a better and more deterministic behavior is what we should do
when introducing those two profiles.

Thanks,
Qu


* Re: [PATCH 7/9] btrfs: use the new read repair code for buffered reads
  2022-05-27  8:43 ` [PATCH 7/9] btrfs: use the new read repair code for buffered reads Christoph Hellwig
@ 2022-06-14 16:25   ` Josef Bacik
  0 siblings, 0 replies; 17+ messages in thread
From: Josef Bacik @ 2022-06-14 16:25 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chris Mason, David Sterba, Qu Wenruo, linux-btrfs

On Fri, May 27, 2022 at 10:43:18AM +0200, Christoph Hellwig wrote:
> Start/end a repair session as needed in end_bio_extent_readpage and
> submit_data_read_repair.  Unlike direct I/O, the buffered I/O handler
> completes I/O on a per-sector basis and thus needs to pass an endio
> handler to the repair code, which unlocks all pages and marks them
> as either uptodate or not.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/extent_io.c | 76 ++++++++++++++++++++------------------------
>  1 file changed, 35 insertions(+), 41 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 27775031ed2d4..9d7835ba6d396 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -30,6 +30,7 @@
>  #include "zoned.h"
>  #include "block-group.h"
>  #include "compression.h"
> +#include "read-repair.h"
>  
>  static struct kmem_cache *extent_state_cache;
>  static struct kmem_cache *extent_buffer_cache;
> @@ -2683,14 +2684,29 @@ static void end_sector_io(struct page *page, u64 offset, bool uptodate)
>  				    offset + sectorsize - 1, &cached);
>  }
>  
> +static void end_read_repair(struct btrfs_bio *repair_bbio, struct inode *inode,
> +		bool uptodate)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct bvec_iter iter;
> +	struct bio_vec bv;
> +	u32 offset;
> +
> +	btrfs_bio_for_each_sector(fs_info, bv, repair_bbio, iter, offset)
> +		end_sector_io(bv.bv_page, repair_bbio->file_offset + offset,
> +				uptodate);
> +}
> +
>  static void submit_data_read_repair(struct inode *inode, struct bio *failed_bio,
>  				    u32 bio_offset, const struct bio_vec *bvec,
> -				    int failed_mirror, unsigned int error_bitmap)
> +				    int failed_mirror,
> +				    unsigned int error_bitmap,
> +				    struct btrfs_read_repair *rr)
>  {
> -	const unsigned int pgoff = bvec->bv_offset;
> +	struct btrfs_bio *failed_bbio = btrfs_bio(failed_bio);
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
>  	struct page *page = bvec->bv_page;
> -	const u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
>  	const u64 end = start + bvec->bv_len - 1;
>  	const u32 sectorsize = fs_info->sectorsize;
>  	const int nr_bits = (end + 1 - start) >> fs_info->sectorsize_bits;
> @@ -2712,38 +2728,17 @@ static void submit_data_read_repair(struct inode *inode, struct bio *failed_bio,
>  
>  	/* Iterate through all the sectors in the range */
>  	for (i = 0; i < nr_bits; i++) {
> -		const unsigned int offset = i * sectorsize;
> -		bool uptodate = false;
> -		int ret;
> -
> -		if (!(error_bitmap & (1U << i))) {
> -			/*
> -			 * This sector has no error, just end the page read
> -			 * and unlock the range.
> -			 */
> -			uptodate = true;
> -			goto next;
> +		bool uptodate = !(error_bitmap & (1U << i));
> +
> +		if (uptodate ||
> +		    !btrfs_read_repair_add(rr, failed_bbio, inode,
> +		    		bio_offset)) {
> +			btrfs_read_repair_finish(rr, failed_bbio, inode,
> +					bio_offset, end_read_repair);
> +			end_sector_io(page, start, uptodate);
>  		}
> -
> -		ret = btrfs_repair_one_sector(inode, failed_bio,
> -				bio_offset + offset,
> -				page, pgoff + offset, start + offset,
> -				failed_mirror, btrfs_submit_data_read_bio);
> -		if (!ret) {
> -			/*
> -			 * We have submitted the read repair, the page release
> -			 * will be handled by the endio function of the
> -			 * submitted repair bio.
> -			 * Thus we don't need to do any thing here.
> -			 */
> -			continue;
> -		}
> -		/*
> -		 * Continue on failed repair, otherwise the remaining sectors
> -		 * will not be properly unlocked.
> -		 */
> -next:
> -		end_sector_io(page, start + offset, uptodate);
> +		bio_offset += sectorsize;
> +		start += sectorsize;
>  	}
>  }
>  
> @@ -2954,8 +2949,6 @@ static void end_bio_extent_readpage(struct bio *bio)
>  	struct bio_vec *bvec;
>  	struct btrfs_bio *bbio = btrfs_bio(bio);
>  	int mirror = bbio->mirror_num;
> -	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> -	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
>  	bool uptodate = !bio->bi_status;
>  	struct processed_extent processed = { 0 };
>  	/*
> @@ -2964,6 +2957,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>  	 */
>  	u32 bio_offset = 0;
>  	struct bvec_iter_all iter_all;
> +	struct btrfs_read_repair rr = { };
>  
>  	btrfs_bio(bio)->file_offset =
>  		page_offset(first_vec->bv_page) + first_vec->bv_offset;
> @@ -3020,10 +3014,6 @@ static void end_bio_extent_readpage(struct bio *bio)
>  			loff_t i_size = i_size_read(inode);
>  			pgoff_t end_index = i_size >> PAGE_SHIFT;
>  
> -			clean_io_failure(BTRFS_I(inode)->root->fs_info,
> -					 failure_tree, tree, start, page,
> -					 btrfs_ino(BTRFS_I(inode)), 0);
> -
>  			/*
>  			 * Zero out the remaining part if this range straddles
>  			 * i_size.
> @@ -3063,9 +3053,11 @@ static void end_bio_extent_readpage(struct bio *bio)
>  			 * and bad sectors, we just continue to the next bvec.
>  			 */
>  			submit_data_read_repair(inode, bio, bio_offset, bvec,
> -						mirror, error_bitmap);
> +						mirror, error_bitmap, &rr);
>  		} else {
>  			/* Update page status and unlock */
> +			btrfs_read_repair_finish(&rr, btrfs_bio(bio), inode,
> +					bio_offset, end_read_repair);
>  			end_page_read(page, uptodate, start, len);
>  			endio_readpage_release_extent(&processed, BTRFS_I(inode),
>  					start, end, PageUptodate(page));
> @@ -3076,6 +3068,8 @@ static void end_bio_extent_readpage(struct bio *bio)
>  
>  	}
>  	/* Release the last extent */
> +	btrfs_read_repair_finish(&rr, btrfs_bio(bio), inode, bio_offset,
> +			end_read_repair);

This part is wrong, it leads to a BUG_ON(!locked_page(page)) when we do
the end_io.  We're essentially calling btrfs_read_repair_finish for a
page that is outside of our bio.  The continuous testing caught this
last week but I didn't notice it until yesterday.  I pulled this part
out and now btrfs/101 runs without panicking.
Thanks,

Josef

