* [PATCH v2 0/7] btrfs: read-repair rework based on bitmap
@ 2022-05-25 10:59 Qu Wenruo
  2022-05-25 10:59 ` [PATCH v2 1/7] btrfs: save the original bi_iter into btrfs_bio for buffered read Qu Wenruo
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Qu Wenruo @ 2022-05-25 10:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Christoph Hellwig

This is the bitmap version revived, based on Christoph's
cleanup series.

The branch can be fetched from my repo:
https://github.com/adam900710/linux/tree/read_repair

Changelog:
v2:
- Still go with the bitmap version to get batched submission

- Instead of using the old failed bio to hold all the pages, here we
  use an ordinary bio to hold only the corrupted sectors.

- Unlike the previous bitmap version, this time there is no memory
  allocation. The needed bitmap is on-stack, and limited to 32 bits.

- Use synchronous submission for both read and write

- Only use the bitmap for the corrupted range.
  Unlike the previous version, which allocated a bitmap covering the
  whole bio. That was a waste of memory.

- btrfs_read_repair_add_sector() will finish the existing repair early
  for a non-contiguous new sector, or if we have reached the size limit
  (due to the bitmap size limit, currently 128K for 4K sectorsize).

  This allows us to use a fixed bitmap size.


The core function is still btrfs_read_repair_finish().

But now, btrfs_read_repair_finish() is either called without any
corrupted sectors, or it only faces a contiguous range of corrupted
sectors.

Then we handle the range by iterating over all the remaining mirrors.

For each mirror we do the following:

1) Try to add the current bad sector into our io_bio.
   If the result would not be contiguous, we just submit the current
   io_bio and wait for it.
   Then add the new sector into the fresh io_bio.

2) If the io_bio is not empty, submit it.

By 1) and 2), we will read all bad sectors from the new mirror.

3) Check if the data is fine and update our ctrl::bad_bitmap.

We either end with all sectors repaired, or all mirrors exhausted.

The advantage of the bitmap method is that we only try at most
(num_copies - 1) times, no matter what the corruption pattern is.
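
For reference, this is a condensed sketch of that loop, using the names
from btrfs_read_repair_finish() in patch 4 (error handling trimmed):

	for (mirror = get_next_mirror(ctrl->failed_mirror, ctrl->num_copies);
	     mirror != ctrl->failed_mirror;
	     mirror = get_next_mirror(mirror, ctrl->num_copies)) {
		/*
		 * Batched read from this mirror, re-check csums and write
		 * good copies back; clears bits in ctrl->bad_bitmap for
		 * every sector that got repaired.
		 */
		repair_from_mirror(ctrl, mirror);

		/* All sectors repaired, stop early. */
		if (find_first_bit(&ctrl->bad_bitmap, nr_sectors) == nr_sectors)
			break;
	}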

On the other hand, in the worst case we're still doing sector-by-sector
repair.
For the best case (aka, contiguous corruption), we still do batched bio
submission, and are thus still way better than sector-by-sector
repair.

Furthermore, all loops in the code are regular for() loops, with no
hacks in how we loop.

But I have to admit that, even though repair_from_mirror() and
btrfs_read_repair_finish() are easy to read, the details of bio page
adding and submission are all hidden inside io_add_or_submit() and
btrfs_read_repair_add_sector().

So although the bitmap method bounds the retry count, in the worst case
it is still no faster than sector-by-sector repair.

Cc: Christoph Hellwig <hch@lst.de>


Christoph Hellwig (1):
  btrfs: add a btrfs_map_bio_wait helper

Qu Wenruo (6):
  btrfs: save the original bi_iter into btrfs_bio for buffered read
  btrfs: make repair_io_failure available outside of extent_io.c
  btrfs: introduce new read-repair infrastructure
  btrfs: make buffered read path to use the new read repair
    infrastructure
  btrfs: make direct io read path to use the new read repair
    infrastructure
  btrfs: remove io_failure_record infrastructure completely

 fs/btrfs/Makefile            |   2 +-
 fs/btrfs/btrfs_inode.h       |   5 -
 fs/btrfs/extent-io-tree.h    |  15 --
 fs/btrfs/extent_io.c         | 436 +++--------------------------------
 fs/btrfs/extent_io.h         |  28 +--
 fs/btrfs/inode.c             |  60 ++---
 fs/btrfs/read-repair.c       | 328 ++++++++++++++++++++++++++
 fs/btrfs/read-repair.h       |  48 ++++
 fs/btrfs/volumes.c           |  21 ++
 fs/btrfs/volumes.h           |   2 +
 include/trace/events/btrfs.h |   1 -
 11 files changed, 458 insertions(+), 488 deletions(-)
 create mode 100644 fs/btrfs/read-repair.c
 create mode 100644 fs/btrfs/read-repair.h

-- 
2.36.1



* [PATCH v2 1/7] btrfs: save the original bi_iter into btrfs_bio for buffered read
  2022-05-25 10:59 [PATCH v2 0/7] btrfs: read-repair rework based on bitmap Qu Wenruo
@ 2022-05-25 10:59 ` Qu Wenruo
  2022-05-25 10:59 ` [PATCH v2 2/7] btrfs: make repair_io_failure available outside of extent_io.c Qu Wenruo
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2022-05-25 10:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Christoph Hellwig

Although we have btrfs_bio::iter, it currently has very limited usage:

- RAID56
  Where it is not needed at all

- btrfs_bio_clone()
  This is used mostly for direct IO.

For the incoming read repair patches, we want to grab the original
logical bytenr, and be able to iterate the range of the bio (no matter
if it's cloned).

So this patch will also save btrfs_bio::iter for buffered read bios at
submit_one_bio().
And for the sake of consistency, also save the btrfs_bio::iter for
direct IO at btrfs_submit_dio_bio().

The reason we don't save the iter in btrfs_map_bio() is that
btrfs_map_bio() has to handle various bios, with or without the
btrfs_bio bioset.
And we want to keep btrfs_map_bio() handling, and only handling, plain
bios without bothering with the bioset.
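
For reference, the read-repair code later in this series recovers the
logical bytenr of a sector from the saved iter like this (snippet taken
from patch 5):

	u64 logical = (failed_bbio->iter.bi_sector << SECTOR_SHIFT) +
		      bio_offset + offset;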

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 7 +++++++
 fs/btrfs/inode.c     | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1d144f655f65..1bd1b1253f9d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -188,6 +188,13 @@ static void submit_one_bio(struct bio *bio, int mirror_num,
 	/* Caller should ensure the bio has at least some range added */
 	ASSERT(bio->bi_iter.bi_size);
 
+	/*
+	 * Save the original bi_iter for read bios, as read repair wants the
+	 * original logical bytenr.
+	 */
+	if (bio_op(bio) == REQ_OP_READ)
+		btrfs_bio(bio)->iter = bio->bi_iter;
+
 	if (is_data_inode(tree->private_data))
 		btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
 					    compress_type);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 34466b543ed9..dd0882e1b982 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7974,6 +7974,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 		ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
 		if (ret)
 			goto err;
+		/* Check submit_one_bio() for the reason. */
+		btrfs_bio(bio)->iter = bio->bi_iter;
 	}
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-- 
2.36.1



* [PATCH v2 2/7] btrfs: make repair_io_failure available outside of extent_io.c
  2022-05-25 10:59 [PATCH v2 0/7] btrfs: read-repair rework based on bitmap Qu Wenruo
  2022-05-25 10:59 ` [PATCH v2 1/7] btrfs: save the original bi_iter into btrfs_bio for buffered read Qu Wenruo
@ 2022-05-25 10:59 ` Qu Wenruo
  2022-05-25 10:59 ` [PATCH v2 3/7] btrfs: add a btrfs_map_bio_wait helper Qu Wenruo
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2022-05-25 10:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Christoph Hellwig, Johannes Thumshirn

Remove the static so that the function can be used by the new read
repair code, and give it a btrfs_ prefix.

Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent_io.c | 19 ++++++++++---------
 fs/btrfs/extent_io.h |  3 +++
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1bd1b1253f9d..1083d6cfa858 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2321,9 +2321,9 @@ int free_io_failure(struct extent_io_tree *failure_tree,
  * currently, there can be no more than two copies of every data bit. thus,
  * exactly one rewrite is required.
  */
-static int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
-			     u64 length, u64 logical, struct page *page,
-			     unsigned int pg_offset, int mirror_num)
+int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
+			    u64 length, u64 logical, struct page *page,
+			    unsigned int pg_offset, int mirror_num)
 {
 	struct btrfs_device *dev;
 	struct bio_vec bvec;
@@ -2415,8 +2415,9 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = eb->pages[i];
 
-		ret = repair_io_failure(fs_info, 0, start, PAGE_SIZE, start, p,
-					start - page_offset(p), mirror_num);
+		ret = btrfs_repair_io_failure(fs_info, 0, start, PAGE_SIZE,
+					      start, p, start - page_offset(p),
+					      mirror_num);
 		if (ret)
 			break;
 		start += PAGE_SIZE;
@@ -2466,9 +2467,9 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
 		num_copies = btrfs_num_copies(fs_info, failrec->logical,
 					      failrec->len);
 		if (num_copies > 1)  {
-			repair_io_failure(fs_info, ino, start, failrec->len,
-					  failrec->logical, page, pg_offset,
-					  failrec->failed_mirror);
+			btrfs_repair_io_failure(fs_info, ino, start,
+					failrec->len, failrec->logical,
+					page, pg_offset, failrec->failed_mirror);
 		}
 	}
 
@@ -2626,7 +2627,7 @@ static bool btrfs_check_repairable(struct inode *inode,
 	 *
 	 * Since we're only doing repair for one sector, we only need to get
 	 * a good copy of the failed sector and if we succeed, we have setup
-	 * everything for repair_io_failure to do the rest for us.
+	 * everything for btrfs_repair_io_failure() to do the rest for us.
 	 */
 	ASSERT(failed_mirror);
 	failrec->failed_mirror = failed_mirror;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 956fa434df43..6cdcea1551a6 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -276,6 +276,9 @@ int btrfs_repair_one_sector(struct inode *inode,
 			    struct page *page, unsigned int pgoff,
 			    u64 start, int failed_mirror,
 			    submit_bio_hook_t *submit_bio_hook);
+int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
+			    u64 length, u64 logical, struct page *page,
+			    unsigned int pg_offset, int mirror_num);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 bool find_lock_delalloc_range(struct inode *inode,
-- 
2.36.1



* [PATCH v2 3/7] btrfs: add a btrfs_map_bio_wait helper
  2022-05-25 10:59 [PATCH v2 0/7] btrfs: read-repair rework based on bitmap Qu Wenruo
  2022-05-25 10:59 ` [PATCH v2 1/7] btrfs: save the original bi_iter into btrfs_bio for buffered read Qu Wenruo
  2022-05-25 10:59 ` [PATCH v2 2/7] btrfs: make repair_io_failure available outside of extent_io.c Qu Wenruo
@ 2022-05-25 10:59 ` Qu Wenruo
  2022-05-25 10:59 ` [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure Qu Wenruo
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2022-05-25 10:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Christoph Hellwig, Johannes Thumshirn

From: Christoph Hellwig <hch@lst.de>

This helper works like submit_bio_wait(), but goes through the btrfs
bio mapping using btrfs_map_bio().
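
For reference, the read-repair code added in patch 4 uses it to do a
synchronous, mirror-specific submission roughly like this:

	io_bio->bi_opf = opf;

	/* Returns the bio's bi_status once the IO has completed. */
	ret = btrfs_map_bio_wait(fs_info, io_bio, mirror);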

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++++++
 fs/btrfs/volumes.h |  2 ++
 2 files changed, 23 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0819db46dbc4..8925bc606db7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6818,6 +6818,27 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	return BLK_STS_OK;
 }
 
+static void btrfs_end_io_sync(struct bio *bio)
+{
+	complete(bio->bi_private);
+}
+
+blk_status_t btrfs_map_bio_wait(struct btrfs_fs_info *fs_info, struct bio *bio,
+		int mirror)
+{
+	DECLARE_COMPLETION_ONSTACK(done);
+	blk_status_t ret;
+
+	bio->bi_private = &done;
+	bio->bi_end_io = btrfs_end_io_sync;
+	ret = btrfs_map_bio(fs_info, bio, mirror);
+	if (ret)
+		return ret;
+
+	wait_for_completion_io(&done);
+	return bio->bi_status;
+}
+
 static bool dev_args_match_fs_devices(const struct btrfs_dev_lookup_args *args,
 				      const struct btrfs_fs_devices *fs_devices)
 {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6f784d4f5466..b346f6c40151 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -555,6 +555,8 @@ struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
 void btrfs_mapping_tree_free(struct extent_map_tree *tree);
 blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			   int mirror_num);
+blk_status_t btrfs_map_bio_wait(struct btrfs_fs_info *fs_info, struct bio *bio,
+		int mirror);
 int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 		       fmode_t flags, void *holder);
 struct btrfs_device *btrfs_scan_one_device(const char *path,
-- 
2.36.1



* [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-25 10:59 [PATCH v2 0/7] btrfs: read-repair rework based on bitmap Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-05-25 10:59 ` [PATCH v2 3/7] btrfs: add a btrfs_map_bio_wait helper Qu Wenruo
@ 2022-05-25 10:59 ` Qu Wenruo
  2022-05-26  3:06   ` Qu Wenruo
  2022-05-25 10:59 ` [PATCH v2 5/7] btrfs: make buffered read path to use the new read repair infrastructure Qu Wenruo
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2022-05-25 10:59 UTC (permalink / raw)
  To: linux-btrfs

The new read repair infrastructure consists of the following 3 parts:

- btrfs_read_repair_ctrl
  Records a contiguous corrupted range.
  Will mostly be an on-stack structure for the top-level endio function.

- btrfs_read_repair_add_sector()
  This function is called each time we hit a bad sector.

  This function itself will check if the bad sector can be merged with
  the existing bad range.

  If not, call btrfs_read_repair_finish() to finish the current range
  first, and then add the new sector into the now empty
  btrfs_read_repair_ctrl.

  Will return -EIO if any range failed to be repaired.

- btrfs_read_repair_finish()
  This function should be called before the endio function exits.

  This function will iterate through all the mirrors, trying to grab
  the correct data.

  If we grabbed a correct sector, we will queue it for later writeback
  into the bad mirror.

To hold the original bad sectors, we have two bios. One is named
@bad_sectors; although it's a bio, we only utilize the bio_vec
infrastructure to hold all the initial bad sectors.

The other one, @io_bio, is what we really utilize to submit new reads
and writes.

For @io_bio, the usage pattern is pretty much the same as in
btrfs_read_repair_add_sector().

If we can merge the target sector, then that's the best case.
If not, then we submit the current @io_bio, wait for it, and allocate a
new bio for the next usage.
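
For reference, the intended calling pattern from an endio function looks
roughly like this (a condensed sketch based on patches 5 and 6, with the
per-sector iteration details omitted):

	struct btrfs_read_repair_ctrl ctrl = { 0 };

	/* For each sector that failed read or csum verification: */
	btrfs_read_repair_add_sector(inode, &ctrl, page, pgoff, logical,
				     file_offset, csum, failed_mirror,
				     is_dio);

	/* Before the endio function exits: */
	ret = btrfs_read_repair_finish(&ctrl);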

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/Makefile      |   2 +-
 fs/btrfs/extent_io.c   |   2 +-
 fs/btrfs/extent_io.h   |   1 +
 fs/btrfs/read-repair.c | 328 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/read-repair.h |  48 ++++++
 5 files changed, 379 insertions(+), 2 deletions(-)
 create mode 100644 fs/btrfs/read-repair.c
 create mode 100644 fs/btrfs/read-repair.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 99f9995670ea..0b2605c750ca 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o tree-mod-log.o
+	   subpage.o tree-mod-log.o read-repair.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1083d6cfa858..160dedb078fd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2735,7 +2735,7 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 		btrfs_subpage_end_reader(fs_info, page, start, len);
 }
 
-static void end_sector_io(struct page *page, u64 offset, bool uptodate)
+void end_sector_io(struct page *page, u64 offset, bool uptodate)
 {
 	struct inode *inode = page->mapping->host;
 	u32 sectorsize = btrfs_sb(inode->i_sb)->sectorsize;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 6cdcea1551a6..e3f9db50983d 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -250,6 +250,7 @@ struct bio *btrfs_bio_alloc(unsigned int nr_iovecs);
 struct bio *btrfs_bio_clone(struct block_device *bdev, struct bio *bio);
 struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size);
 
+void end_sector_io(struct page *page, u64 offset, bool uptodate);
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num);
 
diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
new file mode 100644
index 000000000000..26f59439fc5c
--- /dev/null
+++ b/fs/btrfs/read-repair.c
@@ -0,0 +1,328 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bio.h>
+#include "ctree.h"
+#include "volumes.h"
+#include "read-repair.h"
+#include "btrfs_inode.h"
+
+static int get_next_mirror(int mirror, int num_copies)
+{
+	if (mirror + 1 > num_copies)
+		return mirror + 1 - num_copies;
+	return mirror + 1;
+}
+
+static int get_prev_mirror(int mirror, int num_copies)
+{
+	if (mirror - 1 == 0)
+		return num_copies;
+	return mirror - 1;
+}
+static void init_new_ctrl(struct inode *inode,
+			  struct btrfs_read_repair_ctrl *ctrl,
+			  u64 logical, u64 file_offset, u8 *csum,
+			  int failed_mirror, bool is_dio)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+	ASSERT(!ctrl->inode);
+	ASSERT(!ctrl->logical);
+	ASSERT(!ctrl->bad_sectors);
+	ASSERT(!ctrl->io_bio);
+	ASSERT(!ctrl->len);
+
+	ctrl->inode = inode;
+	ctrl->logical = logical;
+	ctrl->num_copies = btrfs_num_copies(fs_info, logical,
+					    fs_info->sectorsize);
+	ctrl->is_raid56 = btrfs_is_parity_mirror(fs_info, logical,
+						 fs_info->sectorsize);
+	ctrl->is_dio = is_dio;
+	ctrl->file_offset = file_offset;
+	ctrl->failed_mirror = failed_mirror;
+	ctrl->csum = csum;
+	ctrl->bad_sectors = bio_alloc(NULL, BITS_PER_LONG, REQ_OP_READ, GFP_NOFS);
+	ctrl->bad_sectors->bi_iter.bi_sector = logical >> SECTOR_SHIFT;
+
+	ctrl->io_bio = btrfs_bio_alloc(BITS_PER_LONG);
+}
+
+int btrfs_read_repair_add_sector(struct inode *inode,
+				 struct btrfs_read_repair_ctrl *ctrl,
+				 struct page *page, unsigned int pgoff,
+				 u64 logical, u64 file_offset, u8 *csum,
+				 int failed_mirror, bool is_dio)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	const u32 size_limit = BITS_PER_LONG << fs_info->sectorsize_bits;
+	int ret = 0;
+
+	/* Mirror number should not be 0. */
+	ASSERT(failed_mirror);
+
+	/* Not initialized, initialize an empty one. */
+	if (!ctrl->inode) {
+		const int num_copies = btrfs_num_copies(fs_info, logical,
+							fs_info->sectorsize);
+
+		/* No more copies, can not repair. */
+		if (num_copies <= 1) {
+			if (!is_dio)
+				end_sector_io(page, file_offset, false);
+			return -EIO;
+		}
+
+		init_new_ctrl(inode, ctrl, logical, file_offset, csum,
+			      failed_mirror, is_dio);
+	}
+
+	/* Not contiguous with the existing range, finish the current one. */
+	if (ctrl->logical + ctrl->len != logical ||
+	    ctrl->failed_mirror != failed_mirror) {
+		ret = btrfs_read_repair_finish(ctrl);
+		init_new_ctrl(inode, ctrl, logical, file_offset, csum,
+			      failed_mirror, is_dio);
+	}
+
+	/* Contiguous, just add the page into the current bio. */
+	ASSERT(ctrl->bad_sectors);
+	ASSERT(ctrl->len < size_limit);
+	bio_add_page(ctrl->bad_sectors, page, fs_info->sectorsize, pgoff);
+
+	ctrl->len += fs_info->sectorsize;
+	set_bit((logical - ctrl->logical) >> fs_info->sectorsize_bits,
+		&ctrl->bad_bitmap);
+	if (ctrl->len >= size_limit)
+		ret = btrfs_read_repair_finish(ctrl);
+	return ret;
+}
+
+/*
+ * Iterate through a bio on a per-sector basis.
+ */
+#define bio_for_each_sector(fs_info, bvl, bio, iter, bio_offset)	\
+	for ((iter) = bio->bi_iter, (bio_offset) = 0;			\
+	     (iter).bi_size &&						\
+	     (((bvl) = bio_iter_iovec((bio), (iter))), 1);		\
+	     (bio_offset) += fs_info->sectorsize,			\
+	     bio_advance_iter_single(bio, &(iter),			\
+	     (fs_info)->sectorsize))
+
+static void io_bio_submit(struct btrfs_read_repair_ctrl *ctrl, int mirror,
+			  int opf)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	struct bio *io_bio = ctrl->io_bio;
+	u64 io_logical;
+	u32 io_size;
+	int ret;
+
+	ASSERT(io_bio);
+	io_logical = io_bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	io_size = io_bio->bi_iter.bi_size;
+	/* Not yet utilized, just keep it for later usage. */
+	if (io_size == 0) {
+		io_bio->bi_iter.bi_sector = 0;
+		return;
+	}
+
+	io_bio->bi_opf = opf;
+	ret = btrfs_map_bio_wait(fs_info, ctrl->io_bio, mirror);
+	/* Read succeeded, clear the bad bits. */
+	if ((opf & REQ_OP_MASK) == REQ_OP_READ && !ret)
+		bitmap_clear(&ctrl->bad_bitmap,
+			(io_logical - ctrl->logical) >> fs_info->sectorsize_bits,
+			io_size >> fs_info->sectorsize_bits);
+	ctrl->io_bio = btrfs_bio_alloc(BITS_PER_LONG);
+	/* Set bi_sector to 0 to mark the new bio as uninitialized. */
+	ctrl->io_bio->bi_iter.bi_sector = 0;
+}
+
+static void io_add_or_submit(struct btrfs_read_repair_ctrl *ctrl, int mirror,
+			   u64 logical, struct page *page, unsigned int pgoff,
+			   int opf)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	struct bio *io_bio = ctrl->io_bio;
+
+	/* Uninitialized. */
+	if (io_bio->bi_iter.bi_sector == 0) {
+		ASSERT(io_bio->bi_iter.bi_size == 0);
+		io_bio->bi_iter.bi_sector = logical >> SECTOR_SHIFT;
+		io_bio->bi_opf = opf;
+		bio_add_page(io_bio, page, fs_info->sectorsize, pgoff);
+		return;
+	}
+
+	/* Contiguous, add the page. */
+	if ((io_bio->bi_iter.bi_sector << SECTOR_SHIFT) +
+	     io_bio->bi_iter.bi_size == logical) {
+		bio_add_page(io_bio, page, fs_info->sectorsize, pgoff);
+		return;
+	}
+
+	/* Not contiguous, submit first. */
+	io_bio_submit(ctrl, mirror, opf);
+	io_bio = ctrl->io_bio;
+	io_bio->bi_iter.bi_sector = logical >> SECTOR_SHIFT;
+	bio_add_page(io_bio, page, fs_info->sectorsize, pgoff);
+}
+
+static void writeback_good_mirror(struct btrfs_read_repair_ctrl *ctrl,
+				  int mirror, u64 logical,
+				  struct page *page, unsigned int pgoff)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	struct bio *io_bio = ctrl->io_bio;
+
+
+	if (btrfs_repair_one_zone(fs_info, ctrl->logical))
+		return;
+
+	/*
+	 * For RAID56, we can not just write the bad data back, as
+	 * any write will trigger RMW and read back the corrupted
+	 * on-disk stripe, causing further damage.
+	 * So here we do special repair for raid56.
+	 *
+	 * And unfortunately, this repair is very low level and not
+	 * compatible with the rest of the mirror based repair.
+	 * So it's still done in synchronous mode using
+	 * btrfs_repair_io_failure().
+	 */
+	if (ctrl->is_raid56) {
+		const u64 file_offset = logical - ctrl->logical +
+					ctrl->file_offset;
+		btrfs_repair_io_failure(fs_info,
+				btrfs_ino(BTRFS_I(ctrl->inode)), file_offset,
+				fs_info->sectorsize, logical, page, pgoff,
+				mirror);
+		return;
+	}
+
+	ASSERT(io_bio);
+	io_add_or_submit(ctrl, mirror, logical, page, pgoff, REQ_OP_WRITE);
+}
+
+static void repair_from_mirror(struct btrfs_read_repair_ctrl *ctrl, int mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	unsigned long old_bitmap = ctrl->bad_bitmap;
+	const int prev_mirror = get_prev_mirror(mirror, ctrl->num_copies);
+	int nr_sector;
+	u32 offset;
+	int ret;
+
+	/*
+	 * Reset the io_bio logical bytenr so that later io_add_or_submit()
+	 * can do the correct check on the logical bytenr.
+	 */
+	ctrl->io_bio->bi_iter.bi_sector = 0;
+
+	/* Add all bad sectors into io_bio. */
+	bio_for_each_sector(fs_info, bv, ctrl->bad_sectors, iter, offset) {
+		u64 logical = ctrl->logical + offset;
+
+		nr_sector = offset >> fs_info->sectorsize_bits;
+
+		/* Good sector, no need to handle. */
+		if (!test_bit(nr_sector, &ctrl->bad_bitmap))
+			continue;
+
+		io_add_or_submit(ctrl, mirror, logical, bv.bv_page,
+				 bv.bv_offset, REQ_OP_READ | REQ_SYNC);
+	}
+	io_bio_submit(ctrl, mirror, REQ_OP_READ | REQ_SYNC);
+
+	/* Check the newly read data. */
+	bio_for_each_sector(fs_info, bv, ctrl->bad_sectors, iter, offset) {
+		u8 *csum_expected;
+		u8 csum[BTRFS_CSUM_SIZE];
+
+		nr_sector = offset >> fs_info->sectorsize_bits;
+
+		/* Originally good sector or read failed, skip. */
+		if (!test_bit(nr_sector, &old_bitmap) ||
+		    test_bit(nr_sector, &ctrl->bad_bitmap))
+			continue;
+
+		/* No data csum, only need to repair. */
+		if (!ctrl->csum)
+			goto repair;
+
+		/*
+		 * The remaining case is successful read with csum, need
+		 * recheck the csum.
+		 */
+		csum_expected = btrfs_csum_ptr(fs_info, ctrl->csum, offset);
+		ret = btrfs_check_sector_csum(fs_info, bv.bv_page,
+				bv.bv_offset, csum, csum_expected);
+		if (ret) {
+			set_bit(nr_sector, &ctrl->bad_bitmap);
+			continue;
+		}
+repair:
+		/*
+		 * This sector is properly fixed, write it back to previous
+		 * bad mirror.
+		 */
+		writeback_good_mirror(ctrl, prev_mirror, ctrl->logical + offset,
+				bv.bv_page, bv.bv_offset);
+	}
+	/* Submit the last write bio. */
+	io_bio_submit(ctrl, prev_mirror, REQ_OP_WRITE);
+}
+
+int btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
+{
+	struct btrfs_fs_info *fs_info;
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	u32 offset;
+	int nr_sectors;
+	int mirror;
+	int ret = -EIO;
+
+	if (!ctrl->inode)
+		return 0;
+
+	fs_info = btrfs_sb(ctrl->inode->i_sb);
+	nr_sectors = ctrl->len >> fs_info->sectorsize_bits;
+	ASSERT(ctrl->len);
+	/* All sectors should be bad initially. */
+	ASSERT(find_first_zero_bit(&ctrl->bad_bitmap, nr_sectors) == nr_sectors);
+
+	for (mirror = get_next_mirror(ctrl->failed_mirror, ctrl->num_copies);
+	     mirror != ctrl->failed_mirror;
+	     mirror = get_next_mirror(mirror, ctrl->num_copies)) {
+		repair_from_mirror(ctrl, mirror);
+
+		/* All repaired. */
+		if (find_first_bit(&ctrl->bad_bitmap, nr_sectors) == nr_sectors) {
+			ret = 0;
+			break;
+		}
+	}
+
+	/* DIO doesn't need any page status/extent update. */
+	if (!ctrl->is_dio) {
+		/* Unlock all the pages and unlock the extent range. */
+		bio_for_each_sector(fs_info, bv, ctrl->bad_sectors, iter,
+				    offset) {
+			bool uptodate = !test_bit(offset >>
+						  fs_info->sectorsize_bits,
+						  &ctrl->bad_bitmap);
+
+			end_sector_io(bv.bv_page, ctrl->file_offset + offset,
+				      uptodate);
+		}
+	}
+	bio_put(ctrl->bad_sectors);
+	if (ctrl->io_bio)
+		bio_put(ctrl->io_bio);
+	memset(ctrl, 0, sizeof(*ctrl));
+	return ret;
+}
diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
new file mode 100644
index 000000000000..87219c786109
--- /dev/null
+++ b/fs/btrfs/read-repair.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef BTRFS_READ_REPAIR_H
+#define BTRFS_READ_REPAIR_H
+
+#include <linux/blk_types.h>
+#include <linux/fs.h>
+
+struct btrfs_read_repair_ctrl {
+	struct inode *inode;
+
+	/* The logical bytenr of the first corrupted sector. */
+	u64 logical;
+
+	/* The file offset of the first corrupted sector. */
+	u64 file_offset;
+
+	/* The checksum for the corrupted sectors. */
+	u8 *csum;
+
+	/* Current length of the corrupted range. */
+	u32 len;
+
+	int failed_mirror;
+	int num_copies;
+	unsigned long bad_bitmap;
+	bool is_raid56;
+	bool is_dio;
+
+	/* This is only to hold all the initial bad contiguous sectors. */
+	struct bio *bad_sectors;
+
+	/*
+	 * The bio we use to do the real IO.
+	 * This bio has to be a btrfs_bio, as btrfs_map_bio() will utilize
+	 * btrfs_bio()->device.
+	 */
+	struct bio *io_bio;
+};
+
+int btrfs_read_repair_add_sector(struct inode *inode,
+				 struct btrfs_read_repair_ctrl *ctrl,
+				 struct page *page, unsigned int pgoff,
+				 u64 logical, u64 file_offset, u8 *csum,
+				 int failed_mirror, bool is_dio);
+int btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl);
+
+#endif
-- 
2.36.1



* [PATCH v2 5/7] btrfs: make buffered read path to use the new read repair infrastructure
  2022-05-25 10:59 [PATCH v2 0/7] btrfs: read-repair rework based on bitmap Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-05-25 10:59 ` [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure Qu Wenruo
@ 2022-05-25 10:59 ` Qu Wenruo
  2022-05-25 10:59 ` [PATCH v2 6/7] btrfs: make direct io " Qu Wenruo
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2022-05-25 10:59 UTC (permalink / raw)
  To: linux-btrfs

We only need to prepare the logical bytenr and file offset of the
corrupted sector, and pass them to btrfs_read_repair_add_sector().

Then call btrfs_read_repair_finish() before we free the csum.
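
The core of the per-sector change is the following (taken from the diff
below):

	u64 logical = (failed_bbio->iter.bi_sector << SECTOR_SHIFT) +
		      bio_offset + offset;
	u8 *csum = NULL;

	if (failed_bbio->csum)
		csum = btrfs_csum_ptr(fs_info, failed_bbio->csum,
				      bio_offset + offset);

	/* The function will release the page on failure. */
	btrfs_read_repair_add_sector(inode, ctrl, page, pgoff + offset,
				     logical, start + offset, csum,
				     failed_mirror, false);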

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 44 ++++++++++++++++++++------------------------
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 160dedb078fd..c14699b5758b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -30,6 +30,7 @@
 #include "zoned.h"
 #include "block-group.h"
 #include "compression.h"
+#include "read-repair.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -2749,13 +2750,15 @@ void end_sector_io(struct page *page, u64 offset, bool uptodate)
 			offset + sectorsize - 1, &cached);
 }
 
-static void submit_data_read_repair(struct inode *inode, struct bio *failed_bio,
+static void submit_data_read_repair(struct inode *inode,
+		struct btrfs_read_repair_ctrl *ctrl, struct bio *failed_bio,
 		u32 bio_offset, const struct bio_vec *bvec, int failed_mirror,
 		unsigned int error_bitmap)
 {
 	const unsigned int pgoff = bvec->bv_offset;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct page *page = bvec->bv_page;
+	struct btrfs_bio *failed_bbio = btrfs_bio(failed_bio);
 	const u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
 	const u64 end = start + bvec->bv_len - 1;
 	const u32 sectorsize = fs_info->sectorsize;
@@ -2780,7 +2783,9 @@ static void submit_data_read_repair(struct inode *inode, struct bio *failed_bio,
 	for (i = 0; i < nr_bits; i++) {
 		const unsigned int offset = i * sectorsize;
 		bool uptodate = false;
-		int ret;
+		u64 logical = (failed_bbio->iter.bi_sector << SECTOR_SHIFT) +
+			      bio_offset + offset;
+		u8 *csum = NULL;
 
 		if (!(error_bitmap & (1U << i))) {
 			/*
@@ -2788,28 +2793,17 @@ static void submit_data_read_repair(struct inode *inode, struct bio *failed_bio,
 			 * and unlock the range.
 			 */
 			uptodate = true;
-			goto next;
-		}
-
-		ret = btrfs_repair_one_sector(inode, failed_bio,
-				bio_offset + offset,
-				page, pgoff + offset, start + offset,
-				failed_mirror, btrfs_submit_data_bio);
-		if (!ret) {
-			/*
-			 * We have submitted the read repair, the page release
-			 * will be handled by the endio function of the
-			 * submitted repair bio.
-			 * Thus we don't need to do any thing here.
-			 */
+			end_sector_io(page, start + offset, uptodate);
 			continue;
 		}
-		/*
-		 * Continue on failed repair, otherwise the remaining sectors
-		 * will not be properly unlocked.
-		 */
-next:
-		end_sector_io(page, start + offset, uptodate);
+		if (failed_bbio->csum)
+			csum = btrfs_csum_ptr(fs_info, failed_bbio->csum,
+					      bio_offset + offset);
+
+		/* The function will release the page on failure. */
+		btrfs_read_repair_add_sector(inode, ctrl, page, pgoff + offset,
+					     logical, start + offset, csum,
+					     failed_mirror, false);
 	}
 }
 
@@ -3017,6 +3011,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	struct extent_io_tree *tree, *failure_tree;
 	struct processed_extent processed = { 0 };
+	struct btrfs_read_repair_ctrl ctrl = { 0 };
 	/*
 	 * The offset to the beginning of a bio, since one bio can never be
 	 * larger than UINT_MAX, u32 here is enough.
@@ -3126,8 +3121,8 @@ static void end_bio_extent_readpage(struct bio *bio)
 			 * submit_data_read_repair() will handle all the good
 			 * and bad sectors, we just continue to the next bvec.
 			 */
-			submit_data_read_repair(inode, bio, bio_offset, bvec,
-						mirror, error_bitmap);
+			submit_data_read_repair(inode, &ctrl, bio, bio_offset,
+						bvec, mirror, error_bitmap);
 		} else {
 			/* Update page status and unlock */
 			end_page_read(page, uptodate, start, len);
@@ -3142,6 +3137,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 	}
 	/* Release the last extent */
 	endio_readpage_release_extent(&processed, NULL, 0, 0, false);
+	btrfs_read_repair_finish(&ctrl);
 	btrfs_bio_free_csum(bbio);
 	bio_put(bio);
 }
-- 
2.36.1



* [PATCH v2 6/7] btrfs: make direct io read path to use the new read repair infrastructure
  2022-05-25 10:59 [PATCH v2 0/7] btrfs: read-repair rework based on bitmap Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-05-25 10:59 ` [PATCH v2 5/7] btrfs: make buffered read path to use the new read repair infrastructure Qu Wenruo
@ 2022-05-25 10:59 ` Qu Wenruo
  2022-05-25 10:59 ` [PATCH v2 7/7] btrfs: remove io_failure_record infrastructure completely Qu Wenruo
  2022-05-25 14:55 ` [PATCH v2 0/7] btrfs: read-repair rework based on bitmap David Sterba
  7 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2022-05-25 10:59 UTC (permalink / raw)
  To: linux-btrfs

By doing this, we can also remove the function submit_dio_repair_bio().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 51 +++++++++++++++++-------------------------------
 1 file changed, 18 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index dd0882e1b982..26718ec1e07b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -55,6 +55,7 @@
 #include "zoned.h"
 #include "subpage.h"
 #include "inode-item.h"
+#include "read-repair.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -7863,58 +7864,42 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 	bio_endio(&dip->bio);
 }
 
-static void submit_dio_repair_bio(struct inode *inode, struct bio *bio,
-				  int mirror_num,
-				  enum btrfs_compression_type compress_type)
-{
-	struct btrfs_dio_private *dip = bio->bi_private;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-
-	BUG_ON(bio_op(bio) == REQ_OP_WRITE);
-
-	if (btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA))
-		return;
-
-	refcount_inc(&dip->refs);
-	if (btrfs_map_bio(fs_info, bio, mirror_num))
-		refcount_dec(&dip->refs);
-}
-
 static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 					     struct btrfs_bio *bbio,
 					     const bool uptodate)
 {
 	struct inode *inode = dip->inode;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct btrfs_read_repair_ctrl ctrl = {0};
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
 	blk_status_t err = BLK_STS_OK;
 	struct bvec_iter iter;
 	struct bio_vec bv;
+	int ret;
 	u32 offset;
 
 	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
 		u64 start = bbio->file_offset + offset;
 
-		if (uptodate &&
-		    (!csum || !check_data_csum(inode, bbio, offset, bv.bv_page,
-				bv.bv_offset, start))) {
-			clean_io_failure(fs_info, failure_tree, io_tree, start,
-					 bv.bv_page, btrfs_ino(BTRFS_I(inode)),
-					 bv.bv_offset);
-		} else {
-			int ret;
-
-			ret = btrfs_repair_one_sector(inode, &bbio->bio, offset,
-					bv.bv_page, bv.bv_offset, start,
-					bbio->mirror_num,
-					submit_dio_repair_bio);
+		if (!uptodate ||
+		    (csum && check_data_csum(inode, bbio, offset, bv.bv_page,
+					     bv.bv_offset, start))) {
+			u64 logical = (bbio->iter.bi_sector << SECTOR_SHIFT) +
+				      offset;
+			u8 *csum = NULL;
+
+			if (bbio->csum)
+				csum = btrfs_csum_ptr(fs_info, bbio->csum, offset);
+			ret = btrfs_read_repair_add_sector(inode, &ctrl,
+					bv.bv_page, bv.bv_offset, logical,
+					start, csum, bbio->mirror_num, true);
 			if (ret)
 				err = errno_to_blk_status(ret);
 		}
 	}
-
+	ret = btrfs_read_repair_finish(&ctrl);
+	if (ret)
+		err = errno_to_blk_status(ret);
 	return err;
 }
 
-- 
2.36.1



* [PATCH v2 7/7] btrfs: remove io_failure_record infrastructure completely
  2022-05-25 10:59 [PATCH v2 0/7] btrfs: read-repair rework based on bitmap Qu Wenruo
                   ` (5 preceding siblings ...)
  2022-05-25 10:59 ` [PATCH v2 6/7] btrfs: make direct io " Qu Wenruo
@ 2022-05-25 10:59 ` Qu Wenruo
  2022-05-25 14:55 ` [PATCH v2 0/7] btrfs: read-repair rework based on bitmap David Sterba
  7 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2022-05-25 10:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Christoph Hellwig

Our read repair is now always handled by btrfs_read_repair_ctrl, whose
lifespan is confined to the endio function.

This means we no longer need to record which range failed and its
mirror number.

Now if we fail to read some data page, we have already tried every
mirror we have, thus there is no need to record the failed range.

So this patch removes the whole io_failure_record structure and its
related functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/btrfs_inode.h       |   5 -
 fs/btrfs/extent-io-tree.h    |  15 --
 fs/btrfs/extent_io.c         | 372 -----------------------------------
 fs/btrfs/extent_io.h         |  24 ---
 fs/btrfs/inode.c             |   7 -
 include/trace/events/btrfs.h |   1 -
 6 files changed, 424 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 33811e896623..3eeba0eb9f16 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -91,11 +91,6 @@ struct btrfs_inode {
 	/* the io_tree does range state (DIRTY, LOCKED etc) */
 	struct extent_io_tree io_tree;
 
-	/* special utility tree used to record which mirrors have already been
-	 * tried when checksums fail for a given block
-	 */
-	struct extent_io_tree io_failure_tree;
-
 	/*
 	 * Keep track of where the inode has extent items mapped in order to
 	 * make sure the i_size adjustments are accurate
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index c3eb52dbe61c..8ab9b6cd53ed 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -56,7 +56,6 @@ enum {
 	IO_TREE_FS_EXCLUDED_EXTENTS,
 	IO_TREE_BTREE_INODE_IO,
 	IO_TREE_INODE_IO,
-	IO_TREE_INODE_IO_FAILURE,
 	IO_TREE_RELOC_BLOCKS,
 	IO_TREE_TRANS_DIRTY_PAGES,
 	IO_TREE_ROOT_DIRTY_LOG_PAGES,
@@ -250,18 +249,4 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 			       u64 *end, u64 max_bytes,
 			       struct extent_state **cached_state);
 
-/* This should be reworked in the future and put elsewhere. */
-struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start);
-int set_state_failrec(struct extent_io_tree *tree, u64 start,
-		      struct io_failure_record *failrec);
-void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start,
-		u64 end);
-int free_io_failure(struct extent_io_tree *failure_tree,
-		    struct extent_io_tree *io_tree,
-		    struct io_failure_record *rec);
-int clean_io_failure(struct btrfs_fs_info *fs_info,
-		     struct extent_io_tree *failure_tree,
-		     struct extent_io_tree *io_tree, u64 start,
-		     struct page *page, u64 ino, unsigned int pg_offset);
-
 #endif /* BTRFS_EXTENT_IO_TREE_H */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c14699b5758b..893ee81c1dfd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2172,66 +2172,6 @@ u64 count_range_bits(struct extent_io_tree *tree,
 	return total_bytes;
 }
 
-/*
- * set the private field for a given byte offset in the tree.  If there isn't
- * an extent_state there already, this does nothing.
- */
-int set_state_failrec(struct extent_io_tree *tree, u64 start,
-		      struct io_failure_record *failrec)
-{
-	struct rb_node *node;
-	struct extent_state *state;
-	int ret = 0;
-
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state->failrec = failrec;
-out:
-	spin_unlock(&tree->lock);
-	return ret;
-}
-
-struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start)
-{
-	struct rb_node *node;
-	struct extent_state *state;
-	struct io_failure_record *failrec;
-
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		failrec = ERR_PTR(-ENOENT);
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		failrec = ERR_PTR(-ENOENT);
-		goto out;
-	}
-
-	failrec = state->failrec;
-out:
-	spin_unlock(&tree->lock);
-	return failrec;
-}
-
 /*
  * searches a range in the state tree for a given mask.
  * If 'filled' == 1, this returns 1 only if every extent in the tree
@@ -2288,30 +2228,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	return bitset;
 }
 
-int free_io_failure(struct extent_io_tree *failure_tree,
-		    struct extent_io_tree *io_tree,
-		    struct io_failure_record *rec)
-{
-	int ret;
-	int err = 0;
-
-	set_state_failrec(failure_tree, rec->start, NULL);
-	ret = clear_extent_bits(failure_tree, rec->start,
-				rec->start + rec->len - 1,
-				EXTENT_LOCKED | EXTENT_DIRTY);
-	if (ret)
-		err = ret;
-
-	ret = clear_extent_bits(io_tree, rec->start,
-				rec->start + rec->len - 1,
-				EXTENT_DAMAGED);
-	if (ret && !err)
-		err = ret;
-
-	kfree(rec);
-	return err;
-}
-
 /*
  * this bypasses the standard btrfs submit functions deliberately, as
  * the standard behavior is to write all copies in a raid setup. here we only
@@ -2427,287 +2343,6 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
 	return ret;
 }
 
-/*
- * each time an IO finishes, we do a fast check in the IO failure tree
- * to see if we need to process or clean up an io_failure_record
- */
-int clean_io_failure(struct btrfs_fs_info *fs_info,
-		     struct extent_io_tree *failure_tree,
-		     struct extent_io_tree *io_tree, u64 start,
-		     struct page *page, u64 ino, unsigned int pg_offset)
-{
-	u64 private;
-	struct io_failure_record *failrec;
-	struct extent_state *state;
-	int num_copies;
-	int ret;
-
-	private = 0;
-	ret = count_range_bits(failure_tree, &private, (u64)-1, 1,
-			       EXTENT_DIRTY, 0);
-	if (!ret)
-		return 0;
-
-	failrec = get_state_failrec(failure_tree, start);
-	if (IS_ERR(failrec))
-		return 0;
-
-	BUG_ON(!failrec->this_mirror);
-
-	if (sb_rdonly(fs_info->sb))
-		goto out;
-
-	spin_lock(&io_tree->lock);
-	state = find_first_extent_bit_state(io_tree,
-					    failrec->start,
-					    EXTENT_LOCKED);
-	spin_unlock(&io_tree->lock);
-
-	if (state && state->start <= failrec->start &&
-	    state->end >= failrec->start + failrec->len - 1) {
-		num_copies = btrfs_num_copies(fs_info, failrec->logical,
-					      failrec->len);
-		if (num_copies > 1)  {
-			btrfs_repair_io_failure(fs_info, ino, start,
-					failrec->len, failrec->logical,
-					page, pg_offset, failrec->failed_mirror);
-		}
-	}
-
-out:
-	free_io_failure(failure_tree, io_tree, failrec);
-
-	return 0;
-}
-
-/*
- * Can be called when
- * - hold extent lock
- * - under ordered extent
- * - the inode is freeing
- */
-void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
-{
-	struct extent_io_tree *failure_tree = &inode->io_failure_tree;
-	struct io_failure_record *failrec;
-	struct extent_state *state, *next;
-
-	if (RB_EMPTY_ROOT(&failure_tree->state))
-		return;
-
-	spin_lock(&failure_tree->lock);
-	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
-	while (state) {
-		if (state->start > end)
-			break;
-
-		ASSERT(state->end <= end);
-
-		next = next_state(state);
-
-		failrec = state->failrec;
-		free_extent_state(state);
-		kfree(failrec);
-
-		state = next;
-	}
-	spin_unlock(&failure_tree->lock);
-}
-
-static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
-							     u64 start)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct io_failure_record *failrec;
-	struct extent_map *em;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
-	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	const u32 sectorsize = fs_info->sectorsize;
-	int ret;
-	u64 logical;
-
-	failrec = get_state_failrec(failure_tree, start);
-	if (!IS_ERR(failrec)) {
-		btrfs_debug(fs_info,
-	"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu",
-			failrec->logical, failrec->start, failrec->len);
-		/*
-		 * when data can be on disk more than twice, add to failrec here
-		 * (e.g. with a list for failed_mirror) to make
-		 * clean_io_failure() clean all those errors at once.
-		 */
-
-		return failrec;
-	}
-
-	failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
-	if (!failrec)
-		return ERR_PTR(-ENOMEM);
-
-	failrec->start = start;
-	failrec->len = sectorsize;
-	failrec->this_mirror = 0;
-	failrec->compress_type = BTRFS_COMPRESS_NONE;
-
-	read_lock(&em_tree->lock);
-	em = lookup_extent_mapping(em_tree, start, failrec->len);
-	if (!em) {
-		read_unlock(&em_tree->lock);
-		kfree(failrec);
-		return ERR_PTR(-EIO);
-	}
-
-	if (em->start > start || em->start + em->len <= start) {
-		free_extent_map(em);
-		em = NULL;
-	}
-	read_unlock(&em_tree->lock);
-	if (!em) {
-		kfree(failrec);
-		return ERR_PTR(-EIO);
-	}
-
-	logical = start - em->start;
-	logical = em->block_start + logical;
-	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
-		logical = em->block_start;
-		failrec->compress_type = em->compress_type;
-	}
-
-	btrfs_debug(fs_info,
-		    "Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu",
-		    logical, start, failrec->len);
-
-	failrec->logical = logical;
-	free_extent_map(em);
-
-	/* Set the bits in the private failure tree */
-	ret = set_extent_bits(failure_tree, start, start + sectorsize - 1,
-			      EXTENT_LOCKED | EXTENT_DIRTY);
-	if (ret >= 0) {
-		ret = set_state_failrec(failure_tree, start, failrec);
-		/* Set the bits in the inode's tree */
-		ret = set_extent_bits(tree, start, start + sectorsize - 1,
-				      EXTENT_DAMAGED);
-	} else if (ret < 0) {
-		kfree(failrec);
-		return ERR_PTR(ret);
-	}
-
-	return failrec;
-}
-
-static bool btrfs_check_repairable(struct inode *inode,
-				   struct io_failure_record *failrec,
-				   int failed_mirror)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	int num_copies;
-
-	num_copies = btrfs_num_copies(fs_info, failrec->logical, failrec->len);
-	if (num_copies == 1) {
-		/*
-		 * we only have a single copy of the data, so don't bother with
-		 * all the retry and error correction code that follows. no
-		 * matter what the error is, it is very likely to persist.
-		 */
-		btrfs_debug(fs_info,
-			"Check Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return false;
-	}
-
-	/* The failure record should only contain one sector */
-	ASSERT(failrec->len == fs_info->sectorsize);
-
-	/*
-	 * There are two premises:
-	 * a) deliver good data to the caller
-	 * b) correct the bad sectors on disk
-	 *
-	 * Since we're only doing repair for one sector, we only need to get
-	 * a good copy of the failed sector and if we succeed, we have setup
-	 * everything for btrfs_repair_io_failure() to do the rest for us.
-	 */
-	ASSERT(failed_mirror);
-	failrec->failed_mirror = failed_mirror;
-	failrec->this_mirror++;
-	if (failrec->this_mirror == failed_mirror)
-		failrec->this_mirror++;
-
-	if (failrec->this_mirror > num_copies) {
-		btrfs_debug(fs_info,
-			"Check Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return false;
-	}
-
-	return true;
-}
-
-int btrfs_repair_one_sector(struct inode *inode,
-			    struct bio *failed_bio, u32 bio_offset,
-			    struct page *page, unsigned int pgoff,
-			    u64 start, int failed_mirror,
-			    submit_bio_hook_t *submit_bio_hook)
-{
-	struct io_failure_record *failrec;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct btrfs_bio *failed_bbio = btrfs_bio(failed_bio);
-	const int icsum = bio_offset >> fs_info->sectorsize_bits;
-	struct bio *repair_bio;
-	struct btrfs_bio *repair_bbio;
-
-	btrfs_debug(fs_info,
-		   "repair read error: read error at %llu", start);
-
-	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
-
-	failrec = btrfs_get_io_failure_record(inode, start);
-	if (IS_ERR(failrec))
-		return PTR_ERR(failrec);
-
-
-	if (!btrfs_check_repairable(inode, failrec, failed_mirror)) {
-		free_io_failure(failure_tree, tree, failrec);
-		return -EIO;
-	}
-
-	repair_bio = btrfs_bio_alloc(1);
-	repair_bbio = btrfs_bio(repair_bio);
-	repair_bbio->file_offset = start;
-	repair_bio->bi_opf = REQ_OP_READ;
-	repair_bio->bi_end_io = failed_bio->bi_end_io;
-	repair_bio->bi_iter.bi_sector = failrec->logical >> 9;
-	repair_bio->bi_private = failed_bio->bi_private;
-
-	if (failed_bbio->csum) {
-		const u32 csum_size = fs_info->csum_size;
-
-		repair_bbio->csum = repair_bbio->csum_inline;
-		memcpy(repair_bbio->csum,
-		       failed_bbio->csum + csum_size * icsum, csum_size);
-	}
-
-	bio_add_page(repair_bio, page, failrec->len, pgoff);
-	repair_bbio->iter = repair_bio->bi_iter;
-
-	btrfs_debug(btrfs_sb(inode->i_sb),
-		    "repair read error: submitting new read to mirror %d",
-		    failrec->this_mirror);
-
-	/*
-	 * At this point we have a bio, so any errors from submit_bio_hook()
-	 * will be handled by the endio on the repair_bio, so we can't return an
-	 * error here.
-	 */
-	submit_bio_hook(inode, repair_bio, failrec->this_mirror, failrec->compress_type);
-	return BLK_STS_OK;
-}
-
 static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
@@ -3009,7 +2644,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 {
 	struct bio_vec *bvec;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
-	struct extent_io_tree *tree, *failure_tree;
 	struct processed_extent processed = { 0 };
 	struct btrfs_read_repair_ctrl ctrl = { 0 };
 	/*
@@ -3037,8 +2671,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 			"end_bio_extent_readpage: bi_sector=%llu, err=%d, mirror=%u",
 			bio->bi_iter.bi_sector, bio->bi_status,
 			bbio->mirror_num);
-		tree = &BTRFS_I(inode)->io_tree;
-		failure_tree = &BTRFS_I(inode)->io_failure_tree;
 
 		/*
 		 * We always issue full-sector reads, but if some block in a
@@ -3079,10 +2711,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 			loff_t i_size = i_size_read(inode);
 			pgoff_t end_index = i_size >> PAGE_SHIFT;
 
-			clean_io_failure(BTRFS_I(inode)->root->fs_info,
-					 failure_tree, tree, start, page,
-					 btrfs_ino(BTRFS_I(inode)), 0);
-
 			/*
 			 * Zero out the remaining part if this range straddles
 			 * i_size.
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index e3f9db50983d..b64e8cf81405 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -61,7 +61,6 @@ struct btrfs_root;
 struct btrfs_inode;
 struct btrfs_io_bio;
 struct btrfs_fs_info;
-struct io_failure_record;
 struct extent_io_tree;
 
 typedef void (submit_bio_hook_t)(struct inode *inode, struct bio *bio,
@@ -254,29 +253,6 @@ void end_sector_io(struct page *page, u64 offset, bool uptodate);
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num);
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the sector is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-	struct page *page;
-	u64 start;
-	u64 len;
-	u64 logical;
-	enum btrfs_compression_type compress_type;
-	int this_mirror;
-	int failed_mirror;
-};
-
-int btrfs_repair_one_sector(struct inode *inode,
-			    struct bio *failed_bio, u32 bio_offset,
-			    struct page *page, unsigned int pgoff,
-			    u64 start, int failed_mirror,
-			    submit_bio_hook_t *submit_bio_hook);
 int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 			    u64 length, u64 logical, struct page *page,
 			    unsigned int pg_offset, int mirror_num);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 26718ec1e07b..466f3359f95f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3143,8 +3143,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 					ordered_extent->disk_num_bytes);
 	}
 
-	btrfs_free_io_failure_record(inode, start, end);
-
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
 		logical_len = ordered_extent->truncated_len;
@@ -5355,8 +5353,6 @@ void btrfs_evict_inode(struct inode *inode)
 	if (is_bad_inode(inode))
 		goto no_delete;
 
-	btrfs_free_io_failure_record(BTRFS_I(inode), 0, (u64)-1);
-
 	if (test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags))
 		goto no_delete;
 
@@ -8847,12 +8843,9 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	inode = &ei->vfs_inode;
 	extent_map_tree_init(&ei->extent_tree);
 	extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO, inode);
-	extent_io_tree_init(fs_info, &ei->io_failure_tree,
-			    IO_TREE_INODE_IO_FAILURE, inode);
 	extent_io_tree_init(fs_info, &ei->file_extent_tree,
 			    IO_TREE_INODE_FILE_EXTENT, inode);
 	ei->io_tree.track_uptodate = true;
-	ei->io_failure_tree.track_uptodate = true;
 	atomic_set(&ei->sync_writers, 0);
 	mutex_init(&ei->log_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 290f07eb050a..764e9643c123 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -82,7 +82,6 @@ struct btrfs_space_info;
 	EM( IO_TREE_FS_EXCLUDED_EXTENTS,  "EXCLUDED_EXTENTS")	    \
 	EM( IO_TREE_BTREE_INODE_IO,	  "BTREE_INODE_IO")	    \
 	EM( IO_TREE_INODE_IO,		  "INODE_IO")		    \
-	EM( IO_TREE_INODE_IO_FAILURE,	  "INODE_IO_FAILURE")	    \
 	EM( IO_TREE_RELOC_BLOCKS,	  "RELOC_BLOCKS")	    \
 	EM( IO_TREE_TRANS_DIRTY_PAGES,	  "TRANS_DIRTY_PAGES")      \
 	EM( IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES")   \
-- 
2.36.1



* Re: [PATCH v2 0/7] btrfs: read-repair rework based on bitmap
  2022-05-25 10:59 [PATCH v2 0/7] btrfs: read-repair rework based on bitmap Qu Wenruo
                   ` (6 preceding siblings ...)
  2022-05-25 10:59 ` [PATCH v2 7/7] btrfs: remove io_failure_record infrastructure completely Qu Wenruo
@ 2022-05-25 14:55 ` David Sterba
  7 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2022-05-25 14:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Christoph Hellwig

On Wed, May 25, 2022 at 06:59:10PM +0800, Qu Wenruo wrote:
> This is the bitmap version revivied, and based on Christoph's
> cleanup series.

So this is another patchset for raid repair, it may take me some time
to understand the different approaches and discussions.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-25 10:59 ` [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure Qu Wenruo
@ 2022-05-26  3:06   ` Qu Wenruo
  2022-05-26  7:30     ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2022-05-26  3:06 UTC (permalink / raw)
  To: linux-btrfs, Christoph Hellwig



On 2022/5/25 18:59, Qu Wenruo wrote:
> The new read repair infrastructure consists of the following 3 parts:
> 
[...]
> +static void io_add_or_submit(struct btrfs_read_repair_ctrl *ctrl, int mirror,
> +			   u64 logical, struct page *page, unsigned int pgoff,
> +			   int opf)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
> +	struct bio *io_bio = ctrl->io_bio;
> +
> +	/* Uninitialized. */
> +	if (io_bio->bi_iter.bi_sector == 0) {
> +		ASSERT(io_bio->bi_iter.bi_size == 0);
> +		io_bio->bi_iter.bi_sector = logical >> SECTOR_SHIFT;
> +		io_bio->bi_opf = opf;
> +		bio_add_page(io_bio, page, fs_info->sectorsize, pgoff);
> +		return;
> +	}
> +
> +	/* Continuous, add the page */
> +	if ((io_bio->bi_iter.bi_sector << SECTOR_SHIFT) +
> +	     io_bio->bi_iter.bi_size == logical) {
> +		bio_add_page(io_bio, page, fs_info->sectorsize, pgoff);
> +		return;
> +	}
> +
> +	/* Not continuous, submit first. */

Hi Christoph, I'm pretty sure the non-continuous bio problem is present in
all of our attempts to rework read-repair.

I'm wondering if there is some "dummy" page provided by the block layer
that we can utilize?

E.g. we have the following checker pattern:

mirror 1	|X|X|X|X|
mirror 2	|X| |X| |
mirror 3	| |X| |X|

After reading all 4 sectors from mirror 2, we know the 2nd and 4th
are good and do not need to be re-read.

Then reading from mirror 3 requires us to submit two bios.

But if we had some "dummy" pages added into the bio for sectors 2
and 4, we would only need one bio submission.
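
In bitmap terms (a labeling assumption: bit n stands for sector n + 1),
the range above starts as bad_bitmap = 0b1111; the pass over mirror 2
clears the bits for sectors 2 and 4, leaving 0b0101, so only sectors 1
and 3 still need to be read from mirror 3, in two separate bios unless
the gap between them is filled.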

Is there such a convenient page for us to utilize? Or do we have to
allocate one globally?

Thanks,
Qu

> +	io_bio_submit(ctrl, mirror, opf);
> +	io_bio = ctrl->io_bio;
> +	io_bio->bi_iter.bi_sector = logical >> SECTOR_SHIFT;
> +	bio_add_page(io_bio, page, fs_info->sectorsize, pgoff);
> +}
> +
> +static void writeback_good_mirror(struct btrfs_read_repair_ctrl *ctrl,
> +				  int mirror, u64 logical,
> +				  struct page *page, unsigned int pgoff)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
> +	struct bio *io_bio = ctrl->io_bio;
> +
> +
> +	if (btrfs_repair_one_zone(fs_info, ctrl->logical))
> +		return;
> +
> +	/*
> +	 * For RAID56, we can not just write the bad data back, as
> +	 * any write will trigger RMW and read back the corrupted
> +	 * on-disk stripe, causing further damage.
> +	 * So here we do special repair for raid56.
> +	 *
> +	 * And unfortunately, this repair is very low level and not
> +	 * compatible with the rest of the mirror based repair.
> +	 * So it's still done in synchronous mode using
> +	 * btrfs_repair_io_failure().
> +	 */
> +	if (ctrl->is_raid56) {
> +		const u64 file_offset = logical - ctrl->logical +
> +					ctrl->file_offset;
> +		btrfs_repair_io_failure(fs_info,
> +				btrfs_ino(BTRFS_I(ctrl->inode)), file_offset,
> +				fs_info->sectorsize, logical, page, pgoff,
> +				mirror);
> +		return;
> +	}
> +
> +	ASSERT(io_bio);
> +	io_add_or_submit(ctrl, mirror, logical, page, pgoff, REQ_OP_WRITE);
> +}
> +
> +static void repair_from_mirror(struct btrfs_read_repair_ctrl *ctrl, int mirror)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
> +	struct bvec_iter iter;
> +	struct bio_vec bv;
> +	unsigned long old_bitmap = ctrl->bad_bitmap;
> +	const int prev_mirror = get_prev_mirror(mirror, ctrl->num_copies);
> +	int nr_sector;
> +	u32 offset;
> +	int ret;
> +
> +	/*
> +	 * Reset the io_bio logical bytenr so later io_add_or_submit() can do
> +	 * correct check on the logical bytenr.
> +	 */
> +	ctrl->io_bio->bi_iter.bi_sector = 0;
> +
> +	/* Add all bad sectors into io_bio. */
> +	bio_for_each_sector(fs_info, bv, ctrl->bad_sectors, iter, offset) {
> +		u64 logical = ctrl->logical + offset;
> +
> +		nr_sector = offset >> fs_info->sectorsize_bits;
> +
> +		/* Good sectors, no need to handle. */
> +		if (!test_bit(nr_sector, &ctrl->bad_bitmap))
> +			continue;
> +
> +		io_add_or_submit(ctrl, mirror, logical, bv.bv_page,
> +				 bv.bv_offset, REQ_OP_READ | REQ_SYNC);
> +	}
> +	io_bio_submit(ctrl, mirror, REQ_OP_READ | REQ_SYNC);
> +
> +	/* Check the newly read data. */
> +	bio_for_each_sector(fs_info, bv, ctrl->bad_sectors, iter, offset) {
> +		u8 *csum_expected;
> +		u8 csum[BTRFS_CSUM_SIZE];
> +
> +		nr_sector = offset >> fs_info->sectorsize_bits;
> +
> +		/* Originally good sector or read failed, skip. */
> +		if (!test_bit(nr_sector, &old_bitmap) ||
> +		    test_bit(nr_sector, &ctrl->bad_bitmap))
> +			continue;
> +
> +		/* No data csum, only need to repair. */
> +		if (!ctrl->csum)
> +			goto repair;
> +
> +		/*
> +		 * The remaining case is a successful read with csum; we
> +		 * need to recheck the csum.
> +		 */
> +		csum_expected = btrfs_csum_ptr(fs_info, ctrl->csum, offset);
> +		ret = btrfs_check_sector_csum(fs_info, bv.bv_page,
> +				bv.bv_offset, csum, csum_expected);
> +		if (ret) {
> +			set_bit(nr_sector, &ctrl->bad_bitmap);
> +			continue;
> +		}
> +repair:
> +		/*
> +		 * This sector is properly fixed, write it back to previous
> +		 * bad mirror.
> +		 */
> +		writeback_good_mirror(ctrl, prev_mirror, ctrl->logical + offset,
> +				bv.bv_page, bv.bv_offset);
> +	}
> +	/* Submit the last write bio. */
> +	io_bio_submit(ctrl, mirror, REQ_OP_WRITE);
> +}
> +
> +int btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
> +{
> +	struct btrfs_fs_info *fs_info;
> +	struct bvec_iter iter;
> +	struct bio_vec bv;
> +	u32 offset;
> +	int nr_sectors;
> +	int mirror;
> +	int ret = -EIO;
> +
> +	if (!ctrl->inode)
> +		return 0;
> +
> +	fs_info = btrfs_sb(ctrl->inode->i_sb);
> +	nr_sectors = ctrl->len >> fs_info->sectorsize_bits;
> +	ASSERT(ctrl->len);
> +	/* All sectors should be bad initially. */
> +	ASSERT(find_first_zero_bit(&ctrl->bad_bitmap, nr_sectors) == nr_sectors);
> +
> +	for (mirror = get_next_mirror(ctrl->failed_mirror, ctrl->num_copies);
> +	     mirror != ctrl->failed_mirror;
> +	     mirror = get_next_mirror(mirror, ctrl->num_copies)) {
> +		repair_from_mirror(ctrl, mirror);
> +
> +		/* All repaired. */
> +		if (find_first_bit(&ctrl->bad_bitmap, nr_sectors) == nr_sectors) {
> +			ret = 0;
> +			break;
> +		}
> +	}
> +
> +	/* DIO doesn't need any page status/extent update. */
> +	if (!ctrl->is_dio) {
> +		/* Unlock all the pages and unlock the extent range. */
> +		bio_for_each_sector(fs_info, bv, ctrl->bad_sectors, iter,
> +				    offset) {
> +			bool uptodate = !test_bit(offset >>
> +						  fs_info->sectorsize_bits,
> +						  &ctrl->bad_bitmap);
> +
> +			end_sector_io(bv.bv_page, ctrl->file_offset + offset,
> +				      uptodate);
> +		}
> +	}
> +	bio_put(ctrl->bad_sectors);
> +	if (ctrl->io_bio)
> +		bio_put(ctrl->io_bio);
> +	memset(ctrl, 0, sizeof(*ctrl));
> +	return ret;
> +}
> diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
> new file mode 100644
> index 000000000000..87219c786109
> --- /dev/null
> +++ b/fs/btrfs/read-repair.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef BTRFS_READ_REPAIR_H
> +#define BTRFS_READ_REPAIR_H
> +
> +#include <linux/blk_types.h>
> +#include <linux/fs.h>
> +
> +struct btrfs_read_repair_ctrl {
> +	struct inode *inode;
> +
> +	/* The logical bytenr of the first corrupted sector. */
> +	u64 logical;
> +
> +	/* The file offset of the first corrupted sector. */
> +	u64 file_offset;
> +
> +	/* The checksum for the corrupted sectors. */
> +	u8 *csum;
> +
> +	/* Current length of the corrupted range. */
> +	u32 len;
> +
> +	int failed_mirror;
> +	int num_copies;
> +	unsigned long bad_bitmap;
> +	bool is_raid56;
> +	bool is_dio;
> +
> +	/* This is only to hold all the initial bad continuous sectors. */
> +	struct bio *bad_sectors;
> +
> +	/*
> +	 * The bio we use to do the real IO.
> +	 * This bio has to be btrfs_bio, as btrfs_map_bio() will utilize
> +	 * btrfs_bio()->device.
> +	 */
> +	struct bio *io_bio;
> +};
> +
> +int btrfs_read_repair_add_sector(struct inode *inode,
> +				 struct btrfs_read_repair_ctrl *ctrl,
> +				 struct page *page, unsigned int pgoff,
> +				 u64 logical, u64 file_offset, u8 *csum,
> +				 int failed_mirror, bool is_dio);
> +int btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl);
> +
> +#endif


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  3:06   ` Qu Wenruo
@ 2022-05-26  7:30     ` Christoph Hellwig
  2022-05-26  7:37       ` Qu Wenruo
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-05-26  7:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Christoph Hellwig

On Thu, May 26, 2022 at 11:06:31AM +0800, Qu Wenruo wrote:
> Hi Christoph, I'm pretty sure the non-continuous bio problem is present in
> all of our attempts to rework read-repair.

Why is it a problem?  Multiple discontiguous errors in the same bio
are a very unusual error pattern.  We obviously need to handle it, but
it doesn't need to be optimized as it is so rare.  The most common error
pattern is that the entire read will return an error, followed by a single
corrupted sector.

> I'm wondering if there is some "dummy" page provided by the block layer
> that we can utilize?

For reads nvme (and a few SCSI HBAs) support a bit bucket SGL for reads
that discard parts of the data.  Right now upstream none of this is
supported, although Keith has been looking into it (for a rather different
use case) in nvme.  This does not help with writes, never mind the fact
that I would not want to use exotic and barely tested code and hardware
features for a non-time-critical and rarely used error handling path.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  7:30     ` Christoph Hellwig
@ 2022-05-26  7:37       ` Qu Wenruo
  2022-05-26  7:45         ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2022-05-26  7:37 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/5/26 15:30, Christoph Hellwig wrote:
> On Thu, May 26, 2022 at 11:06:31AM +0800, Qu Wenruo wrote:
>> Hi Christoph, I'm pretty sure the non-continuous bio problem is present in
>> all of our attempts to rework read-repair.
>
> Why is it a problem?  Multiple discontiguous errors in the same bio
> are a very unusual error pattern.  We obviously need to handle it, but
> it doesn't need to be optimized as it is so rare.  The most common error
> pattern is that the entire read will return an error, followed by a single
> corrupted sector.

Rare case doesn't mean it won't happen.

We still need to address it anyway.

Furthermore, if we can submit one bio to read the whole mirror range,
without letting the corrupted sectors pollute the repaired result, it
also means we read each range at most (num_copies - 1) times, without
going back to the initial mirror.
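
To make that bound concrete, here is a minimal sketch of the mirror
rotation, assuming mirrors are numbered 1..num_copies (the real
get_next_mirror() is not in the quoted hunks, so this body is only my
assumption of it):

	/*
	 * Hypothetical helper: advance to the next mirror, wrapping
	 * from num_copies back to mirror 1.
	 */
	static int get_next_mirror(int cur_mirror, int num_copies)
	{
		return (cur_mirror % num_copies) + 1;
	}

Starting the loop at get_next_mirror(failed_mirror, num_copies) and
stopping once it wraps back to failed_mirror visits every other mirror
exactly once, which is where the (num_copies - 1) bound comes from.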

>
>> I'm wondering if there is some "dummy" page provided by the block layer
>> that we can utilize?
>
> For reads nvme (and a few SCSI HBAs) support a bit bucket SGL for reads
> that discard parts of the data.  Right now upstream none of this is
> supported, although Keith has been looking into it (for a rather different
> use case) in nvme.  This does not help with writes, never mind the fact
> that I would not want to use exotic and barely tested code and hardware
> features for a non-time-critical and rarely used error handling path.

I'm not proposing the SGL method, but still doing a full range read; the
only difference is, the page ranges we don't care about will be written
to a dustbin page, and only the ranges we care about go into the real pages.

E.g. we allocate a dedicated page per-fs (or even for the whole btrfs
module) as a dustbin page.

When we don't want to read some range, we just add that page into the
bio (this means we may put the same page into the bio several times, and
the page may be utilized by several different bios at the same time),
and then submit the bio.

I'm not sure the current code base can handle the case though.


For writes, it's pretty simple: we only write back the fully corrected range.
If we didn't recover the full corrupted range, we just skip the writeback.
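
A minimal sketch of the read side of the idea, assuming a module-wide
dustbin page (btrfs_dustbin_page and the helper below are hypothetical,
and whether the block layer copes with the same page sitting in one bio
several times is exactly the open question):

	/* Hypothetical: one shared page that absorbs sectors we discard. */
	static struct page *btrfs_dustbin_page;

	static void add_sector_or_dustbin(struct bio *bio,
					  struct btrfs_fs_info *fs_info,
					  struct page *page, unsigned int pgoff,
					  bool wanted)
	{
		if (wanted) {
			bio_add_page(bio, page, fs_info->sectorsize, pgoff);
			return;
		}
		/* Keep the bio contiguous; data read here is thrown away. */
		bio_add_page(bio, btrfs_dustbin_page, fs_info->sectorsize, 0);
	}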

Thanks,
Qu

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  7:37       ` Qu Wenruo
@ 2022-05-26  7:45         ` Christoph Hellwig
  2022-05-26  7:52           ` Qu Wenruo
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-05-26  7:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Thu, May 26, 2022 at 03:37:47PM +0800, Qu Wenruo wrote:
> Rare case doesn't mean it won't happen.
>
> We still need to address it anyway.

address != build overly complicated code to optimize for it

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  7:45         ` Christoph Hellwig
@ 2022-05-26  7:52           ` Qu Wenruo
  2022-05-26  8:00             ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2022-05-26  7:52 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/5/26 15:45, Christoph Hellwig wrote:
> On Thu, May 26, 2022 at 03:37:47PM +0800, Qu Wenruo wrote:
>> Rare case doesn't mean it won't happen.
>>
>> We still need to address it anyway.
> 
> address != build overly complicated code to optimize for it
> 
Well, using seemingly simple code that can lead to way more read loops and
way more data to read is not a good way either.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  7:52           ` Qu Wenruo
@ 2022-05-26  8:00             ` Christoph Hellwig
  2022-05-26  8:07               ` Qu Wenruo
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-05-26  8:00 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Thu, May 26, 2022 at 03:52:03PM +0800, Qu Wenruo wrote:
>
>
> On 2022/5/26 15:45, Christoph Hellwig wrote:
>> On Thu, May 26, 2022 at 03:37:47PM +0800, Qu Wenruo wrote:
>>> Rare case doesn't mean it won't happen.
>>>
>>> We still need to address it anyway.
>>
>> address != build overly complicated code to optimize for it
>>
> Well, using seemingly simple code that can lead to way more read loops and
> way more data to read is not a good way either.

Again, having checkered corruption is an extremely unlikely event.
I'd rather deal with it by doing more reads than code complexity.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  8:00             ` Christoph Hellwig
@ 2022-05-26  8:07               ` Qu Wenruo
  2022-05-26  8:17                 ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2022-05-26  8:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs



On 2022/5/26 16:00, Christoph Hellwig wrote:
> On Thu, May 26, 2022 at 03:52:03PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2022/5/26 15:45, Christoph Hellwig wrote:
>>> On Thu, May 26, 2022 at 03:37:47PM +0800, Qu Wenruo wrote:
>>>> Rare case doesn't mean it won't happen.
>>>>
>>>> We still need to address it anyway.
>>>
>>> address != build overly complicated code to optimize for it
>>>
>> Well, using seemingly simple code that can lead to way more read loops and
>> way more data to read is not a good way either.
> 
> Again, having checkered corruption is an extremely unlikely event.
> I'd rather deal with it by doing more reads than code complexity.
> 

Then the same can be said of almost all the ENOSPC error handling code.
It's less than a 1% chance, but we spend over 10% of the code on it.

And if you really want to go that path, I see no reason why we shouldn't
go with sector-by-sector repair.


Furthermore, if "more reads" means over 10 times the amount we need, I
strongly doubt that it's sane.

Take the same RAID1C3 case: mirror 1 all corrupted, mirrors 2 and 3 in a
checker pattern, filled into a 4MiB range. Try running your version of
the code starting with mirror 1, and see how many loops we need,
especially how many times we read mirror 1 unnecessarily.
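
For scale, a rough count for that layout, assuming 4K sectors (so 4MiB
is 1024 sectors) and ignoring the bitmap size limit that would split
this into smaller batches: the bitmap version does one batched pass over
mirror 2 (reading all 1024 bad sectors, half of which verify good) and
one pass over mirror 3 for the remaining 512, and it never touches
mirror 1 again.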

Thanks,
Qu


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  8:07               ` Qu Wenruo
@ 2022-05-26  8:17                 ` Christoph Hellwig
  2022-05-26  8:26                   ` Qu Wenruo
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-05-26  8:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Thu, May 26, 2022 at 04:07:49PM +0800, Qu Wenruo wrote:
> Then the same can be said of almost all the ENOSPC error handling code.

ENOSPC is a lot more common.

> It's less than a 1% chance, but we spend over 10% of the code on it.
>
> And if you really want to go that path, I see no reason why we shouldn't
> go with sector-by-sector repair.

Because that really sucks for the case where the whole I/O fails.
Which is the common failure scenario.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  8:17                 ` Christoph Hellwig
@ 2022-05-26  8:26                   ` Qu Wenruo
  2022-05-26  8:28                     ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2022-05-26  8:26 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/5/26 16:17, Christoph Hellwig wrote:
> On Thu, May 26, 2022 at 04:07:49PM +0800, Qu Wenruo wrote:
>> Then the same can be said of almost all the ENOSPC error handling code.
>
> ENOSPC is a lot more common.

Sorry, I mean ENOMEM.

>
>> It's less than a 1% chance, but we spend over 10% of the code on it.
>>
>> And if you really want to go that path, I see no reason why we shouldn't
>> go with sector-by-sector repair.
>
> Because that really sucks for the case where the whole I/O fails.
> Which is the common failure scenario.

But it's just a performance problem, which is not that critical.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  8:26                   ` Qu Wenruo
@ 2022-05-26  8:28                     ` Christoph Hellwig
  2022-05-26  8:49                       ` Qu Wenruo
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-05-26  8:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Thu, May 26, 2022 at 04:26:30PM +0800, Qu Wenruo wrote:
>
>
> On 2022/5/26 16:17, Christoph Hellwig wrote:
>> On Thu, May 26, 2022 at 04:07:49PM +0800, Qu Wenruo wrote:
>>> Then the same can be said of almost all the ENOSPC error handling code.
>>
>> ENOSPC is a lot more common.
>
> Sorry, I mean ENOMEM.
>
>>
>>> It's less than a 1% chance, but we spend over 10% of the code on it.
>>>
>>> And if you really want to go that path, I see no reason why we shouldn't
>>> go with sector-by-sector repair.
>>
>> Because that really sucks for the case where the whole I/O fails.
>> Which is the common failure scenario.
>
> But it's just a performance problem, which is not that critical.

I'm officially lost now.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  8:28                     ` Christoph Hellwig
@ 2022-05-26  8:49                       ` Qu Wenruo
  2022-05-26  8:54                         ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2022-05-26  8:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs



On 2022/5/26 16:28, Christoph Hellwig wrote:
> On Thu, May 26, 2022 at 04:26:30PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2022/5/26 16:17, Christoph Hellwig wrote:
>>> On Thu, May 26, 2022 at 04:07:49PM +0800, Qu Wenruo wrote:
>>>> Then the same can be said of almost all the ENOSPC error handling code.
>>>
>>> ENOSPC is a lot more common.
>>
>> Sorry, I mean ENOMEM.
>>
>>>
>>>> It's less than a 1% chance, but we spend over 10% of the code on it.
>>>>
>>>> And if you really want to go that path, I see no reason why we shouldn't
>>>> go with sector-by-sector repair.
>>>
>>> Because that really sucks for the case where the whole I/O fails.
>>> Which is the common failure scenario.
>>
>> But it's just a performance problem, which is not that critical.
>
> I'm officially lost now.

Why? If you care so much about code simplicity, sector-by-sector is
the best.
If you care so much about performance, the latest bitmap version is the
best, no matter whether it's the worst checker pattern or not.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  8:49                       ` Qu Wenruo
@ 2022-05-26  8:54                         ` Christoph Hellwig
  2022-05-26  9:13                           ` Qu Wenruo
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-05-26  8:54 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Thu, May 26, 2022 at 04:49:15PM +0800, Qu Wenruo wrote:
>>>> Because that really sucks for the case where the whole I/O fails.
>>>> Which is the common failure scenario.
>>>
>>> But it's just a performance problem, which is not that critical.
>>
>> I'm officially lost now.
>
> Why? If you care so much about code simplicity, sector-by-sector is
> the best.
> If you care so much about performance, the latest bitmap version is the
> best, no matter whether it's the worst checker pattern or not.

Because you tell me that handling the most common and important case
in read repair is just a performance issue, while you keep arguing for
micro-optimizing a corner case.  And no, for the case of failing a
large bio (which arguably can only happen for buffered I/O at the
moment, but that is another thing to look into) the bitmaps will only
help you for up to 64 sectors.  That is way better than just doing
single-sector synchronous I/O, but not exactly nice, while still being
a fair amount of code compared to just doing variable-sized synchronous
I/O.
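
(For reference, a quick check on that limit, assuming 4K sectors and the
unsigned long bad_bitmap from the quoted struct: BITS_PER_LONG is 64 on
64-bit, so one batch covers at most 64 * 4K = 256K, and a larger failed
bio has to be handled in multiple batches.)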

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  8:54                         ` Christoph Hellwig
@ 2022-05-26  9:13                           ` Qu Wenruo
  2022-05-27  8:10                             ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2022-05-26  9:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs



On 2022/5/26 16:54, Christoph Hellwig wrote:
> On Thu, May 26, 2022 at 04:49:15PM +0800, Qu Wenruo wrote:
>>>>> Because that really sucks for the case where the whole I/O fails.
>>>>> Which is the common failure scenario.
>>>>
>>>> But it's just a performance problem, which is not that critical.
>>>
>>> I'm officially lost now.
>>
>> Why? If you care so much about the code simplicity, sector-by-sector is
>> the best.
>> If you care so much about the performance, the latest bitmap is the
>> best, no matter if it's the worst checker patter or not.
>
> Because you tell me that handling the most common and important case
> in read repair is just a performance issue, while you keep arguing for
> micro-optimizing a corner case.

Because I'm fine either way, but not fine with the middle ground.

I proposed both versions, to fulfill the different requirements.
The tradeoff is unavoidable; we have to pick our poison.

If we want code simplicity, then my argument in support of
sector-by-sector is that repair is already a cold path (the same
argument you make about the checker pattern), thus overall the
performance drop is not that critical.

If we want the best performance (even though corruption is already a
corner case), then there is the bitmap version, handling all cases at
the cost of more complex code.

If you can find a simpler version that still handles the checker pattern
sanely (aka, without reading the same bad mirror again and again), sure,
I'm always fine with going with that version.

Otherwise, sector-by-sector and bitmap seem the more sane choices to
pick between.

Thanks,
Qu

>  And no, for the case of failing a
> large bio (which arguably can only happen for buffered I/O at the
> moment, but that is another thing to look into) the bitmaps will only
> help you for up to 64 sectors.  That is way better than just doing
> single-sector synchronous I/O, but not exactly nice, while still being
> a fair amount of code compared to just doing variable-sized synchronous
> I/O.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
  2022-05-26  9:13                           ` Qu Wenruo
@ 2022-05-27  8:10                             ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2022-05-27  8:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Thu, May 26, 2022 at 05:13:30PM +0800, Qu Wenruo wrote:
>> Because you tell me that handling the most common and important case
>> in read repair is just a performance issue, while you keep arguing for
>> micro-optimizing a corner case.
>
> Because I'm fine either way, but not fine with the middle ground.

Where the middle ground is optimizing for the common case while handling
the non-common cases in a non-optimized way?

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-05-27  8:10 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-25 10:59 [PATCH v2 0/7] btrfs: read-repair rework based on bitmap Qu Wenruo
2022-05-25 10:59 ` [PATCH v2 1/7] btrfs: save the original bi_iter into btrfs_bio for buffered read Qu Wenruo
2022-05-25 10:59 ` [PATCH v2 2/7] btrfs: make repair_io_failure available outside of extent_io.c Qu Wenruo
2022-05-25 10:59 ` [PATCH v2 3/7] btrfs: add a btrfs_map_bio_wait helper Qu Wenruo
2022-05-25 10:59 ` [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure Qu Wenruo
2022-05-26  3:06   ` Qu Wenruo
2022-05-26  7:30     ` Christoph Hellwig
2022-05-26  7:37       ` Qu Wenruo
2022-05-26  7:45         ` Christoph Hellwig
2022-05-26  7:52           ` Qu Wenruo
2022-05-26  8:00             ` Christoph Hellwig
2022-05-26  8:07               ` Qu Wenruo
2022-05-26  8:17                 ` Christoph Hellwig
2022-05-26  8:26                   ` Qu Wenruo
2022-05-26  8:28                     ` Christoph Hellwig
2022-05-26  8:49                       ` Qu Wenruo
2022-05-26  8:54                         ` Christoph Hellwig
2022-05-26  9:13                           ` Qu Wenruo
2022-05-27  8:10                             ` Christoph Hellwig
2022-05-25 10:59 ` [PATCH v2 5/7] btrfs: make buffered read path to use the new read repair infrastructure Qu Wenruo
2022-05-25 10:59 ` [PATCH v2 6/7] btrfs: make direct io " Qu Wenruo
2022-05-25 10:59 ` [PATCH v2 7/7] btrfs: remove io_failure_record infrastructure completely Qu Wenruo
2022-05-25 14:55 ` [PATCH v2 0/7] btrfs: read-repair rework based on bitmap David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.