* [PATCH 00/13] btrfs: make read repair work in synchronous mode
@ 2022-05-03  6:49 Qu Wenruo
  2022-05-03  6:49 ` [PATCH 01/13] btrfs: introduce a pure data checksum checking helper Qu Wenruo
                   ` (13 more replies)
  0 siblings, 14 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
RFC v2 -> v1:
Most updates are to avoid memory deadlocks in the endio function
context.

- Allocate a new mempool for read_repair_bio
  This is to avoid allocating a new btrfs_bio from the same mempool while
  we're still holding one btrfs_bio in its endio function.

  The problem has been there for a long time in the existing code; only
  recently did Christoph mention the possible deadlock scenario.

  Furthermore, our new read_repair_bio is much smaller than btrfs_bio,
  avoiding wasted memory on unused members.

- Submit the assembled bio immediately if the next sector is not mergeable
  Instead of holding bios in a bio list, this gives us a higher chance to
  reclaim the space allocated for the read_repair_bio.

- Pre-allocate the two needed bitmaps inside btrfs_submit_data_bio()
  If we fail to allocate the memory, we just fail the bio, the VFS layer
  will retry with a much smaller range, and we will have a much higher
  chance to allocate the needed memory in the next try.

- Fix the btrfs/157 failure by introducing a RAID56 specific repair path
  The old repair_io_failure() can handle it pretty well; we lose the
  async bio submission in that case, but that should still be acceptable
  just for RAID56.

RFC v1 -> RFC v2:
- Assemble a bio list for read/write bios and submit them in one go
  This requires fewer submit bio hooks, while still allowing us to wait
  for them all to finish.

- Completely remove io_failure_tree infrastructure
  Now we don't need to remember which mirror we hit an error on.
  By the end of end_bio_extent_readpage() we have either got good data
  and done the repair already, or there aren't enough mirrors for us to
  recover all the data.

  This is mostly trading the on-stack memory of end_bio_extent_readpage()
  for btrfs_inode::io_failure_tree.
  The latter tree has a much longer lifespan, thus I think it's still a
  win overall.

[RFC POINTS]
- How to improve read_repair_get_sector()?
  Currently we always iterate the whole bio to grab the target
  page/pgoff.

  Is there a better, cached method to avoid such iteration?

- Is this new code logically more reader-friendly?
  It's certainly more straightforward, but I doubt whether it's any
  easier to read than the old code.

- btrfs/157 failure
  Extra checking is needed to find out why btrfs/157 failed.
  In theory, we should just iterate through all mirrors; my guess is that
  since we have no way to exhaust all combinations, the extra 2 "mirrors"
  can give us a wrong result for RAID6.

[BEFORE]
For corrupted sectors, we just record the logical bytenr and mirror
number into io_failure_tree, then re-queue the same block with a
different mirror number and call it a day.

The re-queued read will enter the same endio function, with extra
failrec handling to either continue re-queuing (csum mismatch/read
failure), or clear the current failrec and submit a write to fix the
corrupted mirror (read succeeded and csum matches/no csum).

This is harder to read, as we need to enter the same river twice or
even more times.
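
Roughly, the old flow looks like this (a simplified call-graph sketch,
not real code):

	end_bio_extent_readpage()          /* csum mismatch found */
	-> submit_data_read_repair()
	   -> btrfs_repair_one_sector()    /* re-queue with another mirror */
	... the re-queued read finishes ...
	end_bio_extent_readpage()          /* re-entered for the repair read */
	-> clean_io_failure()              /* good now, write back the fix */
	   or btrfs_repair_one_sector()    /* still bad, re-queue again */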

[AFTER]

Before submitting a data read bio, we pre-allocate the bitmaps used by
read repair.
If we have no memory, we just fail the bio and let the VFS layer retry
with a smaller range, giving us a better chance to get the memory in the
next try.

For corrupted sectors, we record the following things into an on-stack
structure in end_bio_extent_readpage():

- The original bio

- The original file offset of the bio
  This is for the direct IO case, as we cannot grab the file offset just
  using page_offset().

- Offset inside the bio of the corrupted sector

- Corrupted mirror

Then in the new btrfs_read_repair_ctrl structure, we hold that info
like this:

Original bio logical = X, file_offset = Y, inode=(R/I)

Offset inside bio: 0   4K  8K  12K 16K
cur_bad_bitmap     | X | X |   | X |

Each set bit indicates a corrupted sector inside the original bio.
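
Marking a corrupted sector then only needs the offset of that sector
inside the original bio. A minimal sketch using the kernel bitmap API
(the real call is in the patches below):

	/* One bit per sector, bit 0 is the first sector of the bio. */
	set_bit(bio_offset >> fs_info->sectorsize_bits,
		btrfs_bio(failed_bio)->read_repair_cur_bitmap);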

After we have iterated over all sectors of the original bio, we call
btrfs_read_repair_finish() to do the real repair (see the sketch after
this list) by:

- Assemble and submit read bios
  For the above case, bio offset [0, 8K) will be inside one bio, while
  another bio covers bio offset [12K, 16K).

  And the page/pgoff will be extracted from the original bio.

  This is a little different from the old behavior, which submitted a
  new bio for each sector.
  The new behavior saves some btrfs_map_bio() calls.

- Submit the last read bio and wait for all of them to finish

- Re-verify the read result

- Submit writes to the corrupted mirror
  We handle these just like the read bios: assemble and submit them.

  And for repaired sectors, remove them from @cur_bad_bitmap.

- Repeat the loop until either 1) we have tried all mirrors, or 2) there
  are no more corrupted sectors
  
- Handle the remaining corrupted sectors
  Either mark them as errors for buffered read, or just return an error
  for direct IO.
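
A minimal pseudo-code sketch of the above loop (the unqualified helper
names here are illustrative only, not the real functions):

	/* Sketch of btrfs_read_repair_finish(), error handling omitted. */
	mirror = ctrl->init_mirror;
	while (!bitmap_empty(cur_bad_bitmap, nbits)) {
		mirror = next_mirror(mirror, ctrl->num_copies);
		if (mirror == ctrl->init_mirror)
			break;	/* All mirrors tried, give up. */

		/* Assemble and submit merged read bios for bad sectors. */
		read_bad_sectors_from(ctrl, mirror);
		wait_event(ctrl->io_wait, atomic_read(&ctrl->io_bytes) == 0);

		/* Re-verify csums; repaired sectors get their bits cleared
		 * and the good data is written back to the bad mirrors. */
		verify_and_writeback(ctrl, mirror);
	}
	/* Any bit still set is unrepairable, mark it as an error. */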

By this we can:
- Remove the re-entry behavior of the endio function
  Now everything is handled inside end_bio_extent_readpage().

- Remove the io_failure_tree completely
  As we don't need to record which mirror has failed.

- Slightly reduce the overhead of read repair
  Now we won't call btrfs_map_bio() for each corrupted sector, as we can
  merge the sectors into a much larger bio.

Qu Wenruo (13):
  btrfs: introduce a pure data checksum checking helper
  btrfs: quit early if the fs has no RAID56 support for raid56 related
    checks
  btrfs: save the original bi_iter into btrfs_bio for buffered read
  btrfs: remove duplicated parameters from submit_data_read_repair()
  btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  btrfs: add a helper to queue a corrupted sector for read repair
  btrfs: introduce a helper to repair from one mirror
  btrfs: allow btrfs read repair to submit writes in asynchronous mode
  btrfs: handle RAID56 read repair differently
  btrfs: switch buffered read to the new read repair routine
  btrfs: switch direct IO routine to use btrfs_read_repair_ctrl
  btrfs: remove io_failure_record infrastructure completely
  btrfs: remove btrfs_inode::io_failure_tree

 fs/btrfs/Makefile            |   2 +-
 fs/btrfs/btrfs_inode.h       |   5 -
 fs/btrfs/compression.c       |  10 +-
 fs/btrfs/ctree.h             |   2 +
 fs/btrfs/extent-io-tree.h    |  15 --
 fs/btrfs/extent_io.c         | 490 +++++------------------------------
 fs/btrfs/extent_io.h         |  28 +-
 fs/btrfs/inode.c             | 121 +++++----
 fs/btrfs/read-repair.c       | 459 ++++++++++++++++++++++++++++++++
 fs/btrfs/read-repair.h       |  84 ++++++
 fs/btrfs/super.c             |   9 +-
 fs/btrfs/volumes.c           |   6 +
 fs/btrfs/volumes.h           |   4 +
 include/trace/events/btrfs.h |   1 -
 14 files changed, 695 insertions(+), 541 deletions(-)
 create mode 100644 fs/btrfs/read-repair.c
 create mode 100644 fs/btrfs/read-repair.h

-- 
2.36.0



* [PATCH 01/13] btrfs: introduce a pure data checksum checking helper
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03 15:03   ` Christoph Hellwig
  2022-05-03  6:49 ` [PATCH 02/13] btrfs: quit early if the fs has no RAID56 support for raid56 related checks Qu Wenruo
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

Although we have several pieces of data csum verification code, we never
have a function that really just verifies the checksum of one sector.

Function check_data_csum() does extra work for error reporting, thus it
requires a lot of extra things like the file offset, bio_offset, etc.

Function btrfs_verify_data_csum() is even worse, as it utilizes the page
Checked flag, which means it cannot be used for direct IO pages.

Here we introduce a new helper, btrfs_check_data_sector(), which really
only accepts a sector inside a page and the expected checksum pointer.

We use this function to implement check_data_csum(), and export it for
the incoming patches.
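
A minimal sketch of the intended calling pattern, mirroring what
check_data_csum() does in the diff below:

	const u32 csum_size = fs_info->csum_size;
	unsigned int offset_sectors;
	u8 *csum_expected;

	/* Locate the expected csum for this sector inside the bbio. */
	offset_sectors = bio_offset >> fs_info->sectorsize_bits;
	csum_expected = ((u8 *)bbio->csum) + offset_sectors * csum_size;

	if (btrfs_check_data_sector(fs_info, page, pgoff, csum_expected)) {
		/* -EIO, csum mismatch; the caller decides how to report. */
	}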

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c | 10 ++++------
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/inode.c       | 45 ++++++++++++++++++++++++++++++------------
 3 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 8fda38a58706..69c060dc024c 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -152,7 +152,6 @@ static int check_compressed_csum(struct btrfs_inode *inode, struct bio *bio,
 	const u32 sectorsize = fs_info->sectorsize;
 	struct page *page;
 	unsigned int i;
-	char *kaddr;
 	u8 csum[BTRFS_CSUM_SIZE];
 	struct compressed_bio *cb = bio->bi_private;
 	u8 *cb_sum = cb->sums;
@@ -175,12 +174,11 @@ static int check_compressed_csum(struct btrfs_inode *inode, struct bio *bio,
 		/* Hash through the page sector by sector */
 		for (pg_offset = 0; pg_offset < bytes_left;
 		     pg_offset += sectorsize) {
-			kaddr = kmap_atomic(page);
-			crypto_shash_digest(shash, kaddr + pg_offset,
-					    sectorsize, csum);
-			kunmap_atomic(kaddr);
+			int ret;
 
-			if (memcmp(&csum, cb_sum, csum_size) != 0) {
+			ret = btrfs_check_data_sector(fs_info, page, pg_offset,
+						      cb_sum);
+			if (ret) {
 				btrfs_print_data_csum_error(inode, disk_start,
 						csum, cb_sum, cb->mirror_num);
 				if (btrfs_bio(bio)->device)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dd23f78664f1..296fc2052aa9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3252,6 +3252,8 @@ u64 btrfs_file_extent_end(const struct btrfs_path *path);
 /* inode.c */
 void btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 			   int mirror_num, unsigned long bio_flags);
+int btrfs_check_data_sector(struct btrfs_fs_info *fs_info, struct page *page,
+			    u32 pgoff, u8 *csum_expected);
 unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
 				    u32 bio_offset, struct page *page,
 				    u64 start, u64 end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 507e3a0055e5..5b1a60a25ef6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3299,6 +3299,32 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 				       finish_ordered_fn, uptodate);
 }
 
+/*
+ * Verify the checksum for a single sector without any extra action that
+ * depend on the type of I/O.
+ */
+int btrfs_check_data_sector(struct btrfs_fs_info *fs_info, struct page *page,
+			    u32 pgoff, u8 *csum_expected)
+{
+	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+	char *kaddr;
+	const u32 len = fs_info->sectorsize;
+	const u32 csum_size = fs_info->csum_size;
+	u8 csum[BTRFS_CSUM_SIZE];
+
+	ASSERT(pgoff + len <= PAGE_SIZE);
+
+	kaddr = kmap_local_page(page) + pgoff;
+	shash->tfm = fs_info->csum_shash;
+
+	crypto_shash_digest(shash, kaddr, len, csum);
+	kunmap_local(kaddr);
+
+	if (memcmp(csum, csum_expected, csum_size))
+		return -EIO;
+	return 0;
+}
+
 /*
  * check_data_csum - verify checksum of one sector of uncompressed data
  * @inode:	inode
@@ -3309,15 +3335,16 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
  * @start:	logical offset in the file
  *
  * The length of such check is always one sector size.
+ *
+ * When a csum mismatch is detected, we also report the error and fill
+ * the corrupted range with zero. (Thus it needs the extra parameters.)
  */
 static int check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
 			   u32 bio_offset, struct page *page, u32 pgoff,
 			   u64 start)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
-	char *kaddr;
-	u32 len = fs_info->sectorsize;
+	const u32 len = fs_info->sectorsize;
 	const u32 csum_size = fs_info->csum_size;
 	unsigned int offset_sectors;
 	u8 *csum_expected;
@@ -3328,17 +3355,9 @@ static int check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
 	offset_sectors = bio_offset >> fs_info->sectorsize_bits;
 	csum_expected = ((u8 *)bbio->csum) + offset_sectors * csum_size;
 
-	kaddr = kmap_atomic(page);
-	shash->tfm = fs_info->csum_shash;
-
-	crypto_shash_digest(shash, kaddr + pgoff, len, csum);
-	kunmap_atomic(kaddr);
-
-	if (memcmp(csum, csum_expected, csum_size))
-		goto zeroit;
+	if (!btrfs_check_data_sector(fs_info, page, pgoff, csum_expected))
+		return 0;
 
-	return 0;
-zeroit:
 	btrfs_print_data_csum_error(BTRFS_I(inode), start, csum, csum_expected,
 				    bbio->mirror_num);
 	if (bbio->device)
-- 
2.36.0



* [PATCH 02/13] btrfs: quit early if the fs has no RAID56 support for raid56 related checks
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
  2022-05-03  6:49 ` [PATCH 01/13] btrfs: introduce a pure data checksum checking helper Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03  6:49 ` [PATCH 03/13] btrfs: save the original bi_iter into btrfs_bio for buffered read Qu Wenruo
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

The following functions do special handling for RAID56 chunks:

- btrfs_is_parity_mirror()
  Check if the range is in RAID56 chunks.

- btrfs_full_stripe_len()
  Either return sectorsize for non-RAID56 profiles or full stripe length
  for RAID56 chunks.

But if a filesystem has no RAID56 chunks at all, it will not have the
RAID56 incompat flag set, and we can skip the chunk tree lookup
completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index eeddd5256411..a66541ef3b18 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5781,6 +5781,9 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
 	struct map_lookup *map;
 	unsigned long len = fs_info->sectorsize;
 
+	if (!btrfs_fs_incompat(fs_info, RAID56))
+		return len;
+
 	em = btrfs_get_chunk_map(fs_info, logical, len);
 
 	if (!WARN_ON(IS_ERR(em))) {
@@ -5798,6 +5801,9 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	struct map_lookup *map;
 	int ret = 0;
 
+	if (!btrfs_fs_incompat(fs_info, RAID56))
+		return 0;
+
 	em = btrfs_get_chunk_map(fs_info, logical, len);
 
 	if(!WARN_ON(IS_ERR(em))) {
-- 
2.36.0



* [PATCH 03/13] btrfs: save the original bi_iter into btrfs_bio for buffered read
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
  2022-05-03  6:49 ` [PATCH 01/13] btrfs: introduce a pure data checksum checking helper Qu Wenruo
  2022-05-03  6:49 ` [PATCH 02/13] btrfs: quit early if the fs has no RAID56 support for raid56 related checks Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03  6:49 ` [PATCH 04/13] btrfs: remove duplicated parameters from submit_data_read_repair() Qu Wenruo
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

Although we have btrfs_bio::iter, it currently has very limited usage:

- RAID56
  Which is not needed at all

- btrfs_bio_clone()
  This is used mostly for direct IO.

For the incoming read repair patches, we want to grab the original
logical bytenr, and to be able to iterate the range of the bio (no
matter whether it's cloned).

So this patch saves btrfs_bio::iter for buffered read bios at
submit_one_bio().
And for the sake of consistency, also save the btrfs_bio::iter for
direct IO at btrfs_submit_dio_bio().

The reason we don't save the iter in btrfs_map_bio() is that
btrfs_map_bio() handles various bios, with or without the btrfs_bio
bioset.
We want btrfs_map_bio() to handle, and only handle, plain bios without
bothering about the bioset.
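
A short sketch of what the saved copy buys us at endio time
(bio->bi_iter may already be advanced or belong to a clone, while
btrfs_bio::iter keeps the submit-time view):

	/* Grab the original logical bytenr, even for a cloned bio. */
	u64 logical = btrfs_bio(bio)->iter.bi_sector << SECTOR_SHIFT;
	u32 size = btrfs_bio(bio)->iter.bi_size;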

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 12 ++++++++++++
 fs/btrfs/inode.c     |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 07888cce3bce..0ae4ee7f344d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -186,6 +186,18 @@ static void submit_one_bio(struct bio *bio, int mirror_num, unsigned long bio_fl
 	/* Caller should ensure the bio has at least some range added */
 	ASSERT(bio->bi_iter.bi_size);
 
+	/*
+	 * Save the original bi_iter for read bios, as read repair wants the
+	 * original logical bytenr.
+	 *
+	 * We don't do this in btrfs_map_bio() because that function is
+	 * bioset independent.
+	 * We can later pass bios without btrfs_bio or with other bioset into
+	 * btrfs_map_bio().
+	 */
+	if (bio_op(bio) == REQ_OP_READ)
+		btrfs_bio(bio)->iter = bio->bi_iter;
+
 	if (is_data_inode(tree->private_data))
 		btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
 					    bio_flags);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5b1a60a25ef6..f4dfb79fafce 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7926,6 +7926,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 		ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
 		if (ret)
 			goto err;
+		/* Check submit_one_bio() for the reason. */
+		btrfs_bio(bio)->iter = bio->bi_iter;
 	}
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-- 
2.36.0



* [PATCH 04/13] btrfs: remove duplicated parameters from submit_data_read_repair()
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 03/13] btrfs: save the original bi_iter into btrfs_bio for buffered read Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03  6:49 ` [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors Qu Wenruo
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Christoph Hellwig

The function submit_data_read_repair() is only called for the buffered
data read path, thus those parameters can be calculated from the bvec
directly:

- start
  start = page_offset(bvec->bv_page) + bvec->bv_offset;

- end
  end = start + bvec->bv_len - 1;

- page
  page = bvec->bv_page;

- pgoff
  pgoff = bvec->bv_offset;

Thus we can safely replace those 4 parameters with just one bio_vec.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0ae4ee7f344d..240277cdccd2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2740,13 +2740,16 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 
 static blk_status_t submit_data_read_repair(struct inode *inode,
 					    struct bio *failed_bio,
-					    u32 bio_offset, struct page *page,
-					    unsigned int pgoff,
-					    u64 start, u64 end,
+					    u32 bio_offset,
+					    const struct bio_vec *bvec,
 					    int failed_mirror,
 					    unsigned int error_bitmap)
 {
+	const unsigned int pgoff = bvec->bv_offset;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct page *page = bvec->bv_page;
+	const u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
+	const u64 end = start + bvec->bv_len - 1;
 	const u32 sectorsize = fs_info->sectorsize;
 	const int nr_bits = (end + 1 - start) >> fs_info->sectorsize_bits;
 	int error = 0;
@@ -3106,10 +3109,8 @@ static void end_bio_extent_readpage(struct bio *bio)
 			 * submit_data_read_repair() will handle all the good
 			 * and bad sectors, we just continue to the next bvec.
 			 */
-			submit_data_read_repair(inode, bio, bio_offset, page,
-						start - page_offset(page),
-						start, end, mirror,
-						error_bitmap);
+			submit_data_read_repair(inode, bio, bio_offset, bvec,
+						mirror, error_bitmap);
 
 			ASSERT(bio_offset + len > bio_offset);
 			bio_offset += len;
-- 
2.36.0



* [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 04/13] btrfs: remove duplicated parameters from submit_data_read_repair() Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03 15:06   ` Christoph Hellwig
  2022-05-03  6:49 ` [PATCH 06/13] btrfs: add a helper to queue a corrupted sector for read repair Qu Wenruo
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

This patch will introduce a new infrastructure, btrfs_read_repair_ctrl,
to record exactly which sectors are corrupted at read time, and hold
other misc members for read time repair.

The new structure is going to be an on-stack structure for
end_bio_extent_readpage().

This new infrastructure will need the following new memory:

- For btrfs_read_repair_ctrl structure itself
  We use on-stack memory of end_bio_extent_readpage().
  Currently end_bio_extent_readpage() is the highest level endio
  function, thus no further endio function will be called after it.
  The extra ~100 bytes on the stack should be safe enough.

- For the extra bitmaps
  We use btrfs_bio::read_repair_{cur|prev}_bitmap.
  This adds an extra 16 bytes to btrfs_bio.

  This is unavoidable, as otherwise we would need to allocate extra
  memory in the endio function, which can be deadlock prone.

  Furthermore, pre-allocating the bitmaps has a hidden benefit: if we're
  submitting a large read and fail to allocate enough memory for the
  bitmaps, we fail the bio, and the VFS layer will automatically retry
  with a smaller read range, giving us a higher chance to succeed in the
  next try.

Currently we only allocate the bitmaps and initialize various members;
no real work is done yet.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/Makefile      |  2 +-
 fs/btrfs/extent_io.c   | 20 ++++++++--
 fs/btrfs/inode.c       | 13 +++++++
 fs/btrfs/read-repair.c | 88 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/read-repair.h | 47 ++++++++++++++++++++++
 fs/btrfs/volumes.h     |  4 ++
 6 files changed, 170 insertions(+), 4 deletions(-)
 create mode 100644 fs/btrfs/read-repair.c
 create mode 100644 fs/btrfs/read-repair.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 99f9995670ea..0b2605c750ca 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o tree-mod-log.o
+	   subpage.o tree-mod-log.o read-repair.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 240277cdccd2..0a7dfb638e0f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -29,6 +29,7 @@
 #include "subpage.h"
 #include "zoned.h"
 #include "block-group.h"
+#include "read-repair.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -2738,7 +2739,8 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 		btrfs_subpage_end_reader(fs_info, page, start, len);
 }
 
-static blk_status_t submit_data_read_repair(struct inode *inode,
+static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
+					    struct inode *inode,
 					    struct bio *failed_bio,
 					    u32 bio_offset,
 					    const struct bio_vec *bvec,
@@ -2752,6 +2754,9 @@ static blk_status_t submit_data_read_repair(struct inode *inode,
 	const u64 end = start + bvec->bv_len - 1;
 	const u32 sectorsize = fs_info->sectorsize;
 	const int nr_bits = (end + 1 - start) >> fs_info->sectorsize_bits;
+	/* The failed offset should be the file offset of the failed bio. */
+	const u64 failed_offset = page_offset(bio_first_page_all(failed_bio)) +
+				  bio_first_bvec_all(failed_bio)->bv_offset;
 	int error = 0;
 	int i;
 
@@ -2785,6 +2790,13 @@ static blk_status_t submit_data_read_repair(struct inode *inode,
 			goto next;
 		}
 
+		/*
+		 * Currently we only update the bitmap and do nothing in
+		 * btrfs_read_repair_finish(), thus it can co-exist with the
+		 * old code.
+		 */
+		btrfs_read_repair_add_sector(inode, ctrl, failed_bio,
+					     bio_offset + offset, failed_offset);
 		ret = btrfs_repair_one_sector(inode, failed_bio,
 				bio_offset + offset,
 				page, pgoff + offset, start + offset,
@@ -3021,6 +3033,7 @@ static struct extent_buffer *find_extent_buffer_readpage(
  */
 static void end_bio_extent_readpage(struct bio *bio)
 {
+	struct btrfs_read_repair_ctrl ctrl = { 0 };
 	struct bio_vec *bvec;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	struct extent_io_tree *tree, *failure_tree;
@@ -3109,8 +3122,8 @@ static void end_bio_extent_readpage(struct bio *bio)
 			 * submit_data_read_repair() will handle all the good
 			 * and bad sectors, we just continue to the next bvec.
 			 */
-			submit_data_read_repair(inode, bio, bio_offset, bvec,
-						mirror, error_bitmap);
+			submit_data_read_repair(&ctrl, inode, bio, bio_offset,
+						bvec, mirror, error_bitmap);
 
 			ASSERT(bio_offset + len > bio_offset);
 			bio_offset += len;
@@ -3155,6 +3168,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 	}
 	/* Release the last extent */
 	endio_readpage_release_extent(&processed, NULL, 0, 0, false);
+	btrfs_read_repair_finish(&ctrl);
 	btrfs_bio_free_csum(bbio);
 	bio_put(bio);
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f4dfb79fafce..24f6f30ea77f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -55,6 +55,7 @@
 #include "zoned.h"
 #include "subpage.h"
 #include "inode-item.h"
+#include "read-repair.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -2586,6 +2587,18 @@ void btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
 		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
 
+	if (bio_op(bio) == REQ_OP_READ) {
+		int tmp;
+
+		/* Allocate the extra bitmap for read time repair. */
+		tmp = btrfs_read_repair_alloc_bitmaps(fs_info, btrfs_bio(bio));
+		if (tmp) {
+			ret = errno_to_blk_status(tmp);
+			goto out;
+		}
+	}
+
+
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
 		struct page *page = bio_first_bvec_all(bio)->bv_page;
 		loff_t file_offset = page_offset(page);
diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
new file mode 100644
index 000000000000..048bb7c16c56
--- /dev/null
+++ b/fs/btrfs/read-repair.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bio.h>
+#include "ctree.h"
+#include "volumes.h"
+#include "read-repair.h"
+
+int btrfs_read_repair_alloc_bitmaps(struct btrfs_fs_info *fs_info,
+				    struct btrfs_bio *bbio)
+{
+	const u32 nbits = bbio->bio.bi_iter.bi_size >> fs_info->sectorsize_bits;
+
+	ASSERT(nbits);
+	ASSERT(!bbio->read_repair_cur_bitmap &&
+	       !bbio->read_repair_prev_bitmap);
+	bbio->read_repair_cur_bitmap = bitmap_alloc(nbits, GFP_NOFS);
+	bbio->read_repair_prev_bitmap = bitmap_alloc(nbits, GFP_NOFS);
+	if (!bbio->read_repair_cur_bitmap ||
+	    !bbio->read_repair_prev_bitmap) {
+		kfree(bbio->read_repair_cur_bitmap);
+		kfree(bbio->read_repair_prev_bitmap);
+		bbio->read_repair_cur_bitmap = NULL;
+		bbio->read_repair_prev_bitmap = NULL;
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+void btrfs_read_repair_add_sector(struct inode *inode,
+				  struct btrfs_read_repair_ctrl *ctrl,
+				  struct bio *failed_bio, u32 bio_offset,
+				  u64 file_offset)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+	/* Not yet initialized. */
+	if (!ctrl->failed_bio) {
+		const u32 sectorsize = fs_info->sectorsize;
+
+		/* We should have a valid logical bytenr. */
+		ASSERT(btrfs_bio(failed_bio)->iter.bi_sector);
+		ctrl->failed_bio = failed_bio;
+		ctrl->logical = btrfs_bio(failed_bio)->iter.bi_sector <<
+				SECTOR_SHIFT;
+		ctrl->file_offset = file_offset;
+		ctrl->bio_size = btrfs_bio(failed_bio)->iter.bi_size;
+		ASSERT(ctrl->bio_size);
+		ASSERT(IS_ALIGNED(ctrl->bio_size, fs_info->sectorsize));
+		ctrl->inode = inode;
+		ctrl->init_mirror = btrfs_bio(failed_bio)->mirror_num;
+		/* At endio time, btrfs_bio::mirror should never be 0. */
+		ASSERT(ctrl->init_mirror);
+		ctrl->num_copies = btrfs_num_copies(fs_info, ctrl->logical,
+						    sectorsize);
+		/*
+		 * We use btrfs_bio::read_repair_bitmaps, which should be
+		 * preallocated before submission.
+		 * If not, we hit -ENOMEM and there's no need to do any repair.
+		 */
+		if (!btrfs_bio(failed_bio)->read_repair_cur_bitmap ||
+		    !btrfs_bio(failed_bio)->read_repair_prev_bitmap)
+			ctrl->error = true;
+	}
+	if (ctrl->error)
+		return;
+
+	ASSERT(ctrl->inode);
+	ASSERT(ctrl->inode == inode);
+	/*
+	 * The leading half of the bitmap is for current corrupted bits.
+	 * Thus we can just set the bit without any extra offset.
+	 */
+	set_bit(bio_offset >> fs_info->sectorsize_bits,
+		btrfs_bio(failed_bio)->read_repair_cur_bitmap);
+}
+
+void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
+{
+	if (!ctrl->failed_bio)
+		return;
+
+	kfree(btrfs_bio(ctrl->failed_bio)->read_repair_cur_bitmap);
+	kfree(btrfs_bio(ctrl->failed_bio)->read_repair_prev_bitmap);
+	btrfs_bio(ctrl->failed_bio)->read_repair_cur_bitmap = NULL;
+	btrfs_bio(ctrl->failed_bio)->read_repair_prev_bitmap = NULL;
+	ctrl->error = false;
+	ctrl->failed_bio = NULL;
+}
diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
new file mode 100644
index 000000000000..42a251f1300b
--- /dev/null
+++ b/fs/btrfs/read-repair.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef BTRFS_READ_REPAIR_H
+#define BTRFS_READ_REPAIR_H
+
+#include <linux/blk_types.h>
+#include <linux/fs.h>
+#include <linux/bitmap.h>
+
+/* Structure for data read time repair. */
+struct btrfs_read_repair_ctrl {
+	struct inode *inode;
+
+	/* The initial failed bio, we will grab page/pgoff from it */
+	struct bio *failed_bio;
+
+	/* The logical bytenr of the original bio. */
+	u64 logical;
+
+	/* File offset of the original bio. */
+	u64 file_offset;
+
+	/* The original bio size for the whole read. */
+	u32 bio_size;
+
+	/* Initial mirror number we hit the first error. */
+	u8 init_mirror;
+
+	/* How many copies we have. */
+	u8 num_copies;
+
+	/*
+	 * This is for bitmap allocation failure.
+	 * This means the btrfs_bio::read_repair_*_bitmap allocation failed
+	 * at bio allocation time.
+	 */
+	bool error;
+};
+
+int btrfs_read_repair_alloc_bitmaps(struct btrfs_fs_info *fs_info,
+				    struct btrfs_bio *bbio);
+void btrfs_read_repair_add_sector(struct inode *inode,
+				  struct btrfs_read_repair_ctrl *ctrl,
+				  struct bio *failed_bio, u32 bio_offset,
+				  u64 file_offset);
+void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl);
+#endif
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 197877e684df..420a4b8eefd0 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -337,6 +337,10 @@ struct btrfs_bio {
 	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
 	struct bvec_iter iter;
 
+	/* Bitmaps used for read time repair. */
+	unsigned long *read_repair_cur_bitmap;
+	unsigned long *read_repair_prev_bitmap;
+
 	/*
 	 * This member must come last, bio_alloc_bioset will allocate enough
 	 * bytes for entire btrfs_bio but relies on bio being last.
-- 
2.36.0



* [PATCH 06/13] btrfs: add a helper to queue a corrupted sector for read repair
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03 15:07   ` Christoph Hellwig
  2022-05-03  6:49 ` [PATCH 07/13] btrfs: introduce a helper to repair from one mirror Qu Wenruo
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

The new helper, read_repair_bio_add_sector(), will either:

- Add the sector range into btrfs_read_repair_ctrl::cur_bio
  If we have the bio allocated and the sector range is adjacent to the
  bio.

- Allocate a new btrfs_read_repair_bio and add the sector range to it
  If we already have btrfs_read_repair_ctrl::cur_bio, we first submit
  it, then allocate a new one.

This patch also introduces a new type of bioset specifically for read
repair. The reasons are:

- A lot of extra members in btrfs_bio make no sense for read repair

- Possible mempool deadlock under heavy memory pressure.
  We may have exhausted the last btrfs_bio in the mempool, and would then
  fail to allocate a new btrfs_bio in our endio function.

  This problem has been there for a long time in the existing read
  repair code; in the endio function we're allocating a new btrfs_bio for
  each corrupted sector, which is way worse.

Thus in this patch we introduce a new bioset specifically for read repair
usage.
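
The deadlock scenario we're avoiding looks roughly like this (a sketch,
not real code):

	/*
	 * All btrfs_bios in the btrfs_bioset mempool are in flight, and
	 * we are inside the endio function of one of them:
	 */
	bio_alloc_bioset(..., &btrfs_bioset);
	/*
	 * This can block forever: the mempool is only refilled when a
	 * btrfs_bio is freed, but the one we hold is not freed until
	 * this endio function returns.
	 */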

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/read-repair.c | 117 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/read-repair.h |  33 ++++++++++++
 fs/btrfs/super.c       |   9 +++-
 3 files changed, 157 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
index 048bb7c16c56..f75842dcdd02 100644
--- a/fs/btrfs/read-repair.c
+++ b/fs/btrfs/read-repair.c
@@ -5,6 +5,8 @@
 #include "volumes.h"
 #include "read-repair.h"
 
+static struct bio_set read_repair_bioset;
+
 int btrfs_read_repair_alloc_bitmaps(struct btrfs_fs_info *fs_info,
 				    struct btrfs_bio *bbio)
 {
@@ -52,6 +54,8 @@ void btrfs_read_repair_add_sector(struct inode *inode,
 		ASSERT(ctrl->init_mirror);
 		ctrl->num_copies = btrfs_num_copies(fs_info, ctrl->logical,
 						    sectorsize);
+		init_waitqueue_head(&ctrl->io_wait);
+		atomic_set(&ctrl->io_bytes, 0);
 		/*
 		 * We use btrfs_bio::read_repair_bitmaps, which should be
 		 * preallocated before submission.
@@ -74,6 +78,104 @@ void btrfs_read_repair_add_sector(struct inode *inode,
 		btrfs_bio(failed_bio)->read_repair_cur_bitmap);
 }
 
+static struct btrfs_read_repair_bio *repair_bio(struct bio *bio)
+{
+	return container_of(bio, struct btrfs_read_repair_bio, bio);
+}
+
+static void read_repair_end_bio(struct bio *bio)
+{
+	struct btrfs_read_repair_ctrl *ctrl = bio->bi_private;
+	const struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	struct bvec_iter_all iter_all;
+	struct bio_vec *bvec;
+	const u64 logical = repair_bio(bio)->logical;
+	u32 offset = 0;
+	bool uptodate = (bio->bi_status == BLK_STS_OK);
+
+	bio_for_each_segment_all(bvec, bio, iter_all) {
+		unsigned long *bitmap =
+			btrfs_bio(ctrl->failed_bio)->read_repair_cur_bitmap;
+		/*
+		 * If we have a successful read, clear the error bit.
+		 * In read_repair_finish(), we will re-check the csum
+		 * (if exists) later.
+		 */
+		if (uptodate)
+			clear_bit((logical + offset - ctrl->logical) >>
+				fs_info->sectorsize_bits, bitmap);
+		offset += bvec->bv_len;
+	}
+	atomic_sub(offset, &ctrl->io_bytes);
+	wake_up(&ctrl->io_wait);
+	bio_put(bio);
+}
+
+/* Add a sector into the read repair bios list for later submission */
+static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
+				       struct page *page, unsigned int pgoff,
+				       int sector_nr, int mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	struct btrfs_read_repair_bio *rbio;
+	struct bio *bio;
+	int ret;
+
+	/* Check if the sector can be added to the last bio */
+	if (ctrl->cur_bio) {
+		bio = ctrl->cur_bio;
+		rbio = repair_bio(bio);
+
+		ASSERT(rbio->logical);
+		if (rbio->logical + bio->bi_iter.bi_size ==
+		    ctrl->logical + (sector_nr << fs_info->sectorsize_bits))
+			goto add;
+
+		ASSERT(bio_op(bio) == REQ_OP_READ);
+		ASSERT(bio->bi_private == ctrl);
+		ASSERT(bio->bi_end_io == read_repair_end_bio);
+		ASSERT(repair_bio(bio)->logical >= ctrl->logical &&
+		       repair_bio(bio)->logical + bio->bi_iter.bi_size <=
+		       ctrl->logical + ctrl->bio_size);
+		/*
+		 * The current bio is not adjacent to the current range,
+		 * just submit it.
+		 *
+		 * Our endio is super atomic, and we don't want to waste time on
+		 * looking up data csums. So here we just call btrfs_map_bio()
+		 * directly.
+		 */
+		ret = btrfs_map_bio(fs_info, bio, mirror);
+		if (ret) {
+			bio->bi_status = ret;
+			bio_endio(bio);
+		}
+		ctrl->cur_bio = NULL;
+	}
+	ASSERT(ctrl->cur_bio == NULL);
+	bio = bio_alloc_bioset(NULL, BIO_MAX_VECS, 0, GFP_NOFS,
+			       &read_repair_bioset);
+	/* It's backed by mempool, thus should not fail */
+	ASSERT(bio);
+
+	rbio = repair_bio(bio);
+	rbio->logical = ctrl->logical + (sector_nr << fs_info->sectorsize_bits);
+	bio->bi_opf = REQ_OP_READ;
+	bio->bi_iter.bi_sector = rbio->logical >> SECTOR_SHIFT;
+	bio->bi_private = ctrl;
+	bio->bi_end_io = read_repair_end_bio;
+	ctrl->cur_bio = bio;
+
+add:
+	ret = bio_add_page(bio, page, fs_info->sectorsize, pgoff);
+	/*
+	 * We allocated the read bio with enough bvecs to contain
+	 * the original bio, thus it should not fail to add a sector.
+	 */
+	ASSERT(ret == fs_info->sectorsize);
+	atomic_add(fs_info->sectorsize, &ctrl->io_bytes);
+}
+
 void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 {
 	if (!ctrl->failed_bio)
@@ -86,3 +188,18 @@ void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 	ctrl->error = false;
 	ctrl->failed_bio = NULL;
 }
+
+void __cold btrfs_read_repair_exit(void)
+{
+	bioset_exit(&read_repair_bioset);
+}
+
+int __init btrfs_read_repair_init(void)
+{
+	int ret;
+
+	ret = bioset_init(&read_repair_bioset, BIO_POOL_SIZE,
+			  offsetof(struct btrfs_read_repair_bio, bio),
+			  BIOSET_NEED_BVECS);
+	return ret;
+}
diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
index 42a251f1300b..3e1430489f89 100644
--- a/fs/btrfs/read-repair.h
+++ b/fs/btrfs/read-repair.h
@@ -14,6 +14,13 @@ struct btrfs_read_repair_ctrl {
 	/* The initial failed bio, we will grab page/pgoff from it */
 	struct bio *failed_bio;
 
+	/* Currently assembled bio for read/write */
+	struct bio *cur_bio;
+
+	wait_queue_head_t io_wait;
+
+	atomic_t io_bytes;
+
 	/* The logical bytenr of the original bio. */
 	u64 logical;
 
@@ -39,9 +46,35 @@ struct btrfs_read_repair_ctrl {
 
 int btrfs_read_repair_alloc_bitmaps(struct btrfs_fs_info *fs_info,
 				    struct btrfs_bio *bbio);
+
+/*
+ * Extra info for read repair bios.
+ *
+ * Read repair bios require less info compared to btrfs_bio:
+ * - No need for csum
+ * - No need for mirror_num
+ * - No need for file_offset
+ *   They can all be fetched from the btrfs_read_repair_ctrl stored in
+ *   bi_private.
+ *
+ * - No need for iter
+ *   We use logical bytenr from btrfs_read_repair_bio::logical.
+ *   For bio iteration, our read repair bio will never cross stripe boundary,
+ *   thus we can use regular bio_for_each_segment_all().
+ *
+ * - No need for device
+ *   We just don't care at all.
+ */
+struct btrfs_read_repair_bio {
+	u64 logical;
+	struct bio bio;
+};
+
 void btrfs_read_repair_add_sector(struct inode *inode,
 				  struct btrfs_read_repair_ctrl *ctrl,
 				  struct bio *failed_bio, u32 bio_offset,
 				  u64 file_offset);
 void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl);
+int __init btrfs_read_repair_init(void);
+void __cold btrfs_read_repair_exit(void);
 #endif
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 206f44005c52..db476e675962 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -48,6 +48,7 @@
 #include "block-group.h"
 #include "discard.h"
 #include "qgroup.h"
+#include "read-repair.h"
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
 
@@ -2642,10 +2643,12 @@ static int __init init_btrfs_fs(void)
 	err = extent_io_init();
 	if (err)
 		goto free_cachep;
-
-	err = extent_state_cache_init();
+	err = btrfs_read_repair_init();
 	if (err)
 		goto free_extent_io;
+	err = extent_state_cache_init();
+	if (err)
+		goto free_read_repair;
 
 	err = extent_map_init();
 	if (err)
@@ -2709,6 +2712,8 @@ static int __init init_btrfs_fs(void)
 	extent_map_exit();
 free_extent_state_cache:
 	extent_state_cache_exit();
+free_read_repair:
+	btrfs_read_repair_exit();
 free_extent_io:
 	extent_io_exit();
 free_cachep:
-- 
2.36.0



* [PATCH 07/13] btrfs: introduce a helper to repair from one mirror
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (5 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 06/13] btrfs: add a helper to queue a corrupted sector for read repair Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03  6:49 ` [PATCH 08/13] btrfs: allow btrfs read repair to submit writes in asynchronous mode Qu Wenruo
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

The new helper, read_repair_from_one_mirror(), will repair the data by:

- Assemble a bio and submit it for all corrupted sectors
  During the procedure, we will try to merge as many sectors as possible
  into one bio.
  If not mergeable, we submit the current bio (if we have one), then
  allocate a new one from read_repair_bioset.

  The extra bioset is to prevent allocating a btrfs_bio while the
  btrfs_bioset mempool is waiting for us to release the current one.

  Also, to make sector iteration easier, we introduce a new helper,
  btrfs_bio_for_each_sector().

- Submit the last bio and wait for all submitted bios to finish
  Here we don't want to waste time re-searching the csum, thus we use
  btrfs_map_bio() directly.

- Each successful read will clear the bit in
  btrfs_bio::read_repair_cur_bitmap

- Verify each sector of the newly read data
  We have several different combinations:

  * The read failed for one sector
    We just keep the bit in @cur_bad_bitmap, and leave it for the next
    mirror.

  * The read succeeded, and the original bio has no data checksum
    We consider this a win, clear the error bit and update the page
    status

  * The read succeeded, but csum still mismatches
    Leave the error bit for the next mirror.

  * The read succeeded, and csum matches
    Clear the error bit and update the page status.

- Queue the repaired data to write back to the bad mirror
  And output a message to indicate the block got repaired.
  The message has changed a little: since we're going with mirror-based
  repair, we don't care about the device.

  Thus now we only output root/ino/file_off/logical and the good mirror
  number.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c   |  39 +++++------
 fs/btrfs/extent_io.h   |   4 ++
 fs/btrfs/read-repair.c | 150 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 163 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0a7dfb638e0f..3ca366742ce6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2325,9 +2325,9 @@ int free_io_failure(struct extent_io_tree *failure_tree,
  * currently, there can be no more than two copies of every data bit. thus,
  * exactly one rewrite is required.
  */
-static int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
-			     u64 length, u64 logical, struct page *page,
-			     unsigned int pg_offset, int mirror_num)
+int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
+			    u64 length, u64 logical, struct page *page,
+			    unsigned int pg_offset, int mirror_num)
 {
 	struct btrfs_device *dev;
 	struct bio_vec bvec;
@@ -2419,8 +2419,9 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = eb->pages[i];
 
-		ret = repair_io_failure(fs_info, 0, start, PAGE_SIZE, start, p,
-					start - page_offset(p), mirror_num);
+		ret = btrfs_repair_io_failure(fs_info, 0, start, PAGE_SIZE,
+					      start, p, start - page_offset(p),
+					      mirror_num);
 		if (ret)
 			break;
 		start += PAGE_SIZE;
@@ -2470,9 +2471,9 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
 		num_copies = btrfs_num_copies(fs_info, failrec->logical,
 					      failrec->len);
 		if (num_copies > 1)  {
-			repair_io_failure(fs_info, ino, start, failrec->len,
-					  failrec->logical, page, pg_offset,
-					  failrec->failed_mirror);
+			btrfs_repair_io_failure(fs_info, ino, start,
+					failrec->len, failrec->logical,
+					page, pg_offset, failrec->failed_mirror);
 		}
 	}
 
@@ -2631,7 +2632,7 @@ static bool btrfs_check_repairable(struct inode *inode,
 	 *
 	 * Since we're only doing repair for one sector, we only need to get
 	 * a good copy of the failed sector and if we succeed, we have setup
-	 * everything for repair_io_failure to do the rest for us.
+	 * everything for btrfs_repair_io_failure() to do the rest for us.
 	 */
 	ASSERT(failed_mirror);
 	failrec->failed_mirror = failed_mirror;
@@ -2711,7 +2712,7 @@ int btrfs_repair_one_sector(struct inode *inode,
 	return BLK_STS_OK;
 }
 
-static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
+void btrfs_end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
 
@@ -2817,7 +2818,7 @@ static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
 		if (!error)
 			error = ret;
 next:
-		end_page_read(page, uptodate, start + offset, sectorsize);
+		btrfs_end_page_read(page, uptodate, start + offset, sectorsize);
 		if (uptodate)
 			set_extent_uptodate(&BTRFS_I(inode)->io_tree,
 					start + offset,
@@ -3162,7 +3163,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 		bio_offset += len;
 
 		/* Update page status and unlock */
-		end_page_read(page, uptodate, start, len);
+		btrfs_end_page_read(page, uptodate, start, len);
 		endio_readpage_release_extent(&processed, BTRFS_I(inode),
 					      start, end, PageUptodate(page));
 	}
@@ -3695,14 +3696,14 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 					    &cached, GFP_NOFS);
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1, &cached);
-			end_page_read(page, true, cur, iosize);
+			btrfs_end_page_read(page, true, cur, iosize);
 			break;
 		}
 		em = __get_extent_map(inode, page, pg_offset, cur,
 				      end - cur + 1, em_cached);
 		if (IS_ERR(em)) {
 			unlock_extent(tree, cur, end);
-			end_page_read(page, false, cur, end + 1 - cur);
+			btrfs_end_page_read(page, false, cur, end + 1 - cur);
 			ret = PTR_ERR(em);
 			break;
 		}
@@ -3783,7 +3784,7 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 					    &cached, GFP_NOFS);
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1, &cached);
-			end_page_read(page, true, cur, iosize);
+			btrfs_end_page_read(page, true, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3792,7 +3793,7 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		if (test_range_bit(tree, cur, cur_end,
 				   EXTENT_UPTODATE, 1, NULL)) {
 			unlock_extent(tree, cur, cur + iosize - 1);
-			end_page_read(page, true, cur, iosize);
+			btrfs_end_page_read(page, true, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3802,7 +3803,7 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		 */
 		if (block_start == EXTENT_MAP_INLINE) {
 			unlock_extent(tree, cur, cur + iosize - 1);
-			end_page_read(page, false, cur, iosize);
+			btrfs_end_page_read(page, false, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3820,7 +3821,7 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 			 * will never be unlocked.
 			 */
 			unlock_extent(tree, cur, end);
-			end_page_read(page, false, cur, end + 1 - cur);
+			btrfs_end_page_read(page, false, cur, end + 1 - cur);
 			goto out;
 		}
 		cur = cur + iosize;
@@ -5854,7 +5855,7 @@ static bool page_range_has_eb(struct btrfs_fs_info *fs_info, struct page *page)
 			return true;
 		/*
 		 * Even there is no eb refs here, we may still have
-		 * end_page_read() call relying on page::private.
+		 * btrfs_end_page_read() call relying on page::private.
 		 */
 		if (atomic_read(&subpage->readers))
 			return true;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b390ec79f9a8..1cf9c0bdbbdf 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -294,6 +294,10 @@ int btrfs_repair_one_sector(struct inode *inode,
 			    struct page *page, unsigned int pgoff,
 			    u64 start, int failed_mirror,
 			    submit_bio_hook_t *submit_bio_hook);
+int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
+			    u64 length, u64 logical, struct page *page,
+			    unsigned int pg_offset, int mirror_num);
+void btrfs_end_page_read(struct page *page, bool uptodate, u64 start, u32 len);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 bool find_lock_delalloc_range(struct inode *inode,
diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
index f75842dcdd02..3169f01e961b 100644
--- a/fs/btrfs/read-repair.c
+++ b/fs/btrfs/read-repair.c
@@ -4,6 +4,7 @@
 #include "ctree.h"
 #include "volumes.h"
 #include "read-repair.h"
+#include "btrfs_inode.h"
 
 static struct bio_set read_repair_bioset;
 
@@ -111,6 +112,31 @@ static void read_repair_end_bio(struct bio *bio)
 	bio_put(bio);
 }
 
+static void read_repair_submit_bio(struct btrfs_read_repair_ctrl *ctrl,
+				   struct btrfs_read_repair_bio *rbio,
+				   int mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
+	blk_status_t ret;
+
+	ASSERT(bio_op(&rbio->bio) == REQ_OP_READ);
+	ASSERT(rbio->bio.bi_private == ctrl);
+	ASSERT(rbio->bio.bi_end_io == read_repair_end_bio);
+	ASSERT(rbio->logical >= ctrl->logical &&
+	       rbio->logical + rbio->bio.bi_iter.bi_size <=
+	       ctrl->logical + ctrl->bio_size);
+	/*
+	 * Our endio is super atomic, and we don't want to waste time on
+	 * looking up data csums. So here we just call btrfs_map_bio()
+	 * directly.
+	 */
+	ret = btrfs_map_bio(fs_info, &rbio->bio, mirror);
+	if (ret) {
+		rbio->bio.bi_status = ret;
+		bio_endio(&rbio->bio);
+	}
+}
+
 /* Add a sector into the read repair bios list for later submission */
 static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 				       struct page *page, unsigned int pgoff,
@@ -131,21 +157,11 @@ static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 		    ctrl->logical + (sector_nr << fs_info->sectorsize_bits))
 			goto add;
 
-		ASSERT(bio_op(bio) == REQ_OP_READ);
-		ASSERT(bio->bi_private == ctrl);
-		ASSERT(bio->bi_end_io == read_repair_end_bio);
-		ASSERT(repair_bio(bio)->logical >= ctrl->logical &&
-		       repair_bio(bio)->logical + bio->bi_iter.bi_size <=
-		       ctrl->logical + ctrl->bio_size);
 		/*
 		 * The current bio is not adjacent to the current range,
 		 * just submit it.
-		 *
-		 * Our endio is super atomic, and we don't want to waste time on
-		 * looking up data csums. So here we just call btrfs_map_bio()
-		 * directly.
 		 */
-		ret = btrfs_map_bio(fs_info, bio, mirror);
+		read_repair_submit_bio(ctrl, rbio, mirror);
 		if (ret) {
 			bio->bi_status = ret;
 			bio_endio(bio);
@@ -176,6 +192,118 @@ static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 	atomic_add(fs_info->sectorsize, &ctrl->io_bytes);
 }
 
+static int get_prev_mirror(int cur_mirror, int num_copy)
+{
+	/* In the context of read-repair, we never use 0 as mirror_num. */
+	ASSERT(cur_mirror);
+	return (cur_mirror - 1 <= 0) ? (num_copy) : cur_mirror - 1;
+}
+
+/*
+ * A helper to iterate every sector of a btrfs_bio (@bbio).
+ *
+ * @bvl and @iter follow the same usage from __bio_for_each_segment().
+ * @bit will indicate the sector number inside the bbio.
+ */
+#define btrfs_bio_for_each_sector(fs_info, bvl, bbio, iter, bit)\
+	for ((iter) = (bbio->iter), bit = 0;			\
+	     (iter).bi_size &&					\
+	     ((bvl = bio_iter_iovec((&bbio->bio), (iter))), 1);	\
+	     bit++, bio_advance_iter_single((&bbio->bio), 	\
+		     &(iter), fs_info->sectorsize))
+
+static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
+					struct inode *inode, int mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	const int nbits = ctrl->bio_size >> fs_info->sectorsize_bits;
+	const u32 sectorsize = fs_info->sectorsize;
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	unsigned long *cur_bitmap =
+		btrfs_bio(ctrl->failed_bio)->read_repair_cur_bitmap;
+	unsigned long *prev_bitmap =
+		btrfs_bio(ctrl->failed_bio)->read_repair_prev_bitmap;
+	int bit;
+
+	/* We shouldn't have any pending IO */
+	ASSERT(!ctrl->cur_bio && atomic_read(&ctrl->io_bytes) == 0);
+
+	/*
+	 * @cur_bad_bitmap contains the corrupted sectors, save it to
+	 * @prev_bad_bitmap.
+	 * Now @cur_bad_bitmap is our workspace bitmap.
+	 */
+	bitmap_copy(prev_bitmap, cur_bitmap, nbits);
+
+	btrfs_bio_for_each_sector(fs_info, bvec, btrfs_bio(ctrl->failed_bio),
+				  iter, bit) {
+		/* Skip good sectors. */
+		if (!test_bit(bit, cur_bitmap))
+			continue;
+		/* Queue and submit bad sectors. */
+		read_repair_bio_add_sector(ctrl, bvec.bv_page, bvec.bv_offset,
+					   bit, mirror);
+	}
+	/* Submit the last assembled bio and wait for all bios to finish. */
+	ASSERT(ctrl->cur_bio);
+	read_repair_submit_bio(ctrl, repair_bio(ctrl->cur_bio), mirror);
+	wait_event(ctrl->io_wait, atomic_read(&ctrl->io_bytes) == 0);
+
+	/* Now re-verify the newly read out data */
+	btrfs_bio_for_each_sector(fs_info, bvec, btrfs_bio(ctrl->failed_bio),
+				  iter, bit) {
+		struct btrfs_inode *binode = BTRFS_I(ctrl->inode);
+		const u64 file_offset = ctrl->file_offset +
+					(bit << fs_info->sectorsize_bits);
+		const u64 logical = ctrl->logical +
+				    (bit << fs_info->sectorsize_bits);
+		struct extent_state *cached = NULL;
+		u8 *csum = NULL;
+		int ret;
+
+		/* Skip already good sectors. */
+		if (!test_bit(bit, prev_bitmap))
+			continue;
+		/* Still bad sectors will be kept for next mirror. */
+		if (test_bit(bit, cur_bitmap))
+			continue;
+
+		if (btrfs_bio(ctrl->failed_bio)->csum)
+			csum = btrfs_bio(ctrl->failed_bio)->csum +
+				bit * fs_info->csum_size;
+		/* No csum, and we got a good read, the data is good now. */
+		if (!csum)
+			goto uptodate;
+
+		ret = btrfs_check_data_sector(fs_info, bvec.bv_page,
+					      bvec.bv_offset, csum);
+		if (ret) {
+			/* Csum mismatch, needs retry in next mirror. */
+			set_bit(bit, cur_bitmap);
+			continue;
+		}
+uptodate:
+		/*
+		 * We repaired one sector, write the correct data back to the bad
+		 * mirror.
+		 */
+		btrfs_repair_io_failure(fs_info, btrfs_ino(binode), file_offset,
+					sectorsize, logical, bvec.bv_page,
+					bvec.bv_offset,
+					get_prev_mirror(mirror, ctrl->num_copies));
+
+		/* Update the page status and extent locks. */
+		btrfs_end_page_read(bvec.bv_page, true, file_offset, sectorsize);
+		set_extent_uptodate(&binode->io_tree,
+				file_offset, file_offset + sectorsize - 1,
+				&cached, GFP_ATOMIC);
+		unlock_extent_cached_atomic(&binode->io_tree,
+				file_offset, file_offset + sectorsize - 1,
+				&cached);
+	}
+}
+
 void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 {
 	if (!ctrl->failed_bio)
-- 
2.36.0



* [PATCH 08/13] btrfs: allow btrfs read repair to submit writes in asynchronous mode
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (6 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 07/13] btrfs: introduce a helper to repair from one mirror Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03  6:49 ` [PATCH 09/13] btrfs: handle RAID56 read repair differently Qu Wenruo
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

Currently if we want to submit a write for read-time repair, we call
btrfs_repair_io_failure(), which will submit the bio and wait for it.

But for our newer btrfs_read_repair infrastructure, we want to submit
the write bios and only wait for all of them to finish, just like how
we handle the read bios.

This patch gets rid of the btrfs_repair_io_failure() call, replacing it
with the same bio handling as the read path: first try to merge the
sector into the current bio, and if it is not mergeable, submit the
current bio and allocate a new one.

Finally submit the last bio and wait for all the write bios to finish.
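
A condensed sketch of the per-sector flow (error handling omitted;
read_repair_alloc_bio() is a stand-in for the actual allocation code):

	if (ctrl->cur_bio &&
	    bio_add_page(ctrl->cur_bio, page, sectorsize, pgoff) != sectorsize) {
		/* Not mergeable, flush the bio assembled so far. */
		read_repair_submit_bio(ctrl, repair_bio(ctrl->cur_bio), mirror);
		ctrl->cur_bio = NULL;
	}
	if (!ctrl->cur_bio) {
		/* Start a new bio with bi_opf set to the read/write op. */
		ctrl->cur_bio = read_repair_alloc_bio(ctrl, opf, sector_nr);
		bio_add_page(ctrl->cur_bio, page, sectorsize, pgoff);
	}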

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/read-repair.c | 46 +++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
index 3169f01e961b..aecdc4ee54ba 100644
--- a/fs/btrfs/read-repair.c
+++ b/fs/btrfs/read-repair.c
@@ -119,7 +119,6 @@ static void read_repair_submit_bio(struct btrfs_read_repair_ctrl *ctrl,
 	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
 	blk_status_t ret;
 
-	ASSERT(bio_op(&rbio->bio) == REQ_OP_READ);
 	ASSERT(rbio->bio.bi_private == ctrl);
 	ASSERT(rbio->bio.bi_end_io == read_repair_end_bio);
 	ASSERT(rbio->logical >= ctrl->logical &&
@@ -140,13 +139,22 @@ static void read_repair_submit_bio(struct btrfs_read_repair_ctrl *ctrl,
 /* Add a sector into the read repair bios list for later submission */
 static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 				       struct page *page, unsigned int pgoff,
-				       int sector_nr, int mirror)
+				       int sector_nr, int mirror,
+				       unsigned int opf)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(ctrl->inode->i_sb);
 	struct btrfs_read_repair_bio *rbio;
 	struct bio *bio;
 	int ret;
 
+	ASSERT(opf == REQ_OP_WRITE || opf == REQ_OP_READ);
+
+	/* For writes, we need to handle the zoned case first. */
+	if (opf == REQ_OP_WRITE) {
+		if (btrfs_repair_one_zone(fs_info, ctrl->logical))
+			return;
+	}
+
 	/* Check if the sector can be added to the last bio */
 	if (ctrl->cur_bio) {
 		bio = ctrl->cur_bio;
@@ -162,10 +170,6 @@ static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 		 * just submit it.
 		 */
 		read_repair_submit_bio(ctrl, rbio, mirror);
-		if (ret) {
-			bio->bi_status = ret;
-			bio_endio(bio);
-		}
 		ctrl->cur_bio = NULL;
 	}
 	ASSERT(ctrl->cur_bio == NULL);
@@ -176,7 +180,7 @@ static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 
 	rbio = repair_bio(bio);
 	rbio->logical = ctrl->logical + (sector_nr << fs_info->sectorsize_bits);
-	bio->bi_opf = REQ_OP_READ;
+	bio->bi_opf = opf;
 	bio->bi_iter.bi_sector = rbio->logical >> SECTOR_SHIFT;
 	bio->bi_private = ctrl;
 	bio->bi_end_io = read_repair_end_bio;
@@ -190,6 +194,15 @@ static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 	 */
 	ASSERT(ret == fs_info->sectorsize);
 	atomic_add(fs_info->sectorsize, &ctrl->io_bytes);
+
+	/* Output a message that we repaired a sector. */
+	btrfs_info_rl(fs_info,
+"read error corrected: root %lld ino %llu off %llu logical %llu from good mirror %d",
+		BTRFS_I(ctrl->inode)->root->root_key.objectid,
+		btrfs_ino(BTRFS_I(ctrl->inode)),
+		ctrl->file_offset + (sector_nr << fs_info->sectorsize_bits),
+		ctrl->logical + (sector_nr << fs_info->sectorsize_bits),
+		mirror);
 }
 
 static int get_prev_mirror(int cur_mirror, int num_copy)
@@ -243,11 +256,12 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 			continue;
 		/* Queue and submit bad sectors. */
 		read_repair_bio_add_sector(ctrl, bvec.bv_page, bvec.bv_offset,
-					   bit, mirror);
+					   bit, mirror, REQ_OP_READ);
 	}
 	/* Submit the last assembled bio and wait for all bios to finish. */
 	ASSERT(ctrl->cur_bio);
 	read_repair_submit_bio(ctrl, repair_bio(ctrl->cur_bio), mirror);
+	ctrl->cur_bio = NULL;
 	wait_event(ctrl->io_wait, atomic_read(&ctrl->io_bytes) == 0);
 
 	/* Now re-verify the newly read out data */
@@ -256,8 +270,6 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 		struct btrfs_inode *binode = BTRFS_I(ctrl->inode);
 		const u64 file_offset = ctrl->file_offset +
 					(bit << fs_info->sectorsize_bits);
-		const u64 logical = ctrl->logical +
-				    (bit << fs_info->sectorsize_bits);
 		struct extent_state *cached = NULL;
 		u8 *csum = NULL;
 		int ret;
@@ -288,10 +300,10 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 		 * We repaired one sector, write the correct data back to the bad
 		 * mirror.
 		 */
-		btrfs_repair_io_failure(fs_info, btrfs_ino(binode), file_offset,
-					sectorsize, logical, bvec.bv_page,
-					bvec.bv_offset,
-					get_prev_mirror(mirror, ctrl->num_copies));
+		read_repair_bio_add_sector(ctrl, bvec.bv_page, bvec.bv_offset,
+					bit, get_prev_mirror(mirror,
+							     ctrl->num_copies),
+					REQ_OP_WRITE);
 
 		/* Update the page status and extent locks. */
 		btrfs_end_page_read(bvec.bv_page, true, file_offset, sectorsize);
@@ -302,6 +314,12 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 				file_offset, file_offset + sectorsize - 1,
 				&cached);
 	}
+	/* Submit the last write bio from above loop and wait for them. */
+	if (ctrl->cur_bio)
+		read_repair_submit_bio(ctrl, repair_bio(ctrl->cur_bio),
+				get_prev_mirror(mirror, ctrl->num_copies));
+	ctrl->cur_bio = NULL;
+	wait_event(ctrl->io_wait, atomic_read(&ctrl->io_bytes) == 0);
 }
 
 void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
-- 
2.36.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 09/13] btrfs: handle RAID56 read repair differently
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (7 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 08/13] btrfs: allow btrfs read repair to submit writes in asynchronous mode Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03  6:49 ` [PATCH 10/13] btrfs: switch buffered read to the new read repair routine Qu Wenruo
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

Our current read repair facility does its work relying completely on
the mirror number.
And for a repaired sector, it will write the correct data back to the
bad mirror.

This works great for mirror based profiles, but for RAID56 it's a
different story.

A partial write in btrfs RAID56 leads to an unconditional RMW cycle,
completely ignoring the mirror number (which indicates the corrupted
data stripe number).

This will cause us to read back the corrupted data on-disk, resulting
in further corruption.

To address it, we introduce btrfs_read_repair_ctrl::is_raid56, and for
RAID56 read-repair we fall back to the tried-and-true
btrfs_repair_io_failure().

That function handles RAID56 by using BTRFS_MAP_READ for
btrfs_map_block() and directly writes the correct data back to disk,
avoiding the RMW problem.

Unfortunately we lose the asynchronous bio assembly/submission, but
that should still be more or less acceptable considering RAID56 is
really an oddball here.
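
As a simplified illustration: consider a RAID5 full stripe with data
stripes D1 and D2 and parity P, where a sector of D1 is corrupted and
the good copy was rebuilt from D2 XOR P. Writing the repaired sector
back as a normal sub-stripe write triggers RMW, which re-reads the rest
of the stripe from disk to recompute P; any still-corrupted on-disk
sectors outside the written range then poison the new parity,
destroying the very redundancy the repair relied on.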

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/read-repair.c | 26 ++++++++++++++++++++++++++
 fs/btrfs/read-repair.h |  2 ++
 2 files changed, 28 insertions(+)

diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
index aecdc4ee54ba..e1b11990a480 100644
--- a/fs/btrfs/read-repair.c
+++ b/fs/btrfs/read-repair.c
@@ -55,6 +55,8 @@ void btrfs_read_repair_add_sector(struct inode *inode,
 		ASSERT(ctrl->init_mirror);
 		ctrl->num_copies = btrfs_num_copies(fs_info, ctrl->logical,
 						    sectorsize);
+		ctrl->is_raid56 = btrfs_is_parity_mirror(fs_info,
+						ctrl->logical, sectorsize);
 		init_waitqueue_head(&ctrl->io_wait);
 		atomic_set(&ctrl->io_bytes, 0);
 		/*
@@ -153,6 +155,30 @@ static void read_repair_bio_add_sector(struct btrfs_read_repair_ctrl *ctrl,
 	if (opf == REQ_OP_WRITE) {
 		if (btrfs_repair_one_zone(fs_info, ctrl->logical))
 			return;
+
+		/*
+		 * For RAID56, we cannot just write the bad data back, as
+		 * any write will trigger RMW and read back the corrupted
+		 * on-disk stripe, causing further damage.
+		 * So here we do special repair for raid56.
+		 *
+		 * And unfortunately, this repair is very low level and not
+		 * compatible with the rest of the mirror based repair.
+		 * So it's still done in synchronous mode using
+		 * btrfs_repair_io_failure().
+		 */
+		if (ctrl->is_raid56) {
+			const u64 logical = ctrl->logical +
+					(sector_nr << fs_info->sectorsize_bits);
+			const u64 file_offset = ctrl->file_offset +
+					(sector_nr << fs_info->sectorsize_bits);
+
+			btrfs_repair_io_failure(fs_info,
+					btrfs_ino(BTRFS_I(ctrl->inode)),
+					file_offset, fs_info->sectorsize,
+					logical, page, pgoff, mirror);
+			return;
+		}
 	}
 
 	/* Check if the sector can be added to the last bio */
diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
index 3e1430489f89..6cc816e2ce4a 100644
--- a/fs/btrfs/read-repair.h
+++ b/fs/btrfs/read-repair.h
@@ -42,6 +42,8 @@ struct btrfs_read_repair_ctrl {
 	 * at bio allocation time.
 	 */
 	bool error;
+
+	bool is_raid56;
 };
 
 int btrfs_read_repair_alloc_bitmaps(struct btrfs_fs_info *fs_info,
-- 
2.36.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 10/13] btrfs: switch buffered read to the new read repair routine
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (8 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 09/13] btrfs: handle RAID56 read repair differently Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03  6:49 ` [PATCH 11/13] btrfs: switch direct IO routine to use btrfs_read_repair_ctrl Qu Wenruo
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

Instead of relying on failrec and failure_tree, we use the new
btrfs_read_repair_ctrl to do the repair.

The read time repair will have a different timing now:

- Function btrfs_submit_data_bio() will pre-allocate the bitmaps
  So if we fail to grab enough memory, the data read bio will fail
  early with ENOMEM, and VFS will retry with a much smaller range,
  giving us a higher chance to get the memory allocated in the next
  try.

- Function end_bio_extent_readpage() verifies the csum
  This is the same as usual.

- Function submit_data_read_repair() will call
  btrfs_read_repair_add_sector()
  This function will only record the corrupted sectors.

- After all the bvecs are iterated, call btrfs_read_repair_finish() to
  do the repair
  This will assemble the needed bios, submit them, wait for them to
  finish, and re-check the csum for all the remaining mirrors.

This means end_bio_extent_readpage() will handle all the read repair
in-house, without re-entering the same endio function.

Currently this only works for the buffered data read path.
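
A minimal sketch of the resulting flow in end_bio_extent_readpage()
(heavily simplified: one sector per bvec assumed, subpage and
error-bitmap handling omitted):

	struct btrfs_read_repair_ctrl ctrl = { 0 };
	struct bio_vec *bvec;
	struct bvec_iter_all iter_all;
	u32 bio_offset = 0;

	/* One pass over the bio: verify csums, record the bad sectors. */
	bio_for_each_segment_all(bvec, bio, iter_all) {
		u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;

		if (check_data_csum(inode, bbio, bio_offset, bvec->bv_page,
				    bvec->bv_offset, start))
			btrfs_read_repair_add_sector(inode, &ctrl, bio,
						     bio_offset, start);
		bio_offset += fs_info->sectorsize;
	}
	/* Assemble, submit, wait and re-verify against other mirrors. */
	btrfs_read_repair_finish(&ctrl);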

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c   | 68 +++++++++++++-----------------------------
 fs/btrfs/read-repair.c | 61 +++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+), 47 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3ca366742ce6..9077f0b06b92 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2740,15 +2740,11 @@ void btrfs_end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 		btrfs_subpage_end_reader(fs_info, page, start, len);
 }
 
-static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
-					    struct inode *inode,
-					    struct bio *failed_bio,
-					    u32 bio_offset,
-					    const struct bio_vec *bvec,
-					    int failed_mirror,
-					    unsigned int error_bitmap)
-{
-	const unsigned int pgoff = bvec->bv_offset;
+static void submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
+				    struct inode *inode, struct bio *failed_bio,
+				    u32 bio_offset, const struct bio_vec *bvec,
+				    int failed_mirror, unsigned int error_bitmap)
+{
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct page *page = bvec->bv_page;
 	const u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
@@ -2758,7 +2754,6 @@ static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
 	/* The failed offset should be the file offset of the failed bio. */
 	const u64 failed_offset = page_offset(bio_first_page_all(failed_bio)) +
 				  bio_first_bvec_all(failed_bio)->bv_offset;
-	int error = 0;
 	int i;
 
 	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
@@ -2780,7 +2775,6 @@ static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
 		const unsigned int offset = i * sectorsize;
 		struct extent_state *cached = NULL;
 		bool uptodate = false;
-		int ret;
 
 		if (!(error_bitmap & (1U << i))) {
 			/*
@@ -2788,48 +2782,28 @@ static blk_status_t submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
 			 * and unlock the range.
 			 */
 			uptodate = true;
-			goto next;
+			btrfs_end_page_read(page, uptodate, start + offset,
+					    sectorsize);
+			if (uptodate)
+				set_extent_uptodate(&BTRFS_I(inode)->io_tree,
+						start + offset,
+						start + offset + sectorsize - 1,
+						&cached, GFP_ATOMIC);
+			unlock_extent_cached_atomic(&BTRFS_I(inode)->io_tree,
+					start + offset,
+					start + offset + sectorsize - 1,
+					&cached);
+			continue;
 		}
 
 		/*
-		 * Currently we only update the bitmap and do nothing in
-		 * btrfs_read_repair_finish(), thus it can co-exist with the
-		 * old code.
+		 * The page read end and extent unlock will be handled in
+		 * btrfs_read_repair_finish().
 		 */
 		btrfs_read_repair_add_sector(inode, ctrl, failed_bio,
-					     bio_offset + offset, failed_offset);
-		ret = btrfs_repair_one_sector(inode, failed_bio,
-				bio_offset + offset,
-				page, pgoff + offset, start + offset,
-				failed_mirror, btrfs_submit_data_bio);
-		if (!ret) {
-			/*
-			 * We have submitted the read repair, the page release
-			 * will be handled by the endio function of the
-			 * submitted repair bio.
-			 * Thus we don't need to do any thing here.
-			 */
-			continue;
-		}
-		/*
-		 * Repair failed, just record the error but still continue.
-		 * Or the remaining sectors will not be properly unlocked.
-		 */
-		if (!error)
-			error = ret;
-next:
-		btrfs_end_page_read(page, uptodate, start + offset, sectorsize);
-		if (uptodate)
-			set_extent_uptodate(&BTRFS_I(inode)->io_tree,
-					start + offset,
-					start + offset + sectorsize - 1,
-					&cached, GFP_ATOMIC);
-		unlock_extent_cached_atomic(&BTRFS_I(inode)->io_tree,
-				start + offset,
-				start + offset + sectorsize - 1,
-				&cached);
+					     bio_offset + offset,
+					     failed_offset);
 	}
-	return errno_to_blk_status(error);
 }
 
 /* lots and lots of room for performance fixes in the end_bio funcs */
diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
index e1b11990a480..2e03de17c2b8 100644
--- a/fs/btrfs/read-repair.c
+++ b/fs/btrfs/read-repair.c
@@ -250,6 +250,18 @@ static int get_prev_mirror(int cur_mirror, int num_copy)
 	     ((bvl = bio_iter_iovec((&bbio->bio), (iter))), 1);	\
 	     bit++, bio_advance_iter_single((&bbio->bio), 	\
 		     &(iter), fs_info->sectorsize))
+static int get_next_mirror(int cur_mirror, int num_copy)
+{
+	/* In the context of read-repair, we never use 0 as mirror_num. */
+	ASSERT(cur_mirror);
+	return (cur_mirror + 1 > num_copy) ? (cur_mirror + 1 - num_copy) :
+		cur_mirror + 1;
+}
+
+static bool bitmap_all_zero(unsigned long *bitmap, int nbits)
+{
+	return (find_first_bit(bitmap, nbits) == nbits);
+}
 
 static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 					struct inode *inode, int mirror)
@@ -350,9 +362,58 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 
 void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 {
+	struct btrfs_fs_info *fs_info;
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	unsigned long *cur_bitmap;
+	unsigned int nbits;
+	u32 sectorsize;
+	int bit = 0;
+	int i;
+
 	if (!ctrl->failed_bio)
 		return;
 
+	ASSERT(ctrl->inode);
+	fs_info = btrfs_sb(ctrl->inode->i_sb);
+	nbits = ctrl->bio_size >> fs_info->sectorsize_bits;
+	sectorsize = fs_info->sectorsize;
+	cur_bitmap = btrfs_bio(ctrl->failed_bio)->read_repair_cur_bitmap;
+
+	/* The original bio has no bitmap allocated, error out. */
+	if (ctrl->error)
+		goto out;
+
+	/* Go through each remaining mirror to do the repair. */
+	for (i = get_next_mirror(ctrl->init_mirror, ctrl->num_copies);
+	     i != ctrl->init_mirror; i = get_next_mirror(i, ctrl->num_copies)) {
+		read_repair_from_one_mirror(ctrl, ctrl->inode, i);
+
+		/* Check the error bitmap to see if any corrupted sectors remain. */
+		if (bitmap_all_zero(cur_bitmap, nbits))
+			break;
+	}
+
+	/* Finish the unrecovered bad sectors */
+	btrfs_bio_for_each_sector(fs_info, bvec, btrfs_bio(ctrl->failed_bio),
+				  iter, bit) {
+		struct extent_io_tree *io_tree = &BTRFS_I(ctrl->inode)->io_tree;
+		const u64 file_offset = (bit << fs_info->sectorsize_bits) +
+					ctrl->file_offset;
+
+		/* Skip good sectors. */
+		if (!test_bit(bit, cur_bitmap))
+			continue;
+
+		btrfs_end_page_read(bvec.bv_page, false, file_offset,
+				    sectorsize);
+		unlock_extent_cached_atomic(io_tree, file_offset,
+				file_offset + sectorsize - 1, NULL);
+	}
+
+out:
+	ASSERT(!ctrl->cur_bio);
+	ASSERT(atomic_read(&ctrl->io_bytes) == 0);
 	kfree(btrfs_bio(ctrl->failed_bio)->read_repair_cur_bitmap);
 	kfree(btrfs_bio(ctrl->failed_bio)->read_repair_prev_bitmap);
 	btrfs_bio(ctrl->failed_bio)->read_repair_cur_bitmap = NULL;
-- 
2.36.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 11/13] btrfs: switch direct IO routine to use btrfs_read_repair_ctrl
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (9 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 10/13] btrfs: switch buffered read to the new read repair routine Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03  6:49 ` [PATCH 12/13] btrfs: remove io_failure_record infrastructure completely Qu Wenruo
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

The framework of btrfs_read_repair_ctrl has already taken DIO into
consideration; we just need to skip all the page status/extent state
updates for DIO.

This patch introduces a new bool member,
btrfs_read_repair_ctrl::is_dio, to indicate if we're doing read repair
for DIO.
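
The repair loops then simply guard the page/extent updates, roughly as
in the hunks below:

	if (!ctrl->is_dio) {
		/* Buffered read: update page status and extent locks. */
		btrfs_end_page_read(bvec.bv_page, true, file_offset,
				    sectorsize);
		set_extent_uptodate(&binode->io_tree, file_offset,
				    file_offset + sectorsize - 1,
				    &cached, GFP_ATOMIC);
		unlock_extent_cached_atomic(&binode->io_tree, file_offset,
				    file_offset + sectorsize - 1, &cached);
	}

For DIO the pages belong to the iomap layer, so btrfs must not touch
their status nor the io_tree based extent locks.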

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c   |  2 +-
 fs/btrfs/inode.c       | 38 ++++++++++++++-----------------------
 fs/btrfs/read-repair.c | 43 +++++++++++++++++++++++++++++++-----------
 fs/btrfs/read-repair.h |  6 ++++--
 4 files changed, 51 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9077f0b06b92..0c04ac711c8d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2802,7 +2802,7 @@ static void submit_data_read_repair(struct btrfs_read_repair_ctrl *ctrl,
 		 */
 		btrfs_read_repair_add_sector(inode, ctrl, failed_bio,
 					     bio_offset + offset,
-					     failed_offset);
+					     failed_offset, false);
 	}
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 24f6f30ea77f..dc3de3d705e2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7817,26 +7817,11 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 	kfree(dip);
 }
 
-static void submit_dio_repair_bio(struct inode *inode, struct bio *bio,
-				  int mirror_num, unsigned long bio_flags)
-{
-	struct btrfs_dio_private *dip = bio->bi_private;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-
-	BUG_ON(bio_op(bio) == REQ_OP_WRITE);
-
-	if (btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA))
-		return;
-
-	refcount_inc(&dip->refs);
-	if (btrfs_map_bio(fs_info, bio, mirror_num))
-		refcount_dec(&dip->refs);
-}
-
 static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 					     struct btrfs_bio *bbio,
 					     const bool uptodate)
 {
+	struct btrfs_read_repair_ctrl ctrl = { 0 };
 	struct inode *inode = dip->inode;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	const u32 sectorsize = fs_info->sectorsize;
@@ -7866,20 +7851,16 @@ static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 						 btrfs_ino(BTRFS_I(inode)),
 						 pgoff);
 			} else {
-				int ret;
-
-				ret = btrfs_repair_one_sector(inode, &bbio->bio,
-						bio_offset, bvec.bv_page, pgoff,
-						start, bbio->mirror_num,
-						submit_dio_repair_bio);
-				if (ret)
-					err = errno_to_blk_status(ret);
+				btrfs_read_repair_add_sector(inode, &ctrl,
+						&bbio->bio, bio_offset, start,
+						true);
 			}
 			ASSERT(bio_offset + sectorsize > bio_offset);
 			bio_offset += sectorsize;
 			pgoff += sectorsize;
 		}
 	}
+	err = errno_to_blk_status(btrfs_read_repair_finish(&ctrl));
 	return err;
 }
 
@@ -8081,6 +8062,15 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 		bio->bi_private = dip;
 		bio->bi_end_io = btrfs_end_dio_bio;
 		btrfs_bio(bio)->file_offset = file_offset;
+		if (bio_op(bio) == REQ_OP_READ) {
+			ret = btrfs_read_repair_alloc_bitmaps(fs_info,
+							      btrfs_bio(bio));
+			if (ret) {
+				status = errno_to_blk_status(ret);
+				bio_put(bio);
+				goto out_err;
+			}
+		}
 
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
 			status = extract_ordered_extent(BTRFS_I(inode), bio,
diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
index 2e03de17c2b8..2e18dfc69bc0 100644
--- a/fs/btrfs/read-repair.c
+++ b/fs/btrfs/read-repair.c
@@ -32,7 +32,7 @@ int btrfs_read_repair_alloc_bitmaps(struct btrfs_fs_info *fs_info,
 void btrfs_read_repair_add_sector(struct inode *inode,
 				  struct btrfs_read_repair_ctrl *ctrl,
 				  struct bio *failed_bio, u32 bio_offset,
-				  u64 file_offset)
+				  u64 file_offset, bool is_dio)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 
@@ -51,6 +51,7 @@ void btrfs_read_repair_add_sector(struct inode *inode,
 		ASSERT(IS_ALIGNED(ctrl->bio_size, fs_info->sectorsize));
 		ctrl->inode = inode;
 		ctrl->init_mirror = btrfs_bio(failed_bio)->mirror_num;
+		ctrl->is_dio = is_dio;
 		/* At endio time, btrfs_bio::mirror should never be 0. */
 		ASSERT(ctrl->init_mirror);
 		ctrl->num_copies = btrfs_num_copies(fs_info, ctrl->logical,
@@ -73,6 +74,7 @@ void btrfs_read_repair_add_sector(struct inode *inode,
 
 	ASSERT(ctrl->inode);
 	ASSERT(ctrl->inode == inode);
+	ASSERT(ctrl->is_dio == is_dio);
 	/*
 	 * The leading half of the bitmap is for current corrupted bits.
 	 * Thus we can just set the bit without any extra offset.
@@ -343,14 +345,19 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 							     ctrl->num_copies),
 					REQ_OP_WRITE);
 
-		/* Update the page status and extent locks. */
-		btrfs_end_page_read(bvec.bv_page, true, file_offset, sectorsize);
-		set_extent_uptodate(&binode->io_tree,
-				file_offset, file_offset + sectorsize - 1,
-				&cached, GFP_ATOMIC);
-		unlock_extent_cached_atomic(&binode->io_tree,
-				file_offset, file_offset + sectorsize - 1,
-				&cached);
+		if (!ctrl->is_dio) {
+			/* Update the page status and extent locks. */
+			btrfs_end_page_read(bvec.bv_page, true, file_offset,
+					    sectorsize);
+			set_extent_uptodate(&binode->io_tree,
+					file_offset,
+					file_offset + sectorsize - 1,
+					&cached, GFP_ATOMIC);
+			unlock_extent_cached_atomic(&binode->io_tree,
+					file_offset,
+					file_offset + sectorsize - 1,
+					&cached);
+		}
 	}
 	/* Submit the last write bio from above loop and wait for them. */
 	if (ctrl->cur_bio)
@@ -360,19 +367,20 @@ static void read_repair_from_one_mirror(struct btrfs_read_repair_ctrl *ctrl,
 	wait_event(ctrl->io_wait, atomic_read(&ctrl->io_bytes) == 0);
 }
 
-void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
+int btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 {
 	struct btrfs_fs_info *fs_info;
 	struct bvec_iter iter;
 	struct bio_vec bvec;
 	unsigned long *cur_bitmap;
 	unsigned int nbits;
+	bool has_error = true;
 	u32 sectorsize;
 	int bit = 0;
 	int i;
 
 	if (!ctrl->failed_bio)
-		return;
+		return 0;
 
 	ASSERT(ctrl->inode);
 	fs_info = btrfs_sb(ctrl->inode->i_sb);
@@ -394,6 +402,10 @@ void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 			break;
 	}
 
+	/* Direct IO should not touch page status nor extent lock, exit now. */
+	if (ctrl->is_dio)
+		goto out;
+
 	/* Finish the unrecovered bad sectors */
 	btrfs_bio_for_each_sector(fs_info, bvec, btrfs_bio(ctrl->failed_bio),
 				  iter, bit) {
@@ -412,6 +424,8 @@ void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 	}
 
 out:
+	if (cur_bitmap)
+		has_error = !bitmap_all_zero(cur_bitmap, nbits);
 	ASSERT(!ctrl->cur_bio);
 	ASSERT(atomic_read(&ctrl->io_bytes) == 0);
 	kfree(btrfs_bio(ctrl->failed_bio)->read_repair_cur_bitmap);
@@ -420,6 +434,13 @@ void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl)
 	btrfs_bio(ctrl->failed_bio)->read_repair_prev_bitmap = NULL;
 	ctrl->error = false;
 	ctrl->failed_bio = NULL;
+	/*
+	 * The return value is only for Direct IO to indicate if we have
+	 * any corrupted sectors left unrepaired.
+	 */
+	if (has_error)
+		return -EIO;
+	return 0;
 }
 
 void __cold btrfs_read_repair_exit(void)
diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
index 6cc816e2ce4a..fd0117295bc5 100644
--- a/fs/btrfs/read-repair.h
+++ b/fs/btrfs/read-repair.h
@@ -44,6 +44,8 @@ struct btrfs_read_repair_ctrl {
 	bool error;
 
 	bool is_raid56;
+
+	bool is_dio;
 };
 
 int btrfs_read_repair_alloc_bitmaps(struct btrfs_fs_info *fs_info,
@@ -75,8 +77,8 @@ struct btrfs_read_repair_bio {
 void btrfs_read_repair_add_sector(struct inode *inode,
 				  struct btrfs_read_repair_ctrl *ctrl,
 				  struct bio *failed_bio, u32 bio_offset,
-				  u64 file_offset);
-void btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl);
+				  u64 file_offset, bool is_dio);
+int btrfs_read_repair_finish(struct btrfs_read_repair_ctrl *ctrl);
 int __init btrfs_read_repair_init(void);
 void __cold btrfs_read_repair_exit(void);
 #endif
-- 
2.36.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 12/13] btrfs: remove io_failure_record infrastructure completely
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (10 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 11/13] btrfs: switch direct IO routine to use btrfs_read_repair_ctrl Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03  6:49 ` [PATCH 13/13] btrfs: remove btrfs_inode::io_failure_tree Qu Wenruo
  2022-05-12 17:08 ` [PATCH 00/13] btrfs: make read repair work in synchronous mode David Sterba
  13 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

Our read repair is now always handled by btrfs_read_repair_ctrl, whose
lifespan is confined to the endio function.

This means we no longer need to record which range failed and on which
mirror.

Now if we fail to read some data page, we have already tried every
mirror we have, thus there is no need to record the failed range.

Thus this patch removes the whole io_failure_record structure and its
related functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-io-tree.h |  14 --
 fs/btrfs/extent_io.c      | 374 --------------------------------------
 fs/btrfs/extent_io.h      |  24 ---
 fs/btrfs/inode.c          |  20 +-
 4 files changed, 4 insertions(+), 428 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index c3eb52dbe61c..d46c064e5dad 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -250,18 +250,4 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 			       u64 *end, u64 max_bytes,
 			       struct extent_state **cached_state);
 
-/* This should be reworked in the future and put elsewhere. */
-struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start);
-int set_state_failrec(struct extent_io_tree *tree, u64 start,
-		      struct io_failure_record *failrec);
-void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start,
-		u64 end);
-int free_io_failure(struct extent_io_tree *failure_tree,
-		    struct extent_io_tree *io_tree,
-		    struct io_failure_record *rec);
-int clean_io_failure(struct btrfs_fs_info *fs_info,
-		     struct extent_io_tree *failure_tree,
-		     struct extent_io_tree *io_tree, u64 start,
-		     struct page *page, u64 ino, unsigned int pg_offset);
-
 #endif /* BTRFS_EXTENT_IO_TREE_H */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0c04ac711c8d..42697cc7a278 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2175,66 +2175,6 @@ u64 count_range_bits(struct extent_io_tree *tree,
 	return total_bytes;
 }
 
-/*
- * set the private field for a given byte offset in the tree.  If there isn't
- * an extent_state there already, this does nothing.
- */
-int set_state_failrec(struct extent_io_tree *tree, u64 start,
-		      struct io_failure_record *failrec)
-{
-	struct rb_node *node;
-	struct extent_state *state;
-	int ret = 0;
-
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state->failrec = failrec;
-out:
-	spin_unlock(&tree->lock);
-	return ret;
-}
-
-struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start)
-{
-	struct rb_node *node;
-	struct extent_state *state;
-	struct io_failure_record *failrec;
-
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		failrec = ERR_PTR(-ENOENT);
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		failrec = ERR_PTR(-ENOENT);
-		goto out;
-	}
-
-	failrec = state->failrec;
-out:
-	spin_unlock(&tree->lock);
-	return failrec;
-}
-
 /*
  * searches a range in the state tree for a given mask.
  * If 'filled' == 1, this returns 1 only if every extent in the tree
@@ -2291,30 +2231,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	return bitset;
 }
 
-int free_io_failure(struct extent_io_tree *failure_tree,
-		    struct extent_io_tree *io_tree,
-		    struct io_failure_record *rec)
-{
-	int ret;
-	int err = 0;
-
-	set_state_failrec(failure_tree, rec->start, NULL);
-	ret = clear_extent_bits(failure_tree, rec->start,
-				rec->start + rec->len - 1,
-				EXTENT_LOCKED | EXTENT_DIRTY);
-	if (ret)
-		err = ret;
-
-	ret = clear_extent_bits(io_tree, rec->start,
-				rec->start + rec->len - 1,
-				EXTENT_DAMAGED);
-	if (ret && !err)
-		err = ret;
-
-	kfree(rec);
-	return err;
-}
-
 /*
  * this bypasses the standard btrfs submit functions deliberately, as
  * the standard behavior is to write all copies in a raid setup. here we only
@@ -2430,288 +2346,6 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
 	return ret;
 }
 
-/*
- * each time an IO finishes, we do a fast check in the IO failure tree
- * to see if we need to process or clean up an io_failure_record
- */
-int clean_io_failure(struct btrfs_fs_info *fs_info,
-		     struct extent_io_tree *failure_tree,
-		     struct extent_io_tree *io_tree, u64 start,
-		     struct page *page, u64 ino, unsigned int pg_offset)
-{
-	u64 private;
-	struct io_failure_record *failrec;
-	struct extent_state *state;
-	int num_copies;
-	int ret;
-
-	private = 0;
-	ret = count_range_bits(failure_tree, &private, (u64)-1, 1,
-			       EXTENT_DIRTY, 0);
-	if (!ret)
-		return 0;
-
-	failrec = get_state_failrec(failure_tree, start);
-	if (IS_ERR(failrec))
-		return 0;
-
-	BUG_ON(!failrec->this_mirror);
-
-	if (sb_rdonly(fs_info->sb))
-		goto out;
-
-	spin_lock(&io_tree->lock);
-	state = find_first_extent_bit_state(io_tree,
-					    failrec->start,
-					    EXTENT_LOCKED);
-	spin_unlock(&io_tree->lock);
-
-	if (state && state->start <= failrec->start &&
-	    state->end >= failrec->start + failrec->len - 1) {
-		num_copies = btrfs_num_copies(fs_info, failrec->logical,
-					      failrec->len);
-		if (num_copies > 1)  {
-			btrfs_repair_io_failure(fs_info, ino, start,
-					failrec->len, failrec->logical,
-					page, pg_offset, failrec->failed_mirror);
-		}
-	}
-
-out:
-	free_io_failure(failure_tree, io_tree, failrec);
-
-	return 0;
-}
-
-/*
- * Can be called when
- * - hold extent lock
- * - under ordered extent
- * - the inode is freeing
- */
-void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
-{
-	struct extent_io_tree *failure_tree = &inode->io_failure_tree;
-	struct io_failure_record *failrec;
-	struct extent_state *state, *next;
-
-	if (RB_EMPTY_ROOT(&failure_tree->state))
-		return;
-
-	spin_lock(&failure_tree->lock);
-	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
-	while (state) {
-		if (state->start > end)
-			break;
-
-		ASSERT(state->end <= end);
-
-		next = next_state(state);
-
-		failrec = state->failrec;
-		free_extent_state(state);
-		kfree(failrec);
-
-		state = next;
-	}
-	spin_unlock(&failure_tree->lock);
-}
-
-static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
-							     u64 start)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct io_failure_record *failrec;
-	struct extent_map *em;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
-	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	const u32 sectorsize = fs_info->sectorsize;
-	int ret;
-	u64 logical;
-
-	failrec = get_state_failrec(failure_tree, start);
-	if (!IS_ERR(failrec)) {
-		btrfs_debug(fs_info,
-	"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu",
-			failrec->logical, failrec->start, failrec->len);
-		/*
-		 * when data can be on disk more than twice, add to failrec here
-		 * (e.g. with a list for failed_mirror) to make
-		 * clean_io_failure() clean all those errors at once.
-		 */
-
-		return failrec;
-	}
-
-	failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
-	if (!failrec)
-		return ERR_PTR(-ENOMEM);
-
-	failrec->start = start;
-	failrec->len = sectorsize;
-	failrec->this_mirror = 0;
-	failrec->bio_flags = 0;
-
-	read_lock(&em_tree->lock);
-	em = lookup_extent_mapping(em_tree, start, failrec->len);
-	if (!em) {
-		read_unlock(&em_tree->lock);
-		kfree(failrec);
-		return ERR_PTR(-EIO);
-	}
-
-	if (em->start > start || em->start + em->len <= start) {
-		free_extent_map(em);
-		em = NULL;
-	}
-	read_unlock(&em_tree->lock);
-	if (!em) {
-		kfree(failrec);
-		return ERR_PTR(-EIO);
-	}
-
-	logical = start - em->start;
-	logical = em->block_start + logical;
-	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
-		logical = em->block_start;
-		failrec->bio_flags = EXTENT_BIO_COMPRESSED;
-		extent_set_compress_type(&failrec->bio_flags, em->compress_type);
-	}
-
-	btrfs_debug(fs_info,
-		    "Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu",
-		    logical, start, failrec->len);
-
-	failrec->logical = logical;
-	free_extent_map(em);
-
-	/* Set the bits in the private failure tree */
-	ret = set_extent_bits(failure_tree, start, start + sectorsize - 1,
-			      EXTENT_LOCKED | EXTENT_DIRTY);
-	if (ret >= 0) {
-		ret = set_state_failrec(failure_tree, start, failrec);
-		/* Set the bits in the inode's tree */
-		ret = set_extent_bits(tree, start, start + sectorsize - 1,
-				      EXTENT_DAMAGED);
-	} else if (ret < 0) {
-		kfree(failrec);
-		return ERR_PTR(ret);
-	}
-
-	return failrec;
-}
-
-static bool btrfs_check_repairable(struct inode *inode,
-				   struct io_failure_record *failrec,
-				   int failed_mirror)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	int num_copies;
-
-	num_copies = btrfs_num_copies(fs_info, failrec->logical, failrec->len);
-	if (num_copies == 1) {
-		/*
-		 * we only have a single copy of the data, so don't bother with
-		 * all the retry and error correction code that follows. no
-		 * matter what the error is, it is very likely to persist.
-		 */
-		btrfs_debug(fs_info,
-			"Check Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return false;
-	}
-
-	/* The failure record should only contain one sector */
-	ASSERT(failrec->len == fs_info->sectorsize);
-
-	/*
-	 * There are two premises:
-	 * a) deliver good data to the caller
-	 * b) correct the bad sectors on disk
-	 *
-	 * Since we're only doing repair for one sector, we only need to get
-	 * a good copy of the failed sector and if we succeed, we have setup
-	 * everything for btrfs_repair_io_failure() to do the rest for us.
-	 */
-	ASSERT(failed_mirror);
-	failrec->failed_mirror = failed_mirror;
-	failrec->this_mirror++;
-	if (failrec->this_mirror == failed_mirror)
-		failrec->this_mirror++;
-
-	if (failrec->this_mirror > num_copies) {
-		btrfs_debug(fs_info,
-			"Check Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return false;
-	}
-
-	return true;
-}
-
-int btrfs_repair_one_sector(struct inode *inode,
-			    struct bio *failed_bio, u32 bio_offset,
-			    struct page *page, unsigned int pgoff,
-			    u64 start, int failed_mirror,
-			    submit_bio_hook_t *submit_bio_hook)
-{
-	struct io_failure_record *failrec;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct btrfs_bio *failed_bbio = btrfs_bio(failed_bio);
-	const int icsum = bio_offset >> fs_info->sectorsize_bits;
-	struct bio *repair_bio;
-	struct btrfs_bio *repair_bbio;
-
-	btrfs_debug(fs_info,
-		   "repair read error: read error at %llu", start);
-
-	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
-
-	failrec = btrfs_get_io_failure_record(inode, start);
-	if (IS_ERR(failrec))
-		return PTR_ERR(failrec);
-
-
-	if (!btrfs_check_repairable(inode, failrec, failed_mirror)) {
-		free_io_failure(failure_tree, tree, failrec);
-		return -EIO;
-	}
-
-	repair_bio = btrfs_bio_alloc(1);
-	repair_bbio = btrfs_bio(repair_bio);
-	repair_bbio->file_offset = start;
-	repair_bio->bi_opf = REQ_OP_READ;
-	repair_bio->bi_end_io = failed_bio->bi_end_io;
-	repair_bio->bi_iter.bi_sector = failrec->logical >> 9;
-	repair_bio->bi_private = failed_bio->bi_private;
-
-	if (failed_bbio->csum) {
-		const u32 csum_size = fs_info->csum_size;
-
-		repair_bbio->csum = repair_bbio->csum_inline;
-		memcpy(repair_bbio->csum,
-		       failed_bbio->csum + csum_size * icsum, csum_size);
-	}
-
-	bio_add_page(repair_bio, page, failrec->len, pgoff);
-	repair_bbio->iter = repair_bio->bi_iter;
-
-	btrfs_debug(btrfs_sb(inode->i_sb),
-		    "repair read error: submitting new read to mirror %d",
-		    failrec->this_mirror);
-
-	/*
-	 * At this point we have a bio, so any errors from submit_bio_hook()
-	 * will be handled by the endio on the repair_bio, so we can't return an
-	 * error here.
-	 */
-	submit_bio_hook(inode, repair_bio, failrec->this_mirror, failrec->bio_flags);
-	return BLK_STS_OK;
-}
-
 void btrfs_end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
@@ -3011,7 +2645,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 	struct btrfs_read_repair_ctrl ctrl = { 0 };
 	struct bio_vec *bvec;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
-	struct extent_io_tree *tree, *failure_tree;
 	struct processed_extent processed = { 0 };
 	/*
 	 * The offset to the beginning of a bio, since one bio can never be
@@ -3038,8 +2671,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 			"end_bio_extent_readpage: bi_sector=%llu, err=%d, mirror=%u",
 			bio->bi_iter.bi_sector, bio->bi_status,
 			bbio->mirror_num);
-		tree = &BTRFS_I(inode)->io_tree;
-		failure_tree = &BTRFS_I(inode)->io_failure_tree;
 
 		/*
 		 * We always issue full-sector reads, but if some block in a
@@ -3074,11 +2705,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 			}
 			if (ret)
 				uptodate = false;
-			else
-				clean_io_failure(BTRFS_I(inode)->root->fs_info,
-						 failure_tree, tree, start,
-						 page,
-						 btrfs_ino(BTRFS_I(inode)), 0);
 		}
 
 		if (likely(uptodate))
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 1cf9c0bdbbdf..dc69ea45e955 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -68,7 +68,6 @@ struct btrfs_root;
 struct btrfs_inode;
 struct btrfs_io_bio;
 struct btrfs_fs_info;
-struct io_failure_record;
 struct extent_io_tree;
 
 typedef void (submit_bio_hook_t)(struct inode *inode, struct bio *bio,
@@ -271,29 +270,6 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size);
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num);
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the sector is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-	struct page *page;
-	u64 start;
-	u64 len;
-	u64 logical;
-	unsigned long bio_flags;
-	int this_mirror;
-	int failed_mirror;
-};
-
-int btrfs_repair_one_sector(struct inode *inode,
-			    struct bio *failed_bio, u32 bio_offset,
-			    struct page *page, unsigned int pgoff,
-			    u64 start, int failed_mirror,
-			    submit_bio_hook_t *submit_bio_hook);
 int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 			    u64 length, u64 logical, struct page *page,
 			    unsigned int pg_offset, int mirror_num);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index dc3de3d705e2..9c4cad0f4aee 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3128,8 +3128,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 					ordered_extent->disk_num_bytes);
 	}
 
-	btrfs_free_io_failure_record(inode, start, end);
-
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
 		logical_len = ordered_extent->truncated_len;
@@ -5343,8 +5341,6 @@ void btrfs_evict_inode(struct inode *inode)
 	if (is_bad_inode(inode))
 		goto no_delete;
 
-	btrfs_free_io_failure_record(BTRFS_I(inode), 0, (u64)-1);
-
 	if (test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags))
 		goto no_delete;
 
@@ -7825,8 +7821,6 @@ static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 	struct inode *inode = dip->inode;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	const u32 sectorsize = fs_info->sectorsize;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
 	struct bio_vec bvec;
 	struct bvec_iter iter;
@@ -7842,19 +7836,13 @@ static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 			u64 start = bbio->file_offset + bio_offset;
 
 			ASSERT(pgoff < PAGE_SIZE);
-			if (uptodate &&
-			    (!csum || !check_data_csum(inode, bbio,
-						       bio_offset, bvec.bv_page,
-						       pgoff, start))) {
-				clean_io_failure(fs_info, failure_tree, io_tree,
-						 start, bvec.bv_page,
-						 btrfs_ino(BTRFS_I(inode)),
-						 pgoff);
-			} else {
+			/* Either we hit a read failure or the csum mismatches. */
+			if (!uptodate || (csum && check_data_csum(inode, bbio,
+							bio_offset, bvec.bv_page,
+							pgoff, start)))
 				btrfs_read_repair_add_sector(inode, &ctrl,
 						&bbio->bio, bio_offset, start,
 						true);
-			}
 			ASSERT(bio_offset + sectorsize > bio_offset);
 			bio_offset += sectorsize;
 			pgoff += sectorsize;
-- 
2.36.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 13/13] btrfs: remove btrfs_inode::io_failure_tree
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (11 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 12/13] btrfs: remove io_failure_record infrastructure completely Qu Wenruo
@ 2022-05-03  6:49 ` Qu Wenruo
  2022-05-03 15:07   ` Christoph Hellwig
  2022-05-12 17:08 ` [PATCH 00/13] btrfs: make read repair work in synchronous mode David Sterba
  13 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2022-05-03  6:49 UTC (permalink / raw)
  To: linux-btrfs

Since we're handling all data read errors inside the endio function,
there is no need to record any error in io_failure_tree.

Let's remove btrfs_inode::io_failure_tree completely.

Although we have extra memory usage for btrfs_read_repair_ctrl, its
lifespan is limited to the endio function, while io_failure_tree lives
as long as each inode.
Thus this should bring an overall memory usage reduction.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/btrfs_inode.h       | 5 -----
 fs/btrfs/extent-io-tree.h    | 1 -
 fs/btrfs/inode.c             | 3 ---
 include/trace/events/btrfs.h | 1 -
 4 files changed, 10 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 32131a5d321b..2ff4da4ed2a1 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -91,11 +91,6 @@ struct btrfs_inode {
 	/* the io_tree does range state (DIRTY, LOCKED etc) */
 	struct extent_io_tree io_tree;
 
-	/* special utility tree used to record which mirrors have already been
-	 * tried when checksums fail for a given block
-	 */
-	struct extent_io_tree io_failure_tree;
-
 	/*
 	 * Keep track of where the inode has extent items mapped in order to
 	 * make sure the i_size adjustments are accurate
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index d46c064e5dad..8ab9b6cd53ed 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -56,7 +56,6 @@ enum {
 	IO_TREE_FS_EXCLUDED_EXTENTS,
 	IO_TREE_BTREE_INODE_IO,
 	IO_TREE_INODE_IO,
-	IO_TREE_INODE_IO_FAILURE,
 	IO_TREE_RELOC_BLOCKS,
 	IO_TREE_TRANS_DIRTY_PAGES,
 	IO_TREE_ROOT_DIRTY_LOG_PAGES,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9c4cad0f4aee..31c89e707e9b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8833,12 +8833,9 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	inode = &ei->vfs_inode;
 	extent_map_tree_init(&ei->extent_tree);
 	extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO, inode);
-	extent_io_tree_init(fs_info, &ei->io_failure_tree,
-			    IO_TREE_INODE_IO_FAILURE, inode);
 	extent_io_tree_init(fs_info, &ei->file_extent_tree,
 			    IO_TREE_INODE_FILE_EXTENT, inode);
 	ei->io_tree.track_uptodate = true;
-	ei->io_failure_tree.track_uptodate = true;
 	atomic_set(&ei->sync_writers, 0);
 	mutex_init(&ei->log_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index f068ff30d654..020ca1f7687a 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -82,7 +82,6 @@ struct btrfs_space_info;
 	EM( IO_TREE_FS_EXCLUDED_EXTENTS,  "EXCLUDED_EXTENTS")	    \
 	EM( IO_TREE_BTREE_INODE_IO,	  "BTREE_INODE_IO")	    \
 	EM( IO_TREE_INODE_IO,		  "INODE_IO")		    \
-	EM( IO_TREE_INODE_IO_FAILURE,	  "INODE_IO_FAILURE")	    \
 	EM( IO_TREE_RELOC_BLOCKS,	  "RELOC_BLOCKS")	    \
 	EM( IO_TREE_TRANS_DIRTY_PAGES,	  "TRANS_DIRTY_PAGES")      \
 	EM( IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES")   \
-- 
2.36.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 01/13] btrfs: introduce a pure data checksum checking helper
  2022-05-03  6:49 ` [PATCH 01/13] btrfs: introduce a pure data checksum checking helper Qu Wenruo
@ 2022-05-03 15:03   ` Christoph Hellwig
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2022-05-03 15:03 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

> +/*
> + * Verify the checksum for a single sector without any extra action that
> + * depend on the type of I/O.
> + */
> +int btrfs_check_data_sector(struct btrfs_fs_info *fs_info, struct page *page,
> +			    u32 pgoff, u8 *csum_expected)
> +{
> +	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
> +	char *kaddr;
> +	const u32 len = fs_info->sectorsize;
> +	const u32 csum_size = fs_info->csum_size;
> +	u8 csum[BTRFS_CSUM_SIZE];
> +
> +	ASSERT(pgoff + len <= PAGE_SIZE);
> +
> +	kaddr = kmap_local_page(page) + pgoff;
> +	shash->tfm = fs_info->csum_shash;
> +
> +	crypto_shash_digest(shash, kaddr, len, csum);
> +	kunmap_local(kaddr);

Nit: the ->tfm assignment can move out of the kmap section, e.g. (same
calls as the quoted hunk, just reordered):
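
	shash->tfm = fs_info->csum_shash;

	kaddr = kmap_local_page(page) + pgoff;
	crypto_shash_digest(shash, kaddr, len, csum);
	kunmap_local(kaddr);

Otherwise this looks good: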

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-03  6:49 ` [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors Qu Wenruo
@ 2022-05-03 15:06   ` Christoph Hellwig
  2022-05-04  1:12     ` Qu Wenruo
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2022-05-03 15:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, May 03, 2022 at 02:49:49PM +0800, Qu Wenruo wrote:
> - For the extra bitmaps
>   We use btrfs_bio::read_repair_{cur|prev}_bitmap.
>   This will add extra 16 bytes to btrfs_bio.
> 
>   This is unavoidable, or we need to allocate extra memory at endio
>   function which can be dead lock prone.
> 
>   Furthermore, pre-allocating bitmap has a hidden benefit, if we're
>   submitting a large read and failed to allocated enough memory for
>   bitmap, we fail the bio, and VFS layer will automatically retry
>   with smaller read range, giving us a higher chance to succeed in
>   next try.

This is a really bad idea.  Now every read needs to pay the quite
large price for corner case of a read repair.  As I mentioned last time
I think a mempool with a few entries that any read repair can dip into
is a much better choice here.
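
Something along these lines, as a rough sketch (all names made up):

	/* Enough for a 4M read bio at 4K sectorsize: 1024 bits. */
	#define RR_BITMAP_BYTES		128

	static mempool_t *rr_bitmap_pool;

	static int __init rr_pool_init(void)
	{
		rr_bitmap_pool = mempool_create_kmalloc_pool(4,
							RR_BITMAP_BYTES);
		return rr_bitmap_pool ? 0 : -ENOMEM;
	}

	/* In the repair path, instead of a bare kmalloc(): */
	unsigned long *bitmap = mempool_alloc(rr_bitmap_pool, GFP_NOFS);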

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/13] btrfs: add a helper to queue a corrupted sector for read repair
  2022-05-03  6:49 ` [PATCH 06/13] btrfs: add a helper to queue a corrupted sector for read repair Qu Wenruo
@ 2022-05-03 15:07   ` Christoph Hellwig
  2022-05-04  1:13     ` Qu Wenruo
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2022-05-03 15:07 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

This adds an unused static function and thus doesn't even compile.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 13/13] btrfs: remove btrfs_inode::io_failure_tree
  2022-05-03  6:49 ` [PATCH 13/13] btrfs: remove btrfs_inode::io_failure_tree Qu Wenruo
@ 2022-05-03 15:07   ` Christoph Hellwig
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2022-05-03 15:07 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Please fold this into the previous patch that removes all uses of the
tree.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-03 15:06   ` Christoph Hellwig
@ 2022-05-04  1:12     ` Qu Wenruo
  2022-05-04 14:05       ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2022-05-04  1:12 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/5/3 23:06, Christoph Hellwig wrote:
> On Tue, May 03, 2022 at 02:49:49PM +0800, Qu Wenruo wrote:
>> - For the extra bitmaps
>>    We use btrfs_bio::read_repair_{cur|prev}_bitmap.
>>    This will add extra 16 bytes to btrfs_bio.
>>
>>    This is unavoidable, or we need to allocate extra memory at endio
>>    function which can be dead lock prone.
>>
>>    Furthermore, pre-allocating bitmap has a hidden benefit, if we're
>>    submitting a large read and failed to allocated enough memory for
>>    bitmap, we fail the bio, and VFS layer will automatically retry
>>    with smaller read range, giving us a higher chance to succeed in
>>    next try.
>
> This is a really bad idea.  Now every read needs to pay the quite
> large price for corner case of a read repair.  As I mentioned last time
> I think a mempool with a few entries that any read repair can dip into
> is a much better choice here.

The problem is, can a mempool provide variable-length entries?

Thanks,
Qu

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/13] btrfs: add a helper to queue a corrupted sector for read repair
  2022-05-03 15:07   ` Christoph Hellwig
@ 2022-05-04  1:13     ` Qu Wenruo
  2022-05-04 14:06       ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2022-05-04  1:13 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/5/3 23:07, Christoph Hellwig wrote:
> This adds an ununused static function and thus doesn't even compile.

It's just a warning; the code still compiles.

Otherwise we would have to have a single patch with over 400 lines.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-04  1:12     ` Qu Wenruo
@ 2022-05-04 14:05       ` Christoph Hellwig
  2022-05-04 22:40         ` Qu Wenruo
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2022-05-04 14:05 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Wed, May 04, 2022 at 09:12:50AM +0800, Qu Wenruo wrote:
> > This is a really bad idea.  Now every read needs to pay the quite
> > large price for corner case of a read repair.  As I mentioned last time
> > I think a mempool with a few entries that any read repair can dip into
> > is a much better choice here.
> 
> The problem is, can mempool provide a variable length entry?

It can't.  But you can allocate a few different bucket sizes as a
starter or do a multi-level setup where you don't have a bitmap that
operates on the entire bio.
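
E.g. (purely illustrative):

	static mempool_t *pools[3];	/* say 16, 64 and 256 byte bitmaps */

	static mempool_t *pick_pool(unsigned int bitmap_bytes)
	{
		if (bitmap_bytes <= 16)
			return pools[0];
		if (bitmap_bytes <= 64)
			return pools[1];
		return pools[2];	/* bigger bios get repaired in chunks */
	}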

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/13] btrfs: add a helper to queue a corrupted sector for read repair
  2022-05-04  1:13     ` Qu Wenruo
@ 2022-05-04 14:06       ` Christoph Hellwig
  2022-05-12 17:20         ` David Sterba
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2022-05-04 14:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Wed, May 04, 2022 at 09:13:43AM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/5/3 23:07, Christoph Hellwig wrote:
> > This adds an ununused static function and thus doesn't even compile.
> 
> It's just a warning and can pass the compile.
> 
> Or we have to have a patch with over 400 lines.

The latter is the only thing that makes sense.  Patches that are not
self contained also can't be reviewed self contained.  A larger patch
is much better than a random split.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-04 14:05       ` Christoph Hellwig
@ 2022-05-04 22:40         ` Qu Wenruo
  2022-05-12 17:16           ` David Sterba
  0 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2022-05-04 22:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs



On 2022/5/4 22:05, Christoph Hellwig wrote:
> On Wed, May 04, 2022 at 09:12:50AM +0800, Qu Wenruo wrote:
>>> This is a really bad idea.  Now every read needs to pay the quite
>>> large price for the corner case of a read repair.  As I mentioned last time
>>> I think a mempool with a few entries that any read repair can dip into
>>> is a much better choice here.
>>
>> The problem is, can a mempool provide a variable-length entry?
>
> It can't.  But you can allocate a few different bucket sizes as a
> starter or do a multi-level setup where you don't have a bitmap that
> operates on the entire bio.

That sounds way more complex than either the original method (just fail
to repair) or the current pre-allocation method.

We need to cover a pretty wide range of lengths, from the minimal 4K
(needing only one bit) to much larger ones (I have observed bios over
16M, but only for writes).

It would make sense if we had a hard limit on how large a read bio can
be (the biggest read bio I have observed is 4M; if limited to 4M, we
would only need 128 bytes for the bitmap and could use a mempool).
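
(For reference, the arithmetic behind that assumed 4M cap, written out as
kernel-style constants; the names are made up for illustration:)

/* Assuming a hard 4 MiB cap on read bios and 4 KiB sectors: */
#define MAX_READ_BIO_SIZE	(4U << 20)	/* 4 MiB */
#define REPAIR_SECTORSIZE	(4U << 10)	/* 4 KiB */
/* 1024 sectors -> 1024 bits -> 128 bytes, a fixed-size mempool entry */
#define REPAIR_BITMAP_BYTES \
	(MAX_READ_BIO_SIZE / REPAIR_SECTORSIZE / BITS_PER_BYTE)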


In fact, the original approach (just error out if the memory allocation
fails) is way more robust than you think.

The point here is, if a large read bio fails, the VFS will retry with a
much smaller block size (normally just one page or one block), and if we
can't even allocate memory for a u32, we're really screwed anyway.

Do we really need to go down the rabbit hole of using a mempool for a
variable-length situation?  Especially since we're not that desperate to
guarantee the memory allocation.

Thanks,
Qu


* Re: [PATCH 00/13] btrfs: make read repair work in synchronous mode
  2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
                   ` (12 preceding siblings ...)
  2022-05-03  6:49 ` [PATCH 13/13] btrfs: remove btrfs_inode::io_failure_tree Qu Wenruo
@ 2022-05-12 17:08 ` David Sterba
  2022-05-12 23:01   ` Qu Wenruo
  13 siblings, 1 reply; 33+ messages in thread
From: David Sterba @ 2022-05-12 17:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, May 03, 2022 at 02:49:44PM +0800, Qu Wenruo wrote:
 
> With this we can:
> - Remove the re-entry behavior of endio function
>   Now everything is handled inside end_bio_extent_readpage().
> 
> - Remove the io_failure_tree completely
>   As we don't need to record which mirror has failed.
> 
> - Slightly reduced overhead on read repair
>   Now we won't call btrfs_map_bio() for each corrupted sector, as we can
>   merge the sectors into a much larger bio.

I take these as the summary points for the whole patchset, and frankly I
don't see it justified given all the new problems with the preallocation
and the shuffling of the structures.  What we have now is not perfect and I
would like to get rid of the io_failure_tree too, yet it works.

The main point is probably to stop reentering the functions, though the
idea of having a single tree that tracks the repair state, with callbacks
dealing with it, is IMHO not a bad design.  The alternative is to
complicate the data structures and the endio handler.  I'm not sure it's
worth the risk.


* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-04 22:40         ` Qu Wenruo
@ 2022-05-12 17:16           ` David Sterba
  2022-05-13 10:33             ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: David Sterba @ 2022-05-12 17:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Thu, May 05, 2022 at 06:40:31AM +0800, Qu Wenruo wrote:
> On 2022/5/4 22:05, Christoph Hellwig wrote:
> > On Wed, May 04, 2022 at 09:12:50AM +0800, Qu Wenruo wrote:
> >>> This is a really bad idea.  Now every read needs to pay the quite
> >>> large price for the corner case of a read repair.  As I mentioned last time
> >>> I think a mempool with a few entries that any read repair can dip into
> >>> is a much better choice here.
> >>
> >> The problem is, can a mempool provide a variable-length entry?
> >
> > It can't.  But you can allocate a few different bucket sizes as a
> > starter or do a multi-level setup where you don't have a bitmap that
> > operates on the entire bio.
> 
> That sounds way more complex than either the original method (just fail
> to repair) or the current pre-allocation method.
> 
> We need to cover a pretty wide range of lengths, from the minimal 4K
> (needing only one bit) to much larger ones (I have observed bios over
> 16M, but only for writes).
> 
> It would make sense if we had a hard limit on how large a read bio can
> be (the biggest read bio I have observed is 4M; if limited to 4M, we
> would only need 128 bytes for the bitmap and could use a mempool).
> 
> 
> In fact, the original approach (just error out if the memory allocation
> fails) is way more robust than you think.
> 
> The point here is, if a large read bio fails, the VFS will retry with a
> much smaller block size (normally just one page or one block), and if we
> can't even allocate memory for a u32, we're really screwed anyway.
> 
> Do we really need to go down the rabbit hole of using a mempool for a
> variable-length situation?  Especially since we're not that desperate to
> guarantee the memory allocation.

I'm reluctant to use mempools in the filesystem at any layer other than
for bios, and then preferably wrapped under the biosets.  Anywhere else
they create the potential for high-level deadlocks between various
threads.  In this particular case of repair it does not seem that risky,
as the only user of the mempool would be the repair submission, and it
does only that, i.e. no interaction with other threads.

I think a mempool is too big a hammer in this case; we could have a
per-device preallocated chunk of memory for repair purposes and manage
it with a lock and/or wait queue.  There would be only a linear
dependency: request (use) -> io -> wait -> endio (return).  If the repair
resource is not available, any other request waits.  Effectively what
mempools do, but much simpler.
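
A minimal sketch of that per-device resource, assuming hypothetical
names (nothing below is from the series):

struct repair_resource {
	unsigned long *bitmap;	/* preallocated at mount time */
	unsigned long flags;	/* bit 0: resource in use */
	wait_queue_head_t wait;	/* init_waitqueue_head() at mount */
};

/* Called from the read-repair submission path; may sleep until free. */
static unsigned long *repair_resource_get(struct repair_resource *rr)
{
	wait_event(rr->wait, !test_and_set_bit(0, &rr->flags));
	return rr->bitmap;
}

/* Called from endio once the repair has finished. */
static void repair_resource_put(struct repair_resource *rr)
{
	clear_bit(0, &rr->flags);
	wake_up(&rr->wait);
}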


* Re: [PATCH 06/13] btrfs: add a helper to queue a corrupted sector for read repair
  2022-05-04 14:06       ` Christoph Hellwig
@ 2022-05-12 17:20         ` David Sterba
  0 siblings, 0 replies; 33+ messages in thread
From: David Sterba @ 2022-05-12 17:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, Qu Wenruo, linux-btrfs

On Wed, May 04, 2022 at 07:06:20AM -0700, Christoph Hellwig wrote:
> On Wed, May 04, 2022 at 09:13:43AM +0800, Qu Wenruo wrote:
> > On 2022/5/3 23:07, Christoph Hellwig wrote:
> > > This adds an unused static function and thus doesn't even compile.
> > 
> > It's just a warning, and the code still compiles.
> > 
> > Otherwise we'd have a patch with over 400 lines.
> 
> The latter is the only thing that makes sense.  Patches that are not
> self-contained also can't be reviewed self-contained.  A larger patch
> is much better than a random split.

We've been doing it the way where a complex function is in a separate
patch and its usage in another one.  Yes, there are different preferences
and opinions on that.  It also depends on how one does the review, either
in the mails or in the code.  For me the split is fine, as by the time
I'm at the second patch the function already exists in the file.  In the
mails it's "how does it work" versus "how is it used", so these can be
seen as two different things, thus two patches.  As an exception, a short
function in the same patch as its use makes sense; for bigger ones I'm
for the patch split.


* Re: [PATCH 00/13] btrfs: make read repair work in synchronous mode
  2022-05-12 17:08 ` [PATCH 00/13] btrfs: make read repair work in synchronous mode David Sterba
@ 2022-05-12 23:01   ` Qu Wenruo
  0 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-12 23:01 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2022/5/13 01:08, David Sterba wrote:
> On Tue, May 03, 2022 at 02:49:44PM +0800, Qu Wenruo wrote:
>
>> With this we can:
>> - Remove the re-entry behavior of endio function
>>    Now everything is handled inside end_bio_extent_readpage().
>>
>> - Remove the io_failure_tree completely
>>    As we don't need to record which mirror has failed.
>>
>> - Slightly reduced overhead on read repair
>>    Now we won't call btrfs_map_bio() for each corrupted sector, as we can
>>    merge the sectors into a much larger bio.
>
> I take these as the summary points for the whole patchset, and frankly I
> don't see it justified given all the new problems with the preallocation
> and the shuffling of the structures.  What we have now is not perfect and I
> would like to get rid of the io_failure_tree too, yet it works.

Indeed I have no better solution for the bitmap.

I have one very limited optimization for it: if we hit profiles without
any extra mirrors (like RAID0/SINGLE), we skip the preallocation and just
error out on corruption, as in the sketch below.
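
A rough sketch of that short-circuit, reusing the existing
btrfs_num_copies() helper; the function itself and the two bitmap
members are made up for illustration:

static int alloc_repair_bitmaps(struct btrfs_fs_info *fs_info,
				struct btrfs_bio *bbio, u64 logical,
				unsigned int nsectors)
{
	/* SINGLE/RAID0: no second copy, repair can never succeed */
	if (btrfs_num_copies(fs_info, logical, fs_info->sectorsize) <= 1)
		return 0;	/* skip allocation; corruption stays a plain -EIO */

	/* bad_bitmap/cur_bitmap would be new (hypothetical) members */
	bbio->bad_bitmap = bitmap_zalloc(nsectors, GFP_NOFS);
	bbio->cur_bitmap = bitmap_zalloc(nsectors, GFP_NOFS);
	if (!bbio->bad_bitmap || !bbio->cur_bitmap) {
		bitmap_free(bbio->bad_bitmap);
		bitmap_free(bbio->cur_bitmap);
		return -ENOMEM;
	}
	return 0;
}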

But that's all; we still have to allocate the memory, and since I'm
neither a fan of mempools (nor are they a good fit for a variable-length
bitmap), we really only have two solutions:

- Allocate the bitmap when we hit the first error
   The way I did it in the previous RFC version.
   But it also means that if the memory allocation fails, we error out
   without reading the correct mirror.  (Christoph is strongly against
   that error path.)

- Preallocate the bitmap before we submit the bio
   The way I did it in this version.

Now I think the first solution would be the saner one: no change to
btrfs_bio (which is already big), and a memory allocation failure will
not cause any real damage, as the VFS will retry the read with a much
smaller block size.
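
Solution one could look roughly like this (hypothetical names; we are in
endio context, so the allocation must not sleep):

/* Called from endio when a corrupted sector is found. */
static int mark_sector_bad(struct btrfs_read_repair_ctrl *ctrl,
			   unsigned int sector_nr, unsigned int nsectors)
{
	if (!ctrl->bad_bitmap) {
		/* first error: GFP_ATOMIC, and just give up on failure */
		ctrl->bad_bitmap = bitmap_zalloc(nsectors, GFP_ATOMIC);
		if (!ctrl->bad_bitmap)
			return -ENOMEM;	/* the read fails, VFS retries smaller */
	}
	set_bit(sector_nr, ctrl->bad_bitmap);
	return 0;
}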

>
> The main point is probably to stop reentering the functions, though the
> idea of having a single tree that tracks the repair state, with callbacks
> dealing with it, is IMHO not a bad design.  The alternative is to
> complicate the data structures and the endio handler.  I'm not sure it's
> worth the risk.

The point of the repair work is a first step towards fewer members in
btrfs_bio.

In fact, logical reads/writes should not really care about read repair,
csums, mirror numbers, etc.

All of that can be hidden in a lower layer, so my ultimate goal is to
make the logical layer use plain bios only.

But the existing read repair is a blocker, which is why I'm proposing
this patchset.

Thanks,
Qu


* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-12 17:16           ` David Sterba
@ 2022-05-13 10:33             ` Christoph Hellwig
  2022-05-13 10:53               ` Qu Wenruo
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2022-05-13 10:33 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, Christoph Hellwig, Qu Wenruo, linux-btrfs

FYI, I have a new, different approach for the repair code that doesn't
need any bitmap at all.  It should be ready to post in the next few days.


* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-13 10:33             ` Christoph Hellwig
@ 2022-05-13 10:53               ` Qu Wenruo
  2022-05-13 10:57                 ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2022-05-13 10:53 UTC (permalink / raw)
  To: Christoph Hellwig, dsterba, Qu Wenruo, linux-btrfs



On 2022/5/13 18:33, Christoph Hellwig wrote:
> FYI, I have a new, different approach for the repair code that doesn't
> need any bitmap at all.  It should be ready to post in the next few days.

Mind sharing the overall idea?

AFAIK with the current code we don't need to allocate a bitmap either,
but it uses the io_failure_tree and re-enters the endio function with a
different mirror number.

I hope to get rid of both the io_failure_tree and the re-entry of endio.

Thanks,
Qu


* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-13 10:53               ` Qu Wenruo
@ 2022-05-13 10:57                 ` Christoph Hellwig
  2022-05-13 11:21                   ` Qu Wenruo
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2022-05-13 10:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, dsterba, Qu Wenruo, linux-btrfs

On Fri, May 13, 2022 at 06:53:13PM +0800, Qu Wenruo wrote:
> Mind sharing the overall idea?

Build up the repair read bio as large as possible, then submit it
synchronously, and then do the same for the repair writeback.  Basically
go back to synchronous I/O, but do nice large I/O wherever possible.
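
(Very roughly, something like the following loop; the repair helpers
are invented placeholders, while submit_bio_wait() and bio_put() are the
stock block-layer APIs:)

for (mirror = next_mirror(failed_mirror); mirror != failed_mirror;
     mirror = next_mirror(mirror)) {
	/* one big bio covering the whole contiguous corrupted range */
	struct bio *bio = build_repair_read_bio(inode, start, len, mirror);
	int ret = submit_bio_wait(bio);	/* synchronous read */

	if (!ret && repair_csum_ok(bio)) {
		/* write the good data back over the bad mirror, in one go */
		write_good_copy_back(bio, failed_mirror);
		bio_put(bio);
		break;
	}
	bio_put(bio);	/* this mirror is bad too, try the next one */
}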

> I hope to get rid of both the io_failure_tree and the re-entry of endio.

Yes, it reuses this and various other bits from your series.


* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-13 10:57                 ` Christoph Hellwig
@ 2022-05-13 11:21                   ` Qu Wenruo
  2022-05-13 11:23                     ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2022-05-13 11:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dsterba, Qu Wenruo, linux-btrfs



On 2022/5/13 18:57, Christoph Hellwig wrote:
> On Fri, May 13, 2022 at 06:53:13PM +0800, Qu Wenruo wrote:
>> Mind sharing the overall idea?
>
> Build up the repair read bio as large as possible, then submit it
> synchronously, and then do the same for the repair writeback.  Basically
> go back to synchronous I/O, but do nice large I/O wherever possible.

OK, then it goes back to being more like the very first version.

I'm totally fine with that, especially since it makes a lot of things much
simpler, but I'm not sure the other guys would agree on the performance
part.

>
>> I hope to get rid of both the io_failure_tree and the re-entry of endio.
>
> Yes, it reuses this and various other bits from your series.

Yes, it's kind of a middle ground between fully asynchronous batched
submission and fully synchronous submission for each sector.

This middle ground is still synchronous, but batched.

Although for checkerboard-patterned corrupted sectors, it falls back to
fully synchronous submission for each sector.

Thanks,
Qu


* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-13 11:21                   ` Qu Wenruo
@ 2022-05-13 11:23                     ` Christoph Hellwig
  2022-05-17 13:32                       ` Qu Wenruo
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2022-05-13 11:23 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, dsterba, Qu Wenruo, linux-btrfs

On Fri, May 13, 2022 at 07:21:55PM +0800, Qu Wenruo wrote:
> This middle ground is still synchronous, but batched.
> 
> Although for checkerboard-patterned corrupted sectors, it falls back to
> fully synchronous submission for each sector.

Indeed.  But how common is that case?


* Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
  2022-05-13 11:23                     ` Christoph Hellwig
@ 2022-05-17 13:32                       ` Qu Wenruo
  0 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2022-05-17 13:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dsterba, Qu Wenruo, linux-btrfs



On 2022/5/13 19:23, Christoph Hellwig wrote:
> On Fri, May 13, 2022 at 07:21:55PM +0800, Qu Wenruo wrote:
>> This middle ground is still synchronous, but batched.
>>
>> Although for checkerboard-patterned corrupted sectors, it falls back to
>> fully synchronous submission for each sector.
>
> Indeed.  But how common is that case?

Well, my initial rationale for the super-slow (but much simpler) repair
was even more extreme:

   How common is data corruption outside of crafted test cases?

I know there is data corruption in the real world, but I doubt we would
see more than one corrupted sector in a large (128M) extent.

But since your newer idea also assembles a larger bio without really
requiring new infrastructure, it looks pretty reasonable to do it that
way.

So, yes, I'm pretty happy with the idea, as long as there is no strong
objection to the possible but rare performance drop.

Thanks,
Qu


end of thread

Thread overview: 33+ messages
2022-05-03  6:49 [PATCH 00/13] btrfs: make read repair work in synchronous mode Qu Wenruo
2022-05-03  6:49 ` [PATCH 01/13] btrfs: introduce a pure data checksum checking helper Qu Wenruo
2022-05-03 15:03   ` Christoph Hellwig
2022-05-03  6:49 ` [PATCH 02/13] btrfs: quit early if the fs has no RAID56 support for raid56 related checks Qu Wenruo
2022-05-03  6:49 ` [PATCH 03/13] btrfs: save the original bi_iter into btrfs_bio for buffered read Qu Wenruo
2022-05-03  6:49 ` [PATCH 04/13] btrfs: remove duplicated parameters from submit_data_read_repair() Qu Wenruo
2022-05-03  6:49 ` [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors Qu Wenruo
2022-05-03 15:06   ` Christoph Hellwig
2022-05-04  1:12     ` Qu Wenruo
2022-05-04 14:05       ` Christoph Hellwig
2022-05-04 22:40         ` Qu Wenruo
2022-05-12 17:16           ` David Sterba
2022-05-13 10:33             ` Christoph Hellwig
2022-05-13 10:53               ` Qu Wenruo
2022-05-13 10:57                 ` Christoph Hellwig
2022-05-13 11:21                   ` Qu Wenruo
2022-05-13 11:23                     ` Christoph Hellwig
2022-05-17 13:32                       ` Qu Wenruo
2022-05-03  6:49 ` [PATCH 06/13] btrfs: add a helper to queue a corrupted sector for read repair Qu Wenruo
2022-05-03 15:07   ` Christoph Hellwig
2022-05-04  1:13     ` Qu Wenruo
2022-05-04 14:06       ` Christoph Hellwig
2022-05-12 17:20         ` David Sterba
2022-05-03  6:49 ` [PATCH 07/13] btrfs: introduce a helper to repair from one mirror Qu Wenruo
2022-05-03  6:49 ` [PATCH 08/13] btrfs: allow btrfs read repair to submit writes in asynchronous mode Qu Wenruo
2022-05-03  6:49 ` [PATCH 09/13] btrfs: handle RAID56 read repair differently Qu Wenruo
2022-05-03  6:49 ` [PATCH 10/13] btrfs: switch buffered read to the new read repair routine Qu Wenruo
2022-05-03  6:49 ` [PATCH 11/13] btrfs: switch direct IO routine to use btrfs_read_repair_ctrl Qu Wenruo
2022-05-03  6:49 ` [PATCH 12/13] btrfs: remove io_failure_record infrastructure completely Qu Wenruo
2022-05-03  6:49 ` [PATCH 13/13] btrfs: remove btrfs_inode::io_failure_tree Qu Wenruo
2022-05-03 15:07   ` Christoph Hellwig
2022-05-12 17:08 ` [PATCH 00/13] btrfs: make read repair work in synchronous mode David Sterba
2022-05-12 23:01   ` Qu Wenruo
