* [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector
@ 2021-05-03  2:08 Qu Wenruo
  2021-05-03  2:08 ` [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation() Qu Wenruo
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Qu Wenruo @ 2021-05-03  2:08 UTC (permalink / raw)
  To: linux-btrfs

Btrfs read time repair has to handle two different cases when a corruption
or read failure is hit:
- The failed bio contains only one sector
  Then it only needs to find a good copy

- The failed bio contains several sectors
  Then it needs to find which sectors really need to be repaired

But these different behaviors are not really needed, as we can teach btrfs
to submit read repair for each corrupted sector individually.
With that, we only need to handle the one-sector corruption case.

This not only makes the code smaller and simpler, but also benefits subpage,
allowing the subpage case to use the same infrastructure.
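
To illustrate, here is a rough sketch of the per-sector dispatch that
patch 3 implements (simplified; the extent io tree updates for the good
sectors are omitted here):

	/* Walk the failed range one sector at a time */
	for (i = 0; i < nr_sectors; i++) {
		u32 offset = i * sectorsize;

		if (!(error_bitmap & (1U << i))) {
			/* This sector is good, just finish the read for it */
			end_page_read(page, true, start + offset, sectorsize);
			continue;
		}
		/* This sector is corrupted, submit a one-sector repair */
		ret = repair_one_sector(inode, failed_bio, bio_offset + offset,
					page, pgoff + offset, start + offset,
					failed_mirror, submit_bio_hook);
		if (ret < 0)
			return errno_to_blk_status(ret);
	}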

In the current subpage code, the read repair path is hacked to do full
bvec read repair, which has coarser granularity than the regular sector
size case.

The code is still based on the subpage branch, but can be forward ported to
the non-subpage code base with minor conflicts.

Changelog:
v2:
- Split the original patch
  Now we have two preparation patches, then the core change.
  And finally a cleanup.

- Fix the uninitialized @error_bitmap when the bio read fails.

v3:
- Fix the return value type mismatch in repair_one_sector()
  An error happened in the v2 patch split, which can lead to a hang when
  we can't repair the error.

Qu Wenruo (4):
  btrfs: remove the dead branch in btrfs_io_needs_validation()
  btrfs: make btrfs_verify_data_csum() to return a bitmap
  btrfs: submit read time repair only for each corrupted sector
  btrfs: remove io_failure_record::in_validation

 fs/btrfs/ctree.h     |   4 +-
 fs/btrfs/extent_io.c | 262 +++++++++++++++++++------------------------
 fs/btrfs/extent_io.h |   4 +-
 fs/btrfs/inode.c     |  19 +++-
 4 files changed, 134 insertions(+), 155 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation()
  2021-05-03  2:08 [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
@ 2021-05-03  2:08 ` Qu Wenruo
  2021-05-03 17:05   ` David Sterba
  2021-05-10 20:14   ` David Sterba
  2021-05-03  2:08 ` [PATCH v3 2/4] btrfs: make btrfs_verify_data_csum() to return a bitmap Qu Wenruo
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 16+ messages in thread
From: Qu Wenruo @ 2021-05-03  2:08 UTC (permalink / raw)
  To: linux-btrfs

In function btrfs_io_needs_validation() we are guaranteed to get a
non-cloned bio.
The only caller, end_bio_extent_readpage(), already has an ASSERT() to
make sure we only get non-cloned bios.

Thus the (bio_flagged(bio, BIO_CLONED)) branch will never get executed.

Remove the dead branch and update the comment.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 29 +++++++----------------------
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 14ab11381d49..0787fae5f7f1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2644,8 +2644,10 @@ static bool btrfs_check_repairable(struct inode *inode, bool needs_validation,
 
 static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
 {
+	struct bio_vec *bvec;
 	u64 len = 0;
 	const u32 blocksize = inode->i_sb->s_blocksize;
+	int i;
 
 	/*
 	 * If bi_status is BLK_STS_OK, then this was a checksum error, not an
@@ -2669,30 +2671,13 @@ static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
 	if (blocksize < PAGE_SIZE)
 		return false;
 	/*
-	 * We need to validate each sector individually if the failed I/O was
-	 * for multiple sectors.
-	 *
-	 * There are a few possible bios that can end up here:
-	 * 1. A buffered read bio, which is not cloned.
-	 * 2. A direct I/O read bio, which is cloned.
-	 * 3. A (buffered or direct) repair bio, which is not cloned.
-	 *
-	 * For cloned bios (case 2), we can get the size from
-	 * btrfs_io_bio->iter; for non-cloned bios (cases 1 and 3), we can get
-	 * it from the bvecs.
+	 * We're ensured we won't get cloned bio in end_bio_extent_readpage(),
+	 * thus we can get the length from the bvecs.
 	 */
-	if (bio_flagged(bio, BIO_CLONED)) {
-		if (btrfs_io_bio(bio)->iter.bi_size > blocksize)
+	bio_for_each_bvec_all(bvec, bio, i) {
+		len += bvec->bv_len;
+		if (len > blocksize)
 			return true;
-	} else {
-		struct bio_vec *bvec;
-		int i;
-
-		bio_for_each_bvec_all(bvec, bio, i) {
-			len += bvec->bv_len;
-			if (len > blocksize)
-				return true;
-		}
 	}
 	return false;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 2/4] btrfs: make btrfs_verify_data_csum() to return a bitmap
  2021-05-03  2:08 [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
  2021-05-03  2:08 ` [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation() Qu Wenruo
@ 2021-05-03  2:08 ` Qu Wenruo
  2021-05-03  2:08 ` [PATCH v3 3/4] btrfs: submit read time repair only for each corrupted sector Qu Wenruo
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2021-05-03  2:08 UTC (permalink / raw)
  To: linux-btrfs

This will provide the basis for later per-sector repair for subpage,
while still keeping the existing code happy.

If all csums match, the return value is still 0.
Only when a csum mismatches is the return value different.

The new return value is a bitmap. For 4K sectorsize and 4K page size,
a csum mismatch simply makes it 1, instead of the old -EIO.

But for 4K sectorsize and 64K page size, aka the subpage case, a bvec
can contain multiple sectors, and knowing which sectors are corrupted
allows us to submit repair only for the corrupted sectors.
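
As a worked example (assuming 4K sectorsize and 64K page size, i.e. 16
sectors per page): if only the third sector in the range fails its csum
check, the loop below sets bit 2 and the caller gets back 0x4:

	nr_bit = (pg_off - offset_in_page(start)) / sectorsize; /* == 2 */
	result |= (1 << nr_bit);                                /* == 0x4 */

For the regular 4K sectorsize and 4K page size case there is only one
bit, so a csum mismatch simply makes the function return 1.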

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h |  4 ++--
 fs/btrfs/inode.c | 17 ++++++++++++-----
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 80670a631714..7bb4212b90d3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3100,8 +3100,8 @@ u64 btrfs_file_extent_end(const struct btrfs_path *path);
 /* inode.c */
 blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 				   int mirror_num, unsigned long bio_flags);
-int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
-			   struct page *page, u64 start, u64 end);
+unsigned int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
+				    struct page *page, u64 start, u64 end);
 struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
 					   u64 start, u64 len);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 294d8d98280d..e9db33afcb5d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3135,15 +3135,19 @@ static int check_data_csum(struct inode *inode, struct btrfs_io_bio *io_bio,
  * @bio_offset:	offset to the beginning of the bio (in bytes)
  * @start:	file offset of the range start
  * @end:	file offset of the range end (inclusive)
+ *
+ * Return a bitmap where bit set means a csum mismatch, and bit not set means
+ * csum match.
  */
-int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
-			   struct page *page, u64 start, u64 end)
+unsigned int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
+				    struct page *page, u64 start, u64 end)
 {
 	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	const u32 sectorsize = root->fs_info->sectorsize;
 	u32 pg_off;
+	unsigned int result = 0;
 
 	if (PageChecked(page)) {
 		ClearPageChecked(page);
@@ -3171,10 +3175,13 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
 
 		ret = check_data_csum(inode, io_bio, bio_offset, page, pg_off,
 				      page_offset(page) + pg_off);
-		if (ret < 0)
-			return -EIO;
+		if (ret < 0) {
+			int nr_bit = (pg_off - offset_in_page(start)) /
+				     sectorsize;
+			result |= (1 << nr_bit);
+		}
 	}
-	return 0;
+	return result;
 }
 
 /*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 3/4] btrfs: submit read time repair only for each corrupted sector
  2021-05-03  2:08 [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
  2021-05-03  2:08 ` [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation() Qu Wenruo
  2021-05-03  2:08 ` [PATCH v3 2/4] btrfs: make btrfs_verify_data_csum() to return a bitmap Qu Wenruo
@ 2021-05-03  2:08 ` Qu Wenruo
  2021-05-10 20:32   ` David Sterba
  2021-05-03  2:08 ` [PATCH v3 4/4] btrfs: remove io_failure_record::in_validation Qu Wenruo
  2021-05-10 20:41 ` [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector David Sterba
  4 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2021-05-03  2:08 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs_submit_read_repair() has some extra checks on whether the
failed bio needs extra validation for repair.

But we can avoid all this extra mechanism if we submit the repair for
each sector individually.

With that, each read repair can be handled easily, without the need to
verify which sector is corrupted.

This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.

So this patch will:
- Introduce repair_one_sector()
  The main code submitting the repair, which is more or less the same as
  the old btrfs_submit_read_repair().
  But this time, it only repairs one sector.

- Make btrfs_submit_read_repair() handle sectors differently
  For sectors without csum error, just release them like we do in
  end_bio_extent_readpage().
  Since in this context we don't have the process_extent structure, we
  have to do the extent tree operations sector by sector.
  This is slower, but since it only happens in the csum mismatch path,
  it should be fine.

  For sectors with csum error, we submit repair for each such sector.

This patch focuses on the change to the repair path; the extra
validation code is kept as is and will be cleaned up later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 144 ++++++++++++++++++++++++++++++-------------
 fs/btrfs/extent_io.h |   1 +
 fs/btrfs/inode.c     |   2 +-
 3 files changed, 104 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0787fae5f7f1..dca204730d17 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2494,7 +2494,7 @@ void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
 }
 
 static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
-							     u64 start, u64 end)
+							     u64 start)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct io_failure_record *failrec;
@@ -2502,6 +2502,7 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+	const u32 sectorsize = fs_info->sectorsize;
 	int ret;
 	u64 logical;
 
@@ -2525,7 +2526,7 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 		return ERR_PTR(-ENOMEM);
 
 	failrec->start = start;
-	failrec->len = end - start + 1;
+	failrec->len = sectorsize;
 	failrec->this_mirror = 0;
 	failrec->bio_flags = 0;
 	failrec->in_validation = 0;
@@ -2564,12 +2565,13 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	free_extent_map(em);
 
 	/* Set the bits in the private failure tree */
-	ret = set_extent_bits(failure_tree, start, end,
+	ret = set_extent_bits(failure_tree, start, start + sectorsize - 1,
 			      EXTENT_LOCKED | EXTENT_DIRTY);
 	if (ret >= 0) {
 		ret = set_state_failrec(failure_tree, start, failrec);
 		/* Set the bits in the inode's tree */
-		ret = set_extent_bits(tree, start, end, EXTENT_DAMAGED);
+		ret = set_extent_bits(tree, start, start + sectorsize - 1,
+				      EXTENT_DAMAGED);
 	} else if (ret < 0) {
 		kfree(failrec);
 		return ERR_PTR(ret);
@@ -2682,11 +2684,11 @@ static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
 	return false;
 }
 
-blk_status_t btrfs_submit_read_repair(struct inode *inode,
-				      struct bio *failed_bio, u32 bio_offset,
-				      struct page *page, unsigned int pgoff,
-				      u64 start, u64 end, int failed_mirror,
-				      submit_bio_hook_t *submit_bio_hook)
+static int repair_one_sector(struct inode *inode,
+			     struct bio *failed_bio, u32 bio_offset,
+			     struct page *page, unsigned int pgoff,
+			     u64 start, int failed_mirror,
+			     submit_bio_hook_t *submit_bio_hook)
 {
 	struct io_failure_record *failrec;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -2704,16 +2706,22 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
 
 	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
-	failrec = btrfs_get_io_failure_record(inode, start, end);
+	failrec = btrfs_get_io_failure_record(inode, start);
 	if (IS_ERR(failrec))
-		return errno_to_blk_status(PTR_ERR(failrec));
-
-	need_validation = btrfs_io_needs_validation(inode, failed_bio);
+		return PTR_ERR(failrec);
 
+	/*
+	 * We will only submit repair for one sector, thus we don't need
+	 * extra validation anymore.
+	 *
+	 * TODO: All those extra validation related code will be cleaned up
+	 * later.
+	 */
+	need_validation = false;
 	if (!btrfs_check_repairable(inode, need_validation, failrec,
 				    failed_mirror)) {
 		free_io_failure(failure_tree, tree, failrec);
-		return BLK_STS_IOERR;
+		return -EIO;
 	}
 
 	repair_bio = btrfs_io_bio_alloc(1);
@@ -2747,7 +2755,79 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
 		free_io_failure(failure_tree, tree, failrec);
 		bio_put(repair_bio);
 	}
-	return status;
+	return blk_status_to_errno(status);
+}
+
+static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+
+	ASSERT(page_offset(page) <= start &&
+		start + len <= page_offset(page) + PAGE_SIZE);
+
+	if (uptodate) {
+		btrfs_page_set_uptodate(fs_info, page, start, len);
+	} else {
+		btrfs_page_clear_uptodate(fs_info, page, start, len);
+		btrfs_page_set_error(fs_info, page, start, len);
+	}
+
+	if (fs_info->sectorsize == PAGE_SIZE)
+		unlock_page(page);
+	else if (is_data_inode(page->mapping->host))
+		/*
+		 * For subpage data, unlock the page if we're the last reader.
+		 * For subpage metadata, page lock is not utilized for read.
+		 */
+		btrfs_subpage_end_reader(fs_info, page, start, len);
+}
+
+blk_status_t btrfs_submit_read_repair(struct inode *inode,
+				      struct bio *failed_bio, u32 bio_offset,
+				      struct page *page, unsigned int pgoff,
+				      u64 start, u64 end, int failed_mirror,
+				      unsigned int error_bitmap,
+				      submit_bio_hook_t *submit_bio_hook)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	const u32 sectorsize = fs_info->sectorsize;
+	int nr_bits = (end + 1 - start) / sectorsize;
+	int i;
+
+	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
+
+	/* We're here because we had some read errors or csum mismatch */
+	ASSERT(error_bitmap);
+
+	/* Iterate through all the sectors in the range */
+	for (i = 0; i < nr_bits; i++) {
+		int ret;
+		unsigned int offset = i * sectorsize;
+
+		if (!(error_bitmap & (1 << i))) {
+			struct extent_state *cached = NULL;
+
+			/* This sector has no error, just finish the read. */
+			end_page_read(page, true, start + offset, sectorsize);
+			set_extent_uptodate(&BTRFS_I(inode)->io_tree,
+					start + offset,
+					start + offset + sectorsize - 1,
+					&cached, GFP_ATOMIC);
+			unlock_extent_cached_atomic(&BTRFS_I(inode)->io_tree,
+					start + offset,
+					start + offset + sectorsize - 1,
+					&cached);
+			continue;
+		}
+
+		/* This sector has been corrupted, repair it */
+		ret = repair_one_sector(inode, failed_bio, bio_offset + offset,
+				page, pgoff + offset, start + offset,
+				failed_mirror, submit_bio_hook);
+		if (ret < 0)
+			return errno_to_blk_status(ret);
+	}
+	return BLK_STS_OK;
 }
 
 /* lots and lots of room for performance fixes in the end_bio funcs */
@@ -2904,30 +2984,6 @@ static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
 	btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
 }
 
-static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
-
-	ASSERT(page_offset(page) <= start &&
-		start + len <= page_offset(page) + PAGE_SIZE);
-
-	if (uptodate) {
-		btrfs_page_set_uptodate(fs_info, page, start, len);
-	} else {
-		btrfs_page_clear_uptodate(fs_info, page, start, len);
-		btrfs_page_set_error(fs_info, page, start, len);
-	}
-
-	if (fs_info->sectorsize == PAGE_SIZE)
-		unlock_page(page);
-	else if (is_data_inode(page->mapping->host))
-		/*
-		 * For subpage data, unlock the page if we're the last reader.
-		 * For subpage metadata, page lock is not utilized for read.
-		 */
-		btrfs_subpage_end_reader(fs_info, page, start, len);
-}
-
 /*
  * Find extent buffer for a given bytenr.
  *
@@ -2990,6 +3046,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 		const u32 sectorsize = fs_info->sectorsize;
+		unsigned int error_bitmap = (unsigned int)-1;
 		u64 start;
 		u64 end;
 		u32 len;
@@ -3024,12 +3081,14 @@ static void end_bio_extent_readpage(struct bio *bio)
 
 		mirror = io_bio->mirror_num;
 		if (likely(uptodate)) {
-			if (is_data_inode(inode))
-				ret = btrfs_verify_data_csum(io_bio,
+			if (is_data_inode(inode)) {
+				error_bitmap = btrfs_verify_data_csum(io_bio,
 						bio_offset, page, start, end);
-			else
+				ret = error_bitmap;
+			} else {
 				ret = btrfs_validate_metadata_buffer(io_bio,
 					page, start, end, mirror);
+			}
 			if (ret)
 				uptodate = 0;
 			else
@@ -3058,6 +3117,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 						page,
 						start - page_offset(page),
 						start, end, mirror,
+						error_bitmap,
 						btrfs_submit_data_bio)) {
 				uptodate = !bio->bi_status;
 				ASSERT(bio_offset + len > bio_offset);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 1d7bc27719da..89820028c4bc 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -312,6 +312,7 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
 				      struct bio *failed_bio, u32 bio_offset,
 				      struct page *page, unsigned int pgoff,
 				      u64 start, u64 end, int failed_mirror,
+				      unsigned int error_bitmap,
 				      submit_bio_hook_t *submit_bio_hook);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e9db33afcb5d..4fc6e6766234 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7941,7 +7941,7 @@ static blk_status_t btrfs_check_read_dio_bio(struct inode *inode,
 							bvec.bv_page, pgoff,
 							start,
 							start + sectorsize - 1,
-							io_bio->mirror_num,
+							io_bio->mirror_num, 1,
 							submit_dio_repair_bio);
 				if (status)
 					err = status;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 4/4] btrfs: remove io_failure_record::in_validation
  2021-05-03  2:08 [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
                   ` (2 preceding siblings ...)
  2021-05-03  2:08 ` [PATCH v3 3/4] btrfs: submit read time repair only for each corrupted sector Qu Wenruo
@ 2021-05-03  2:08 ` Qu Wenruo
  2021-05-10 20:41 ` [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector David Sterba
  4 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2021-05-03  2:08 UTC (permalink / raw)
  To: linux-btrfs

io_failure_record::in_validation was introduced to handle failed bios
which cross several sectors.
In such a case, we still need to verify which sectors are corrupted.

But since we have changed the way we handle corrupted sectors, by only
submitting repair for each corrupted sector, there is no need for extra
validation any more.

This patch cleans up all the io_failure_record::in_validation related
code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 113 ++++++++-----------------------------------
 fs/btrfs/extent_io.h |   3 +-
 2 files changed, 21 insertions(+), 95 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dca204730d17..11dae53525f6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2426,13 +2426,6 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
 
 	BUG_ON(!failrec->this_mirror);
 
-	if (failrec->in_validation) {
-		/* there was no real error, just free the record */
-		btrfs_debug(fs_info,
-			"clean_io_failure: freeing dummy error at %llu",
-			failrec->start);
-		goto out;
-	}
 	if (sb_rdonly(fs_info->sb))
 		goto out;
 
@@ -2509,9 +2502,8 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	failrec = get_state_failrec(failure_tree, start);
 	if (!IS_ERR(failrec)) {
 		btrfs_debug(fs_info,
-			"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d",
-			failrec->logical, failrec->start, failrec->len,
-			failrec->in_validation);
+	"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu",
+			failrec->logical, failrec->start, failrec->len);
 		/*
 		 * when data can be on disk more than twice, add to failrec here
 		 * (e.g. with a list for failed_mirror) to make
@@ -2529,7 +2521,6 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	failrec->len = sectorsize;
 	failrec->this_mirror = 0;
 	failrec->bio_flags = 0;
-	failrec->in_validation = 0;
 
 	read_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, start, failrec->len);
@@ -2580,7 +2571,7 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	return failrec;
 }
 
-static bool btrfs_check_repairable(struct inode *inode, bool needs_validation,
+static bool btrfs_check_repairable(struct inode *inode,
 				   struct io_failure_record *failrec,
 				   int failed_mirror)
 {
@@ -2600,39 +2591,22 @@ static bool btrfs_check_repairable(struct inode *inode, bool needs_validation,
 		return false;
 	}
 
+	/* The failure record should only contain one sector */
+	ASSERT(failrec->len == fs_info->sectorsize);
+
 	/*
-	 * there are two premises:
-	 *	a) deliver good data to the caller
-	 *	b) correct the bad sectors on disk
+	 * There are two premises:
+	 * a) deliver good data to the caller
+	 * b) correct the bad sectors on disk
+	 *
+	 * Since we're only doing repair for one sector, we only need to get
+	 * a good copy of the failed sector and if we succeed, we have setup
+	 * everything for repair_io_failure to do the rest for us.
 	 */
-	if (needs_validation) {
-		/*
-		 * to fulfill b), we need to know the exact failing sectors, as
-		 * we don't want to rewrite any more than the failed ones. thus,
-		 * we need separate read requests for the failed bio
-		 *
-		 * if the following BUG_ON triggers, our validation request got
-		 * merged. we need separate requests for our algorithm to work.
-		 */
-		BUG_ON(failrec->in_validation);
-		failrec->in_validation = 1;
-		failrec->this_mirror = failed_mirror;
-	} else {
-		/*
-		 * we're ready to fulfill a) and b) alongside. get a good copy
-		 * of the failed sector and if we succeed, we have setup
-		 * everything for repair_io_failure to do the rest for us.
-		 */
-		if (failrec->in_validation) {
-			BUG_ON(failrec->this_mirror != failed_mirror);
-			failrec->in_validation = 0;
-			failrec->this_mirror = 0;
-		}
-		failrec->failed_mirror = failed_mirror;
+	failrec->failed_mirror = failed_mirror;
+	failrec->this_mirror++;
+	if (failrec->this_mirror == failed_mirror)
 		failrec->this_mirror++;
-		if (failrec->this_mirror == failed_mirror)
-			failrec->this_mirror++;
-	}
 
 	if (failrec->this_mirror > num_copies) {
 		btrfs_debug(fs_info,
@@ -2644,46 +2618,6 @@ static bool btrfs_check_repairable(struct inode *inode, bool needs_validation,
 	return true;
 }
 
-static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
-{
-	struct bio_vec *bvec;
-	u64 len = 0;
-	const u32 blocksize = inode->i_sb->s_blocksize;
-	int i;
-
-	/*
-	 * If bi_status is BLK_STS_OK, then this was a checksum error, not an
-	 * I/O error. In this case, we already know exactly which sector was
-	 * bad, so we don't need to validate.
-	 */
-	if (bio->bi_status == BLK_STS_OK)
-		return false;
-
-	/*
-	 * For subpage case, read bio are always submitted as multiple-sector
-	 * bio if the range is in the same page.
-	 * For now, let's just skip the validation, and do page sized repair.
-	 *
-	 * This reduce the granularity for repair, meaning if we have two
-	 * copies with different csum mismatch at different location, we're
-	 * unable to repair in subpage case.
-	 *
-	 * TODO: Make validation code to be fully subpage compatible
-	 */
-	if (blocksize < PAGE_SIZE)
-		return false;
-	/*
-	 * We're ensured we won't get cloned bio in end_bio_extent_readpage(),
-	 * thus we can get the length from the bvecs.
-	 */
-	bio_for_each_bvec_all(bvec, bio, i) {
-		len += bvec->bv_len;
-		if (len > blocksize)
-			return true;
-	}
-	return false;
-}
-
 static int repair_one_sector(struct inode *inode,
 			     struct bio *failed_bio, u32 bio_offset,
 			     struct page *page, unsigned int pgoff,
@@ -2696,7 +2630,6 @@ static int repair_one_sector(struct inode *inode,
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct btrfs_io_bio *failed_io_bio = btrfs_io_bio(failed_bio);
 	const int icsum = bio_offset >> fs_info->sectorsize_bits;
-	bool need_validation;
 	struct bio *repair_bio;
 	struct btrfs_io_bio *repair_io_bio;
 	blk_status_t status;
@@ -2713,13 +2646,9 @@ static int repair_one_sector(struct inode *inode,
 	/*
 	 * We will only submit repair for one sector, thus we don't need
 	 * extra validation anymore.
-	 *
-	 * TODO: All those extra validation related code will be cleaned up
-	 * later.
 	 */
-	need_validation = false;
-	if (!btrfs_check_repairable(inode, need_validation, failrec,
-				    failed_mirror)) {
+
+	if (!btrfs_check_repairable(inode, failrec, failed_mirror)) {
 		free_io_failure(failure_tree, tree, failrec);
 		return -EIO;
 	}
@@ -2727,8 +2656,6 @@ static int repair_one_sector(struct inode *inode,
 	repair_bio = btrfs_io_bio_alloc(1);
 	repair_io_bio = btrfs_io_bio(repair_bio);
 	repair_bio->bi_opf = REQ_OP_READ;
-	if (need_validation)
-		repair_bio->bi_opf |= REQ_FAILFAST_DEV;
 	repair_bio->bi_end_io = failed_bio->bi_end_io;
 	repair_bio->bi_iter.bi_sector = failrec->logical >> 9;
 	repair_bio->bi_private = failed_bio->bi_private;
@@ -2746,8 +2673,8 @@ static int repair_one_sector(struct inode *inode,
 	repair_io_bio->iter = repair_bio->bi_iter;
 
 	btrfs_debug(btrfs_sb(inode->i_sb),
-"repair read error: submitting new read to mirror %d, in_validation=%d",
-		    failrec->this_mirror, failrec->in_validation);
+		    "repair read error: submitting new read to mirror %d",
+		    failrec->this_mirror);
 
 	status = submit_bio_hook(inode, repair_bio, failrec->this_mirror,
 				 failrec->bio_flags);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 89820028c4bc..95affaf2e503 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -292,7 +292,7 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num);
  * When IO fails, either with EIO or csum verification fails, we
  * try other mirrors that might have a good copy of the data.  This
  * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the page is set up to date
+ * mirrors.  If another mirror has good data, the sector is set up to date
  * and things continue.  If a good mirror can't be found, the original
  * bio end_io callback is called to indicate things have failed.
  */
@@ -304,7 +304,6 @@ struct io_failure_record {
 	unsigned long bio_flags;
 	int this_mirror;
 	int failed_mirror;
-	int in_validation;
 };
 
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation()
  2021-05-03  2:08 ` [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation() Qu Wenruo
@ 2021-05-03 17:05   ` David Sterba
  2021-05-03 23:39     ` Qu Wenruo
  2021-05-10 20:14   ` David Sterba
  1 sibling, 1 reply; 16+ messages in thread
From: David Sterba @ 2021-05-03 17:05 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, May 03, 2021 at 10:08:53AM +0800, Qu Wenruo wrote:
> In function btrfs_io_needs_validation() we are ensured to get a
> non-cloned bio.
> The only caller, end_bio_extent_readpage(), already has an ASSERT() to
> make sure we only get non-cloned bios.
> 
> Thus the (bio_flagged(bio, BIO_CLONED)) branch will never get executed.
> 
> Remove the dead branch and updated the comment.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 29 +++++++----------------------
>  1 file changed, 7 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 14ab11381d49..0787fae5f7f1 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2644,8 +2644,10 @@ static bool btrfs_check_repairable(struct inode *inode, bool needs_validation,
>  
>  static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
>  {
> +	struct bio_vec *bvec;
>  	u64 len = 0;
>  	const u32 blocksize = inode->i_sb->s_blocksize;
> +	int i;
>  
>  	/*
>  	 * If bi_status is BLK_STS_OK, then this was a checksum error, not an
> @@ -2669,30 +2671,13 @@ static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
>  	if (blocksize < PAGE_SIZE)
>  		return false;
>  	/*
> -	 * We need to validate each sector individually if the failed I/O was
> -	 * for multiple sectors.
> -	 *
> -	 * There are a few possible bios that can end up here:
> -	 * 1. A buffered read bio, which is not cloned.
> -	 * 2. A direct I/O read bio, which is cloned.

Ok, so the cloned bio was expected only due to direct IO, but now that's
done via iomap, so it makes sense to simplify it.

> -	 * 3. A (buffered or direct) repair bio, which is not cloned.
> -	 *
> -	 * For cloned bios (case 2), we can get the size from
> -	 * btrfs_io_bio->iter; for non-cloned bios (cases 1 and 3), we can get
> -	 * it from the bvecs.
> +	 * We're ensured we won't get cloned bio in end_bio_extent_readpage(),
> +	 * thus we can get the length from the bvecs.
>  	 */
> -	if (bio_flagged(bio, BIO_CLONED)) {
> -		if (btrfs_io_bio(bio)->iter.bi_size > blocksize)
> +	bio_for_each_bvec_all(bvec, bio, i) {
> +		len += bvec->bv_len;
> +		if (len > blocksize)
>  			return true;
> -	} else {
> -		struct bio_vec *bvec;
> -		int i;
> -
> -		bio_for_each_bvec_all(bvec, bio, i) {
> -			len += bvec->bv_len;
> -			if (len > blocksize)
> -				return true;
> -		}
>  	}
>  	return false;
>  }
> -- 
> 2.31.1

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation()
  2021-05-03 17:05   ` David Sterba
@ 2021-05-03 23:39     ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2021-05-03 23:39 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Omar Sandoval



On 2021/5/4 1:05 AM, David Sterba wrote:
> On Mon, May 03, 2021 at 10:08:53AM +0800, Qu Wenruo wrote:
>> In function btrfs_io_needs_validation() we are ensured to get a
>> non-cloned bio.
>> The only caller, end_bio_extent_readpage(), already has an ASSERT() to
>> make sure we only get non-cloned bios.
>>
>> Thus the (bio_flagged(bio, BIO_CLONED)) branch will never get executed.
>>
>> Remove the dead branch and updated the comment.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 29 +++++++----------------------
>>   1 file changed, 7 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 14ab11381d49..0787fae5f7f1 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2644,8 +2644,10 @@ static bool btrfs_check_repairable(struct inode *inode, bool needs_validation,
>>
>>   static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
>>   {
>> +	struct bio_vec *bvec;
>>   	u64 len = 0;
>>   	const u32 blocksize = inode->i_sb->s_blocksize;
>> +	int i;
>>
>>   	/*
>>   	 * If bi_status is BLK_STS_OK, then this was a checksum error, not an
>> @@ -2669,30 +2671,13 @@ static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
>>   	if (blocksize < PAGE_SIZE)
>>   		return false;
>>   	/*
>> -	 * We need to validate each sector individually if the failed I/O was
>> -	 * for multiple sectors.
>> -	 *
>> -	 * There are a few possible bios that can end up here:
>> -	 * 1. A buffered read bio, which is not cloned.
>> -	 * 2. A direct I/O read bio, which is cloned.
>
> Ok, so the cloned bio was expected only due to direct io but now it's
> done via iomap so it makes sense to simplify it.

I didn't dig into the history too deeply, but I don't think it's just
the iomap refactor.

The ASSERT() in end_bio_extent_readpage() was added in c09abff87f90
("btrfs: cloned bios must not be iterated by bio_for_each_segment_all"),
which is from 2017, way before the iomap dio change.

So the dead code has been there for at least 4 years.


BTW, since you're mentioning iomap dio, the patchset is inspired by
commit 77d5d6893106 ("btrfs: unify buffered and direct I/O read
repair"), kudos to Omar, which not only unified the code paths, but also
considered subpage from the very beginning.

Thanks,
Qu

>
>> -	 * 3. A (buffered or direct) repair bio, which is not cloned.
>> -	 *
>> -	 * For cloned bios (case 2), we can get the size from
>> -	 * btrfs_io_bio->iter; for non-cloned bios (cases 1 and 3), we can get
>> -	 * it from the bvecs.
>> +	 * We're ensured we won't get cloned bio in end_bio_extent_readpage(),
>> +	 * thus we can get the length from the bvecs.
>>   	 */
>> -	if (bio_flagged(bio, BIO_CLONED)) {
>> -		if (btrfs_io_bio(bio)->iter.bi_size > blocksize)
>> +	bio_for_each_bvec_all(bvec, bio, i) {
>> +		len += bvec->bv_len;
>> +		if (len > blocksize)
>>   			return true;
>> -	} else {
>> -		struct bio_vec *bvec;
>> -		int i;
>> -
>> -		bio_for_each_bvec_all(bvec, bio, i) {
>> -			len += bvec->bv_len;
>> -			if (len > blocksize)
>> -				return true;
>> -		}
>>   	}
>>   	return false;
>>   }
>> --
>> 2.31.1

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation()
  2021-05-03  2:08 ` [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation() Qu Wenruo
  2021-05-03 17:05   ` David Sterba
@ 2021-05-10 20:14   ` David Sterba
  2021-05-10 23:56     ` Qu Wenruo
  1 sibling, 1 reply; 16+ messages in thread
From: David Sterba @ 2021-05-10 20:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, May 03, 2021 at 10:08:53AM +0800, Qu Wenruo wrote:
>  	/*
> -	 * We need to validate each sector individually if the failed I/O was
> -	 * for multiple sectors.
> -	 *
> -	 * There are a few possible bios that can end up here:
> -	 * 1. A buffered read bio, which is not cloned.
> -	 * 2. A direct I/O read bio, which is cloned.
> -	 * 3. A (buffered or direct) repair bio, which is not cloned.
> -	 *
> -	 * For cloned bios (case 2), we can get the size from
> -	 * btrfs_io_bio->iter; for non-cloned bios (cases 1 and 3), we can get
> -	 * it from the bvecs.
> +	 * We're ensured we won't get cloned bio in end_bio_extent_readpage(),
> +	 * thus we can get the length from the bvecs.
>  	 */
> -	if (bio_flagged(bio, BIO_CLONED)) {
> -		if (btrfs_io_bio(bio)->iter.bi_size > blocksize)
> +	bio_for_each_bvec_all(bvec, bio, i) {
> +		len += bvec->bv_len;
> +		if (len > blocksize)
>  			return true;

I've looked at whether bio cloning is used at all, so we could potentially
get rid of all the BIO_CLONED assertions. There are still two cases:

* btrfs_submit_direct calling btrfs_bio_clone_partial
* btrfs_map_bio calling btrfs_bio_clone

So in this case I'd rather add an assertion before
bio_for_each_bvec_all, as this fits the usecase "this never happens".
The original assertions were added everywhere once the bio iteration
behaviour changed without much notice, so we need to be cautious.

Applied with the following fixup

--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2616,6 +2616,7 @@ static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
         * We're ensured we won't get cloned bio in end_bio_extent_readpage(),
         * thus we can get the length from the bvecs.
         */
+       ASSERT(!bio_flagged(bio, BIO_CLONED));
        bio_for_each_bvec_all(bvec, bio, i) {
                len += bvec->bv_len;
                if (len > blocksize)
---

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/4] btrfs: submit read time repair only for each corrupted sector
  2021-05-03  2:08 ` [PATCH v3 3/4] btrfs: submit read time repair only for each corrupted sector Qu Wenruo
@ 2021-05-10 20:32   ` David Sterba
  2021-05-11  1:42     ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: David Sterba @ 2021-05-10 20:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, May 03, 2021 at 10:08:55AM +0800, Qu Wenruo wrote:
> Currently btrfs_submit_read_repair() has some extra check on whether the
> failed bio needs extra validation for repair.
> 
> But we can avoid all these extra mechanism if we submit the repair for
> each sector.
> 
> By this, each read repair can be easily handled without the need to
> verify which sector is corrupted.
> 
> This will also benefit subpage, as one subpage bvec can contain several
> sectors, making the extra verification more complex.
> 
> So this patch will:
> - Introduce repair_one_sector()
>   The main code submitting repair, which is more or less the same as old
>   btrfs_submit_read_repair().
>   But this time, it only repair one sector.
> 
> - Make btrfs_submit_read_repair() to handle sectors differently
>   For sectors without csum error, just release them like what we did
>   in end_bio_extent_readpage().
>   Although in this context we don't have process_extent structure, thus
>   we have to do extent tree operations sector by sector.
>   This is slower, but since it's only in csum mismatch path, it should
>   be fine.
> 
>   For sectors with csum error, we submit repair for each sector.
> 
> This patch will focus on the change on the repair path, the extra
> validation code is still kept as is, and will be cleaned up later.

This leaves btrfs_io_needs_validation unused and the compiler warns about
that, but it gets removed in the next patch so that's ok.

I did some minor style fixups

--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2706,7 +2706,7 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
        struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
 
        ASSERT(page_offset(page) <= start &&
-               start + len <= page_offset(page) + PAGE_SIZE);
+              start + len <= page_offset(page) + PAGE_SIZE);
 
        if (uptodate) {
                btrfs_page_set_uptodate(fs_info, page, start, len);
@@ -2734,7 +2734,7 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
 {
        struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
        const u32 sectorsize = fs_info->sectorsize;
-       int nr_bits = (end + 1 - start) / sectorsize;
+       const int nr_bits = (end + 1 - start) / fs_info->sectorsize_bits;
        int i;
 
        BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
@@ -2747,10 +2747,10 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
                int ret;
                unsigned int offset = i * sectorsize;
 
-               if (!(error_bitmap & (1 << i))) {
+               if (!(error_bitmap & (1U << i))) {
                        struct extent_state *cached = NULL;
 
-                       /* This sector has no error, just finish the read. */
+                       /* This sector has no error, just finish the read */
                        end_page_read(page, true, start + offset, sectorsize);
                        set_extent_uptodate(&BTRFS_I(inode)->io_tree,
                                        start + offset,
---

The division can be replaced by a shift, as we have the shift count in
fs_info, and "1U" in shifts is for clarity that it's performed on an
unsigned type.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector
  2021-05-03  2:08 [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
                   ` (3 preceding siblings ...)
  2021-05-03  2:08 ` [PATCH v3 4/4] btrfs: remove io_failure_record::in_validation Qu Wenruo
@ 2021-05-10 20:41 ` David Sterba
  2021-05-11  1:07   ` Qu Wenruo
  4 siblings, 1 reply; 16+ messages in thread
From: David Sterba @ 2021-05-10 20:41 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, May 03, 2021 at 10:08:52AM +0800, Qu Wenruo wrote:
> Btrfs read time repair has to handle two different cases when a corruption
> or read failure is hit:
> - The failed bio contains only one sector
>   Then it only need to find a good copy
> 
> - The failed bio contains several sectors
>   Then it needs to find which sectors really need to be repaired
> 
> But this different behaviors are not really needed, as we can teach btrfs
> to only submit read repair for each corrupted sector.
> By this, we only need to handle the one-sector corruption case.
> 
> This not only makes the code smaller and simpler, but also benefits subpage,
> allow subpage case to use the same infrastructure.
> 
> For current subpage code, we hacked the read repair code to make full
> bvec read repair, which has less granularity compared to regular sector
> size.
> 
> The code is still based on subpage branch, but can be forward ported to
> non-subpage code basis with minor conflicts.
> 
> Changelog:
> v2:
> - Split the original patch
>   Now we have two preparation patches, then the core change.
>   And finally a cleanup.
> 
> - Fix the uninitialize @error_bitmap when the bio read fails.
> 
> v3:
> - Fix the return value type mismatch in repair_one_sector()
>   An error happens in v2 patch split, which can lead to hang when
>   we can't repair the error.

Patchset added to for-next. The cleanups and simplifications look good
to me, thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation()
  2021-05-10 20:14   ` David Sterba
@ 2021-05-10 23:56     ` Qu Wenruo
  2021-05-11 11:27       ` David Sterba
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2021-05-10 23:56 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 2021/5/11 4:14 AM, David Sterba wrote:
> On Mon, May 03, 2021 at 10:08:53AM +0800, Qu Wenruo wrote:
>>   	/*
>> -	 * We need to validate each sector individually if the failed I/O was
>> -	 * for multiple sectors.
>> -	 *
>> -	 * There are a few possible bios that can end up here:
>> -	 * 1. A buffered read bio, which is not cloned.
>> -	 * 2. A direct I/O read bio, which is cloned.
>> -	 * 3. A (buffered or direct) repair bio, which is not cloned.
>> -	 *
>> -	 * For cloned bios (case 2), we can get the size from
>> -	 * btrfs_io_bio->iter; for non-cloned bios (cases 1 and 3), we can get
>> -	 * it from the bvecs.
>> +	 * We're ensured we won't get cloned bio in end_bio_extent_readpage(),
>> +	 * thus we can get the length from the bvecs.
>>   	 */
>> -	if (bio_flagged(bio, BIO_CLONED)) {
>> -		if (btrfs_io_bio(bio)->iter.bi_size > blocksize)
>> +	bio_for_each_bvec_all(bvec, bio, i) {
>> +		len += bvec->bv_len;
>> +		if (len > blocksize)
>>   			return true;
> 
> I've looked if the bio cloning is used at all so we could potentially
> get rid of all the BIO_CLONED assertions. There are still two cases:
> 
> * btrfs_submit_direct calling btrfs_bio_clone_partial

This is what I missed.

In fact for DIO read repair we can still hit a cloned bio.

But in that case, we still don't need any validation, since the DIO read
repair is guaranteed to only submit sector-sized repair.
So the patchset is still fine, but it will break bisection.

> * btrfs_map_bio calling btrfs_bio_clone

This is not a problem, as the generated bios are for the real device, not
for the inode pages, and read repair only happens for inode pages, so
we're completely fine.

> 
> So in this case I'd rather add an assertion before
> bio_for_each_bvec_all, as this fits the usecase "this never happens".
> The original assertions were added everywhere once the bio iteration
> behaviour changed without much notice, so we need to be cautious.
> 
> Applied with the following fixup

Then bisect will be broken.

If one is testing just this patch, DIO read repair will trigger the 
ASSERT().

Is it possible to discard this patch and completely rely on the last 
patch to remove btrfs_io_needs_validation()?

Thanks,
Qu

> 
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2616,6 +2616,7 @@ static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
>           * We're ensured we won't get cloned bio in end_bio_extent_readpage(),
>           * thus we can get the length from the bvecs.
>           */
> +       ASSERT(!bio_flagged(bio, BIO_CLONED));
>          bio_for_each_bvec_all(bvec, bio, i) {
>                  len += bvec->bv_len;
>                  if (len > blocksize)
> ---
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector
  2021-05-10 20:41 ` [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector David Sterba
@ 2021-05-11  1:07   ` Qu Wenruo
  2021-05-11 11:59     ` David Sterba
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2021-05-11  1:07 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 2021/5/11 4:41 AM, David Sterba wrote:
> On Mon, May 03, 2021 at 10:08:52AM +0800, Qu Wenruo wrote:
>> Btrfs read time repair has to handle two different cases when a corruption
>> or read failure is hit:
>> - The failed bio contains only one sector
>>    Then it only need to find a good copy
>>
>> - The failed bio contains several sectors
>>    Then it needs to find which sectors really need to be repaired
>>
>> But this different behaviors are not really needed, as we can teach btrfs
>> to only submit read repair for each corrupted sector.
>> By this, we only need to handle the one-sector corruption case.
>>
>> This not only makes the code smaller and simpler, but also benefits subpage,
>> allow subpage case to use the same infrastructure.
>>
>> For current subpage code, we hacked the read repair code to make full
>> bvec read repair, which has less granularity compared to regular sector
>> size.
>>
>> The code is still based on subpage branch, but can be forward ported to
>> non-subpage code basis with minor conflicts.
>>
>> Changelog:
>> v2:
>> - Split the original patch
>>    Now we have two preparation patches, then the core change.
>>    And finally a cleanup.
>>
>> - Fix the uninitialize @error_bitmap when the bio read fails.
>>
>> v3:
>> - Fix the return value type mismatch in repair_one_sector()
>>    An error happens in v2 patch split, which can lead to hang when
>>    we can't repair the error.
> 
> Patchset added to for-next. The cleanups and simplifications look good
> to me, thanks.
> 
I'm afraid there is a bug in the patchset.

If we have a data read covering 16 sectors in one page, and one sector is
bad and can't be repaired, we will underflow the subpage::readers number.

The cause is that there are two call sites calling end_page_read().

One is in btrfs_submit_read_repair(), the other in end_bio_extent_readpage().
The former only calls end_page_read() for the good sectors, while the
latter calls end_page_read() for the full range.
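
Roughly, the double accounting looks like this (simplified call chains,
not the exact code):

	btrfs_submit_read_repair()
	  -> end_page_read(page, true, start + offset, sectorsize)
	     -> btrfs_subpage_end_reader() for each good sector

	end_bio_extent_readpage()
	  -> end_page_read(page, uptodate, start, len)
	     -> btrfs_subpage_end_reader() again for the full range,
	        including the sectors already released above, so
	        subpage::readers goes below zero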

The direct fix is to make btrfs_submit_read_repair() handle both cases,
and skip the call in end_bio_extent_readpage().

So I need to update the patchset to include a proper fix for it.

But on the other hand, I'm also wondering whether we should keep using
btrfs_subpage::readers as an atomic.
For a more idiot-proof way, we could also go with a 16-bit bitmap for
reader/writer accounting; with that, even if we call end_page_read()
twice for the same range, it won't cause any harm.

Any advice on the btrfs_subpage::readers implementation?
Should it be made really idiot (me) proof, or keep the current atomic way
to catch more idiots like me?

Thanks,
Qu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/4] btrfs: submit read time repair only for each corrupted sector
  2021-05-10 20:32   ` David Sterba
@ 2021-05-11  1:42     ` Qu Wenruo
  2021-05-11 11:35       ` David Sterba
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2021-05-11  1:42 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/5/11 4:32 AM, David Sterba wrote:
> On Mon, May 03, 2021 at 10:08:55AM +0800, Qu Wenruo wrote:
>> Currently btrfs_submit_read_repair() has some extra check on whether the
>> failed bio needs extra validation for repair.
>>
>> But we can avoid all these extra mechanism if we submit the repair for
>> each sector.
>>
>> By this, each read repair can be easily handled without the need to
>> verify which sector is corrupted.
>>
>> This will also benefit subpage, as one subpage bvec can contain several
>> sectors, making the extra verification more complex.
>>
>> So this patch will:
>> - Introduce repair_one_sector()
>>    The main code submitting repair, which is more or less the same as old
>>    btrfs_submit_read_repair().
>>    But this time, it only repair one sector.
>>
>> - Make btrfs_submit_read_repair() to handle sectors differently
>>    For sectors without csum error, just release them like what we did
>>    in end_bio_extent_readpage().
>>    Although in this context we don't have process_extent structure, thus
>>    we have to do extent tree operations sector by sector.
>>    This is slower, but since it's only in csum mismatch path, it should
>>    be fine.
>>
>>    For sectors with csum error, we submit repair for each sector.
>>
>> This patch will focus on the change on the repair path, the extra
>> validation code is still kept as is, and will be cleaned up later.
>
> This leaves btrfs_io_needs_validation unused and compiler warns about
> that but it gets removed in the next patch so that's ok.
>
> I did some minor style fixups
>
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2706,7 +2706,7 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
>          struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>
>          ASSERT(page_offset(page) <= start &&
> -               start + len <= page_offset(page) + PAGE_SIZE);
> +              start + len <= page_offset(page) + PAGE_SIZE);
>
>          if (uptodate) {
>                  btrfs_page_set_uptodate(fs_info, page, start, len);
> @@ -2734,7 +2734,7 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
>   {
>          struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>          const u32 sectorsize = fs_info->sectorsize;
> -       int nr_bits = (end + 1 - start) / sectorsize;
> +       const int nr_bits = (end + 1 - start) / fs_info->sectorsize_bits;

It should be >> fs_info->sectorsize_bits;
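
I.e. the fixed line would read something like (same names as in your
fixup, just with the shift):

	const int nr_bits = (end + 1 - start) >> fs_info->sectorsize_bits;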

Anyway, I'll submit a proper updated version, with your update and
proper test.

Thanks,
Qu

>          int i;
>
>          BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
> @@ -2747,10 +2747,10 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
>                  int ret;
>                  unsigned int offset = i * sectorsize;
>
> -               if (!(error_bitmap & (1 << i))) {
> +               if (!(error_bitmap & (1U << i))) {
>                          struct extent_state *cached = NULL;
>
> -                       /* This sector has no error, just finish the read. */
> +                       /* This sector has no error, just finish the read */
>                          end_page_read(page, true, start + offset, sectorsize);
>                          set_extent_uptodate(&BTRFS_I(inode)->io_tree,
>                                          start + offset,
> ---
>
> The division can be replaced by shift as we have it in fs_info and "1U"
> in shifts is for clarity that it's performed on unsigned type.
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation()
  2021-05-10 23:56     ` Qu Wenruo
@ 2021-05-11 11:27       ` David Sterba
  0 siblings, 0 replies; 16+ messages in thread
From: David Sterba @ 2021-05-11 11:27 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs

On Tue, May 11, 2021 at 07:56:42AM +0800, Qu Wenruo wrote:
> >> -	if (bio_flagged(bio, BIO_CLONED)) {
> >> -		if (btrfs_io_bio(bio)->iter.bi_size > blocksize)
> >> +	bio_for_each_bvec_all(bvec, bio, i) {
> >> +		len += bvec->bv_len;
> >> +		if (len > blocksize)
> >>   			return true;
> > 
> > I've looked if the bio cloning is used at all so we could potentially
> > get rid of all the BIO_CLONED assertions. There are still two cases:
> > 
> > * btrfs_submit_direct calling btrfs_bio_clone_partial
> 
> This is what I missed.
> 
> In fact for DIO read repair we can still hit a cloned bio.
> 
> But in that case, we still don't need any validation since the DIO read 
> repair is ensured to only submit sector sized repair.
> So the patchset is still fine, but will break the bisect.
> 
> > * btrfs_map_bio calling btrfs_bio_clone
> 
> This is not a problem, as the generated bio are for real device, not for 
> the inode pages, and read repair only happens for inode pages, we're 
> completely fine.
> 
> > So in this case I'd rather add an assertion before
> > bio_for_each_bvec_all, as this fits the usecase "this never happens".
> > The original assertions were added everywhere once the bio iteration
> > behaviour changed without much notice, so we need to be cautious.
> > 
> > Applied with the following fixup
> 
> Then bisect will be broken.
> 
> If one is testing just this patch, DIO read repair will trigger the 
> ASSERT().
> 
> Is it possible to discard this patch and completely rely on the last 
> patch to remove btrfs_io_needs_validation()?

Yeah, we want to keep it bisectable.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/4] btrfs: submit read time repair only for each corrupted sector
  2021-05-11  1:42     ` Qu Wenruo
@ 2021-05-11 11:35       ` David Sterba
  0 siblings, 0 replies; 16+ messages in thread
From: David Sterba @ 2021-05-11 11:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Tue, May 11, 2021 at 09:42:30AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/5/11 4:32 AM, David Sterba wrote:
> > On Mon, May 03, 2021 at 10:08:55AM +0800, Qu Wenruo wrote:
> >> Currently btrfs_submit_read_repair() has some extra check on whether the
> >> failed bio needs extra validation for repair.
> >>
> >> But we can avoid all these extra mechanism if we submit the repair for
> >> each sector.
> >>
> >> By this, each read repair can be easily handled without the need to
> >> verify which sector is corrupted.
> >>
> >> This will also benefit subpage, as one subpage bvec can contain several
> >> sectors, making the extra verification more complex.
> >>
> >> So this patch will:
> >> - Introduce repair_one_sector()
> >>    The main code submitting repair, which is more or less the same as old
> >>    btrfs_submit_read_repair().
> >>    But this time, it only repair one sector.
> >>
> >> - Make btrfs_submit_read_repair() to handle sectors differently
> >>    For sectors without csum error, just release them like what we did
> >>    in end_bio_extent_readpage().
> >>    Although in this context we don't have process_extent structure, thus
> >>    we have to do extent tree operations sector by sector.
> >>    This is slower, but since it's only in csum mismatch path, it should
> >>    be fine.
> >>
> >>    For sectors with csum error, we submit repair for each sector.
> >>
> >> This patch will focus on the change on the repair path, the extra
> >> validation code is still kept as is, and will be cleaned up later.
> >
> > This leaves btrfs_io_needs_validation unused and compiler warns about
> > that but it gets removed in the next patch so that's ok.
> >
> > I did some minor style fixups
> >
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2706,7 +2706,7 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> >          struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> >
> >          ASSERT(page_offset(page) <= start &&
> > -               start + len <= page_offset(page) + PAGE_SIZE);
> > +              start + len <= page_offset(page) + PAGE_SIZE);
> >
> >          if (uptodate) {
> >                  btrfs_page_set_uptodate(fs_info, page, start, len);
> > @@ -2734,7 +2734,7 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
> >   {
> >          struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >          const u32 sectorsize = fs_info->sectorsize;
> > -       int nr_bits = (end + 1 - start) / sectorsize;
> > +       const int nr_bits = (end + 1 - start) / fs_info->sectorsize_bits;
> 
> It should be >> fs_info->sectorsize_bits;
> 
> Anyway, I'll submit a proper updated version, with your update and
> proper test.

My bad, thanks for catching it.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector
  2021-05-11  1:07   ` Qu Wenruo
@ 2021-05-11 11:59     ` David Sterba
  0 siblings, 0 replies; 16+ messages in thread
From: David Sterba @ 2021-05-11 11:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs

On Tue, May 11, 2021 at 09:07:26AM +0800, Qu Wenruo wrote:
> I'm afraid there is a bug in the patchset.
> 
> If we had a data read for 16 sectors in one page, one sector is bad and 
> can't be repaired, we will under flow subage::readers number.
> 
> The cause is there are two call sites calling end_page_read().
> 
> One in btrfs_submit_read_repair(), one in end_bio_extent_readpage().
> The former one is just calling end_page_read() for the good copy, while 
> the latter one is calling end_page_read() for the full range.
> 
> The direct fix is to make btrfs_submit_read_repair() to handle both 
> cases, and skip the call in end_bio_extent_readpage().
> 
> So I need to update the patchset to include a proper fix for it.
> 
> But on the other hand, I'm also wondering should we use 
> btrfs_subpage::readers as an atomic.
> For a more idiot proof way, we can also go 16bit map for reader/writer 
> accounting, by that even we call end_page_read() twice for the same 
> range, it won't cause anything.
> 
> Any advice on btrfs_subpage::readers implementation?

At this point do what you think would work safely even if the
performance would not be great, e.g. using a spinlock around the bitmap.
We'll have to optimize all the bitmaps anyway, but not before the
subpage support is finished.
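
Something along these lines would be enough for now (just a rough
sketch; the field and helper names here are made up for illustration,
not the current struct btrfs_subpage layout):

	/* Per-sector reader accounting, protected by a spinlock */
	bool last = false;

	spin_lock_irqsave(&subpage->lock, flags);
	if (test_and_clear_bit(sector_nr, subpage->reader_bitmap)) {
		/* Only the first release of this sector is counted */
		ASSERT(subpage->readers > 0);
		last = (--subpage->readers == 0);
	}
	spin_unlock_irqrestore(&subpage->lock, flags);
	if (last)
		unlock_page(page);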

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread

Thread overview: 16+ messages
2021-05-03  2:08 [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
2021-05-03  2:08 ` [PATCH v3 1/4] btrfs: remove the dead branch in btrfs_io_needs_validation() Qu Wenruo
2021-05-03 17:05   ` David Sterba
2021-05-03 23:39     ` Qu Wenruo
2021-05-10 20:14   ` David Sterba
2021-05-10 23:56     ` Qu Wenruo
2021-05-11 11:27       ` David Sterba
2021-05-03  2:08 ` [PATCH v3 2/4] btrfs: make btrfs_verify_data_csum() to return a bitmap Qu Wenruo
2021-05-03  2:08 ` [PATCH v3 3/4] btrfs: submit read time repair only for each corrupted sector Qu Wenruo
2021-05-10 20:32   ` David Sterba
2021-05-11  1:42     ` Qu Wenruo
2021-05-11 11:35       ` David Sterba
2021-05-03  2:08 ` [PATCH v3 4/4] btrfs: remove io_failure_record::in_validation Qu Wenruo
2021-05-10 20:41 ` [PATCH v3 0/4] btrfs: make read time repair to be only submitted for each corrupted sector David Sterba
2021-05-11  1:07   ` Qu Wenruo
2021-05-11 11:59     ` David Sterba
