* [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector
@ 2021-05-12  4:53 Qu Wenruo
  2021-05-12  4:53 ` [PATCH v5 1/3] btrfs: make btrfs_verify_data_csum() to return a bitmap Qu Wenruo
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Qu Wenruo @ 2021-05-12  4:53 UTC
  To: linux-btrfs

Btrfs read time repair has to handle two different cases when a corruption
or read failure is hit:
- The failed bio contains only one sector
  Then it only needs to find a good copy

- The failed bio contains several sectors
  Then it needs to find which sectors really need to be repaired

But these different behaviors are not really needed, as we can teach btrfs
to submit read repair only for each corrupted sector.
This way, we only need to handle the one-sector corruption case.

This not only makes the code smaller and simpler, but also benefits subpage,
allowing the subpage case to use the same infrastructure without any extra
modification.

In the current subpage code, we hacked the read repair code to do full
bvec read repair, which has coarser granularity than the regular sector
size.

The code is still based on the subpage branch, but can be forward ported to
the non-subpage code base with minor conflicts.

Changelog:
v2:
- Split the original patch
  Now we have two preparation patches, then the core change.
  And finally a cleanup.

- Fix the uninitialized @error_bitmap when the bio read fails.

v3:
- Fix the return value type mismatch in repair_one_sector()
  An error was introduced in the v2 patch split, which can lead to a hang
  when we can't repair the error.

v4:
- Fix a bug where end_page_read() gets called twice for the same range
  This happens when the corrupted sector has no extra copy, thus
  btrfs_submit_read_repair() returns -EIO, leaving both
  btrfs_submit_read_repair() and end_bio_extent_readpage() to
  call end_page_read() on the good copy.
  Thankfully this only affects subpage.

- Fix a bug where sectors after an unrepairable corruption are not released
  Since btrfs_submit_read_repair() is responsible for the page release,
  we can no longer just error out.
  Otherwise some ordered extent will not be able to finish.

- Remove patch "btrfs: remove the dead branch in btrfs_io_needs_validation()"
  The cleanup will break bisect, as DIO can still generate cloned bio.
  Thus remove it and let the final cleanup patch handle everything.

- Apply the style fixes from David

v5:
- Fix a bug where we grab wrong fs_info from DIO page
  Exposed by btrfs/215.
  And for DIO case we don't need end_page_read() and extent unrelease
  call at all.

- Unexport btrfs_submit_read_repair(), export btrfs_repair_one_sector()
  Since DIO only needs to repair one sector, unexport
  btrfs_submit_read_repair() and just export btrfs_repair_one_sector().

Qu Wenruo (3):
  btrfs: make btrfs_verify_data_csum() to return a bitmap
  btrfs: submit read time repair only for each corrupted sector
  btrfs: remove io_failure_record::in_validation

 fs/btrfs/ctree.h     |   4 +-
 fs/btrfs/extent_io.c | 348 +++++++++++++++++++++----------------------
 fs/btrfs/extent_io.h |  13 +-
 fs/btrfs/inode.c     |  27 ++--
 4 files changed, 191 insertions(+), 201 deletions(-)

-- 
2.31.1



* [PATCH v5 1/3] btrfs: make btrfs_verify_data_csum() to return a bitmap
  2021-05-12  4:53 [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
@ 2021-05-12  4:53 ` Qu Wenruo
  2021-05-12  4:53 ` [PATCH v5 2/3] btrfs: submit read time repair only for each corrupted sector Qu Wenruo
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2021-05-12  4:53 UTC
  To: linux-btrfs

This will provide the basis for later per-sector repair for subpage,
while still keeping the existing code happy.

If all csums match, the return value is still 0.
Only when a csum mismatches is the return value different.

The new return value is a bitmap. For 4K sectorsize and 4K page size,
a mismatch returns 1 instead of the old -EIO.

But for 4K sectorsize and 64K page size, aka subpage case, since the
bvec can contain multiple sectors, knowing which sectors are corrupted
will allow us to submit repair only for corrupted sectors.
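
For illustration only, the bitmap idea can be sketched in standalone C
outside the kernel. The function name csum_error_bitmap() and the arrays
of expected/actual checksums are assumptions made for this sketch; the
actual patch walks the page and calls check_data_csum() for each sector.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: compare @nr_sectors per-sector checksums and return
 * a bitmap where bit i is set when sector i mismatches.
 * Returns 0 when every sector matches.
 */
static unsigned int csum_error_bitmap(const uint32_t *expected,
				      const uint32_t *actual,
				      unsigned int nr_sectors)
{
	unsigned int result = 0;
	unsigned int i;

	/* One bit per sector, so the range must fit in an unsigned int. */
	assert(nr_sectors <= sizeof(result) * 8);

	for (i = 0; i < nr_sectors; i++) {
		if (expected[i] != actual[i])
			result |= 1U << i;
	}
	return result;
}
```

With 4K sectors in a 64K page, one bvec covers at most 16 sectors, so an
unsigned int has enough bits for this case.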

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h |  4 ++--
 fs/btrfs/inode.c | 18 +++++++++++++-----
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 80670a631714..7bb4212b90d3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3100,8 +3100,8 @@ u64 btrfs_file_extent_end(const struct btrfs_path *path);
 /* inode.c */
 blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 				   int mirror_num, unsigned long bio_flags);
-int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
-			   struct page *page, u64 start, u64 end);
+unsigned int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
+				    struct page *page, u64 start, u64 end);
 struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
 					   u64 start, u64 len);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cb26bb246d13..1c019b1dc114 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3135,15 +3135,19 @@ static int check_data_csum(struct inode *inode, struct btrfs_io_bio *io_bio,
  * @bio_offset:	offset to the beginning of the bio (in bytes)
  * @start:	file offset of the range start
  * @end:	file offset of the range end (inclusive)
+ *
+ * Return a bitmap where bit set means a csum mismatch, and bit not set means
+ * csum match.
  */
-int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
-			   struct page *page, u64 start, u64 end)
+unsigned int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
+				    struct page *page, u64 start, u64 end)
 {
 	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	const u32 sectorsize = root->fs_info->sectorsize;
 	u32 pg_off;
+	unsigned int result = 0;
 
 	if (PageChecked(page)) {
 		ClearPageChecked(page);
@@ -3176,10 +3180,14 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
 		}
 		ret = check_data_csum(inode, io_bio, bio_offset, page, pg_off,
 				      page_offset(page) + pg_off);
-		if (ret < 0)
-			return -EIO;
+		if (ret < 0) {
+			const int nr_bit = (pg_off - offset_in_page(start)) >>
+				     root->fs_info->sectorsize_bits;
+
+			result |= (1U << nr_bit);
+		}
 	}
-	return 0;
+	return result;
 }
 
 /*
-- 
2.31.1



* [PATCH v5 2/3] btrfs: submit read time repair only for each corrupted sector
  2021-05-12  4:53 [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
  2021-05-12  4:53 ` [PATCH v5 1/3] btrfs: make btrfs_verify_data_csum() to return a bitmap Qu Wenruo
@ 2021-05-12  4:53 ` Qu Wenruo
  2021-06-07  8:27   ` Qu Wenruo
  2021-05-12  4:53 ` [PATCH v5 3/3] btrfs: remove io_failure_record::in_validation Qu Wenruo
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2021-05-12  4:53 UTC
  To: linux-btrfs

Currently btrfs_submit_read_repair() has some extra checks on whether the
failed bio needs extra validation for repair.

But we can avoid all this extra mechanism if we submit the repair for
each sector.

This way, each read repair can be easily handled without the need to
verify which sector is corrupted.

This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.

So this patch will:
- Introduce btrfs_repair_one_sector()
  The main code submitting repair, which is more or less the same as the
  old btrfs_submit_read_repair().
  But this time, it only repairs one sector.

- Make btrfs_submit_read_repair() handle sectors differently
  There are 3 different cases:
  * Good sector
    We need to release the page and extent, set the range uptodate.

  * Bad sector and failed to submit repair bio
    We need to release the page and extent, but not set the range
    uptodate.

  * Bad sector but repair bio submitted
    The page and extent release will be handled by the submitted repair
    bio. Nothing needs to be done.

  Since btrfs_submit_read_repair() will handle the page and extent
  release now, we need to skip to the next bvec even if we hit some error.

- Change the lifespan of @uptodate in end_bio_extent_readpage()
  Since now btrfs_submit_read_repair() will handle the full bvec
  which contains any corruption, we don't need to bother updating
  @uptodate bit anymore.
  Just make @uptodate a local variable inside the main loop,
  so that any error from one bvec won't affect later bvecs.

- Only export btrfs_repair_one_sector(), unexport
  btrfs_submit_read_repair()
  The only outside caller for read repair is DIO, which already submits
  its repair for just one sector.
  Only export btrfs_repair_one_sector() for DIO.

This patch focuses on the change to the repair path; the extra
validation code is still kept as is, and will be cleaned up later.
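
For illustration only, the three per-sector cases listed above can be
sketched in standalone C. The enum, the function classify_sectors() and
the @submit_ret array (standing in for the return value of submitting one
repair per sector) are assumptions made for this sketch, not kernel code.

```c
#include <assert.h>

/* Hypothetical per-sector outcome, matching the three cases above. */
enum sector_action {
	SECTOR_RELEASE_UPTODATE,	/* good sector: release, set uptodate */
	SECTOR_RELEASE_ERROR,		/* bad sector, repair not submitted */
	SECTOR_REPAIR_IN_FLIGHT,	/* bad sector, repair bio will release */
};

/*
 * Decide what to do with each sector of a failed range.
 * Bit i of @error_bitmap marks sector i as bad; @submit_ret[i] is 0 when
 * the repair for sector i was submitted, or a negative errno otherwise.
 * Returns the first submission error, but still visits every sector so
 * that nothing after a failure is left unreleased.
 */
static int classify_sectors(unsigned int error_bitmap, const int *submit_ret,
			    unsigned int nr_sectors,
			    enum sector_action *actions)
{
	int error = 0;
	unsigned int i;

	assert(nr_sectors <= sizeof(error_bitmap) * 8);

	for (i = 0; i < nr_sectors; i++) {
		if (!(error_bitmap & (1U << i))) {
			actions[i] = SECTOR_RELEASE_UPTODATE;
			continue;
		}
		if (submit_ret[i] == 0) {
			actions[i] = SECTOR_REPAIR_IN_FLIGHT;
			continue;
		}
		/* Record only the first error, keep going. */
		if (!error)
			error = submit_ret[i];
		actions[i] = SECTOR_RELEASE_ERROR;
	}
	return error;
}
```

The point of continuing after a failed submission is the same as in the
commit message: the caller no longer releases the remaining sectors, so
the loop has to.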

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 233 ++++++++++++++++++++++++++++---------------
 fs/btrfs/extent_io.h |  10 +-
 fs/btrfs/inode.c     |   9 +-
 3 files changed, 164 insertions(+), 88 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8b861227daef..85719947fa31 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2494,7 +2494,7 @@ void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
 }
 
 static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
-							     u64 start, u64 end)
+							     u64 start)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct io_failure_record *failrec;
@@ -2502,6 +2502,7 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+	const u32 sectorsize = fs_info->sectorsize;
 	int ret;
 	u64 logical;
 
@@ -2525,7 +2526,7 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 		return ERR_PTR(-ENOMEM);
 
 	failrec->start = start;
-	failrec->len = end - start + 1;
+	failrec->len = sectorsize;
 	failrec->this_mirror = 0;
 	failrec->bio_flags = 0;
 	failrec->in_validation = 0;
@@ -2564,12 +2565,13 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	free_extent_map(em);
 
 	/* Set the bits in the private failure tree */
-	ret = set_extent_bits(failure_tree, start, end,
+	ret = set_extent_bits(failure_tree, start, start + sectorsize - 1,
 			      EXTENT_LOCKED | EXTENT_DIRTY);
 	if (ret >= 0) {
 		ret = set_state_failrec(failure_tree, start, failrec);
 		/* Set the bits in the inode's tree */
-		ret = set_extent_bits(tree, start, end, EXTENT_DAMAGED);
+		ret = set_extent_bits(tree, start, start + sectorsize - 1,
+				      EXTENT_DAMAGED);
 	} else if (ret < 0) {
 		kfree(failrec);
 		return ERR_PTR(ret);
@@ -2697,11 +2699,11 @@ static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
 	return false;
 }
 
-blk_status_t btrfs_submit_read_repair(struct inode *inode,
-				      struct bio *failed_bio, u32 bio_offset,
-				      struct page *page, unsigned int pgoff,
-				      u64 start, u64 end, int failed_mirror,
-				      submit_bio_hook_t *submit_bio_hook)
+int btrfs_repair_one_sector(struct inode *inode,
+			    struct bio *failed_bio, u32 bio_offset,
+			    struct page *page, unsigned int pgoff,
+			    u64 start, int failed_mirror,
+			    submit_bio_hook_t *submit_bio_hook)
 {
 	struct io_failure_record *failrec;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -2719,16 +2721,22 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
 
 	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
-	failrec = btrfs_get_io_failure_record(inode, start, end);
+	failrec = btrfs_get_io_failure_record(inode, start);
 	if (IS_ERR(failrec))
-		return errno_to_blk_status(PTR_ERR(failrec));
-
-	need_validation = btrfs_io_needs_validation(inode, failed_bio);
+		return PTR_ERR(failrec);
 
+	/*
+	 * We will only submit repair for one sector, thus we don't need
+	 * extra validation anymore.
+	 *
+	 * TODO: All those extra validation related code will be cleaned up
+	 * later.
+	 */
+	need_validation = false;
 	if (!btrfs_check_repairable(inode, need_validation, failrec,
 				    failed_mirror)) {
 		free_io_failure(failure_tree, tree, failrec);
-		return BLK_STS_IOERR;
+		return -EIO;
 	}
 
 	repair_bio = btrfs_io_bio_alloc(1);
@@ -2762,7 +2770,120 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
 		free_io_failure(failure_tree, tree, failrec);
 		bio_put(repair_bio);
 	}
-	return status;
+	return blk_status_to_errno(status);
+}
+
+static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+
+	ASSERT(page_offset(page) <= start &&
+	       start + len <= page_offset(page) + PAGE_SIZE);
+
+	/*
+	 * For subapge metadata case, all btrfs_page_* helpers needs page to
+	 * have page::private populated.
+	 * But we can have rare case where the last eb in the page is only
+	 * referred by the IO, and it get released immedately after it's
+	 * read and verified.
+	 *
+	 * This can detach the page private completely.
+	 * In that case, we can just skip the page status update completely,
+	 * as the page has no eb any more.
+	 */
+	if (fs_info->sectorsize < PAGE_SIZE && unlikely(!PagePrivate(page))) {
+		ASSERT(!is_data_inode(page->mapping->host));
+		return;
+	}
+	if (uptodate) {
+		btrfs_page_set_uptodate(fs_info, page, start, len);
+	} else {
+		btrfs_page_clear_uptodate(fs_info, page, start, len);
+		btrfs_page_set_error(fs_info, page, start, len);
+	}
+
+	if (fs_info->sectorsize == PAGE_SIZE)
+		unlock_page(page);
+	else if (is_data_inode(page->mapping->host))
+		/*
+		 * For subpage data, unlock the page if we're the last reader.
+		 * For subpage metadata, page lock is not utilized for read.
+		 */
+		btrfs_subpage_end_reader(fs_info, page, start, len);
+}
+
+static blk_status_t submit_read_repair(struct inode *inode,
+				      struct bio *failed_bio, u32 bio_offset,
+				      struct page *page, unsigned int pgoff,
+				      u64 start, u64 end, int failed_mirror,
+				      unsigned int error_bitmap,
+				      submit_bio_hook_t *submit_bio_hook)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	const u32 sectorsize = fs_info->sectorsize;
+	int error = 0;
+	const int nr_bits = (end + 1 - start) >> fs_info->sectorsize_bits;
+	int i;
+
+	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
+
+	/* We're here because we had some read errors or csum mismatch */
+	ASSERT(error_bitmap);
+
+	/*
+	 * We only get called on buffered IO, thus page must be mapped and bio
+	 * must not be cloned.
+	 */
+	ASSERT(page->mapping && !bio_flagged(failed_bio, BIO_CLONED));
+
+	/* Iterate through all the sectors in the range */
+	for (i = 0; i < nr_bits; i++) {
+		const unsigned int offset = i * sectorsize;
+		struct extent_state *cached = NULL;
+		bool uptodate = false;
+		int ret;
+
+		if (!(error_bitmap & (1U << i))) {
+			/*
+			 * This sector has no error, just end the page read
+			 * and unlock the range.
+			 */
+			uptodate = true;
+			goto next;
+		}
+
+		ret = btrfs_repair_one_sector(inode, failed_bio,
+				bio_offset + offset,
+				page, pgoff + offset, start + offset,
+				failed_mirror, submit_bio_hook);
+		if (!ret) {
+			/*
+			 * We have submitted the read repair, the page release
+			 * will be handled by the endio function of the
+			 * submitted repair bio.
+			 * Thus we don't need to do any thing here.
+			 */
+			continue;
+		}
+		/*
+		 * Repair failed, just record the error but still continue.
+		 * Or the remaining sectors will not be properly unlocked.
+		 */
+		if (!error)
+			error = ret;
+next:
+		end_page_read(page, uptodate, start + offset, sectorsize);
+		if (uptodate)
+			set_extent_uptodate(&BTRFS_I(inode)->io_tree,
+					start + offset,
+					start + offset + sectorsize - 1,
+					&cached, GFP_ATOMIC);
+		unlock_extent_cached_atomic(&BTRFS_I(inode)->io_tree,
+				start + offset,
+				start + offset + sectorsize - 1,
+				&cached);
+	}
+	return errno_to_blk_status(error);
 }
 
 /* lots and lots of room for performance fixes in the end_bio funcs */
@@ -2919,45 +3040,6 @@ static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
 	btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
 }
 
-static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
-
-	ASSERT(page_offset(page) <= start &&
-		start + len <= page_offset(page) + PAGE_SIZE);
-
-	/*
-	 * For subapge metadata case, all btrfs_page_* helpers needs page to
-	 * have page::private populate.
-	 * But we can have rare case where the last eb in the page is only
-	 * referred by the IO, and it get released immedately after it's
-	 * read and verified.
-	 *
-	 * This can detach the page private completely.
-	 * In that case, we can just skip the page status update completely,
-	 * as the page has no eb any more.
-	 */
-	if (fs_info->sectorsize < PAGE_SIZE && unlikely(!PagePrivate(page))) {
-		ASSERT(!is_data_inode(page->mapping->host));
-		return;
-	}
-	if (uptodate) {
-		btrfs_page_set_uptodate(fs_info, page, start, len);
-	} else {
-		btrfs_page_clear_uptodate(fs_info, page, start, len);
-		btrfs_page_set_error(fs_info, page, start, len);
-	}
-
-	if (fs_info->sectorsize == PAGE_SIZE)
-		unlock_page(page);
-	else if (is_data_inode(page->mapping->host))
-		/*
-		 * For subpage data, unlock the page if we're the last reader.
-		 * For subpage metadata, page lock is not utilized for read.
-		 */
-		btrfs_subpage_end_reader(fs_info, page, start, len);
-}
-
 /*
  * Find extent buffer for a givne bytenr.
  *
@@ -3001,7 +3083,6 @@ static struct extent_buffer *find_extent_buffer_readpage(
 static void end_bio_extent_readpage(struct bio *bio)
 {
 	struct bio_vec *bvec;
-	int uptodate = !bio->bi_status;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
 	struct extent_io_tree *tree, *failure_tree;
 	struct processed_extent processed = { 0 };
@@ -3016,10 +3097,12 @@ static void end_bio_extent_readpage(struct bio *bio)
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
+		bool uptodate = !bio->bi_status;
 		struct page *page = bvec->bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 		const u32 sectorsize = fs_info->sectorsize;
+		unsigned int error_bitmap = (unsigned int)-1;
 		u64 start;
 		u64 end;
 		u32 len;
@@ -3054,14 +3137,16 @@ static void end_bio_extent_readpage(struct bio *bio)
 
 		mirror = io_bio->mirror_num;
 		if (likely(uptodate)) {
-			if (is_data_inode(inode))
-				ret = btrfs_verify_data_csum(io_bio,
+			if (is_data_inode(inode)) {
+				error_bitmap = btrfs_verify_data_csum(io_bio,
 						bio_offset, page, start, end);
-			else
+				ret = error_bitmap;
+			} else {
 				ret = btrfs_validate_metadata_buffer(io_bio,
 					page, start, end, mirror);
+			}
 			if (ret)
-				uptodate = 0;
+				uptodate = false;
 			else
 				clean_io_failure(BTRFS_I(inode)->root->fs_info,
 						 failure_tree, tree, start,
@@ -3073,27 +3158,19 @@ static void end_bio_extent_readpage(struct bio *bio)
 			goto readpage_ok;
 
 		if (is_data_inode(inode)) {
-
 			/*
-			 * The generic bio_readpage_error handles errors the
-			 * following way: If possible, new read requests are
-			 * created and submitted and will end up in
-			 * end_bio_extent_readpage as well (if we're lucky,
-			 * not in the !uptodate case). In that case it returns
-			 * 0 and we just go on with the next page in our bio.
-			 * If it can't handle the error it will return -EIO and
-			 * we remain responsible for that page.
+			 * btrfs_submit_read_repair() will handle all the
+			 * good and bad sectors, we just continue to next
+			 * bvec.
 			 */
-			if (!btrfs_submit_read_repair(inode, bio, bio_offset,
-						page,
-						start - page_offset(page),
-						start, end, mirror,
-						btrfs_submit_data_bio)) {
-				uptodate = !bio->bi_status;
-				ASSERT(bio_offset + len > bio_offset);
-				bio_offset += len;
-				continue;
-			}
+			submit_read_repair(inode, bio, bio_offset, page,
+					   start - page_offset(page), start,
+					   end, mirror, error_bitmap,
+					   btrfs_submit_data_bio);
+
+			ASSERT(bio_offset + len > bio_offset);
+			bio_offset += len;
+			continue;
 		} else {
 			struct extent_buffer *eb;
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 1d7bc27719da..c49ce96c8b5d 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -308,11 +308,11 @@ struct io_failure_record {
 };
 
 
-blk_status_t btrfs_submit_read_repair(struct inode *inode,
-				      struct bio *failed_bio, u32 bio_offset,
-				      struct page *page, unsigned int pgoff,
-				      u64 start, u64 end, int failed_mirror,
-				      submit_bio_hook_t *submit_bio_hook);
+int btrfs_repair_one_sector(struct inode *inode,
+			    struct bio *failed_bio, u32 bio_offset,
+			    struct page *page, unsigned int pgoff,
+			    u64 start, int failed_mirror,
+			    submit_bio_hook_t *submit_bio_hook);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 bool find_lock_delalloc_range(struct inode *inode,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1c019b1dc114..6e118b855239 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7938,19 +7938,18 @@ static blk_status_t btrfs_check_read_dio_bio(struct inode *inode,
 						 btrfs_ino(BTRFS_I(inode)),
 						 pgoff);
 			} else {
-				blk_status_t status;
+				int ret;
 
 				ASSERT((start - io_bio->logical) < UINT_MAX);
-				status = btrfs_submit_read_repair(inode,
+				ret = btrfs_repair_one_sector(inode,
 							&io_bio->bio,
 							start - io_bio->logical,
 							bvec.bv_page, pgoff,
 							start,
-							start + sectorsize - 1,
 							io_bio->mirror_num,
 							submit_dio_repair_bio);
-				if (status)
-					err = status;
+				if (ret)
+					err = errno_to_blk_status(ret);
 			}
 			start += sectorsize;
 			ASSERT(bio_offset + sectorsize > bio_offset);
-- 
2.31.1



* [PATCH v5 3/3] btrfs: remove io_failure_record::in_validation
  2021-05-12  4:53 [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
  2021-05-12  4:53 ` [PATCH v5 1/3] btrfs: make btrfs_verify_data_csum() to return a bitmap Qu Wenruo
  2021-05-12  4:53 ` [PATCH v5 2/3] btrfs: submit read time repair only for each corrupted sector Qu Wenruo
@ 2021-05-12  4:53 ` Qu Wenruo
  2021-05-12 13:57 ` [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector David Sterba
  2021-05-13 16:45 ` David Sterba
  4 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2021-05-12  4:53 UTC
  To: linux-btrfs

The io_failure_record::in_validation was introduced to handle a failed bio
which crosses several sectors.
In such a case, we still need to verify which sectors are corrupted.

But since we have changed the way we handle corrupted sectors, by only
submitting repair for each corrupted sector, there is no need for extra
validation any more.

This patch will clean up all io_failure_record::in_validation related
code.
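
For illustration only, the mirror selection that remains once the
validation branch is gone can be sketched in standalone C. The struct
failrec_sketch and the function pick_next_mirror() are assumptions made
for this sketch; they only mirror the this_mirror/failed_mirror handling
left in btrfs_check_repairable() by this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced version of the failure record. */
struct failrec_sketch {
	int this_mirror;	/* mirror to try next, 0 before the first try */
	int failed_mirror;	/* mirror that returned the bad sector */
};

/*
 * Advance to the next mirror to read from, skipping the one that failed.
 * Returns false when all @num_copies mirrors have been exhausted.
 */
static bool pick_next_mirror(struct failrec_sketch *rec, int failed_mirror,
			     int num_copies)
{
	assert(rec);

	rec->failed_mirror = failed_mirror;
	rec->this_mirror++;
	if (rec->this_mirror == failed_mirror)
		rec->this_mirror++;

	return rec->this_mirror <= num_copies;
}
```

With two copies and mirror 1 failing, the first call selects mirror 2 and
a second call reports that no copy is left.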

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 131 +++++++------------------------------------
 fs/btrfs/extent_io.h |   3 +-
 2 files changed, 20 insertions(+), 114 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 85719947fa31..ec4ca91861d8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2426,13 +2426,6 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
 
 	BUG_ON(!failrec->this_mirror);
 
-	if (failrec->in_validation) {
-		/* there was no real error, just free the record */
-		btrfs_debug(fs_info,
-			"clean_io_failure: freeing dummy error at %llu",
-			failrec->start);
-		goto out;
-	}
 	if (sb_rdonly(fs_info->sb))
 		goto out;
 
@@ -2509,9 +2502,8 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	failrec = get_state_failrec(failure_tree, start);
 	if (!IS_ERR(failrec)) {
 		btrfs_debug(fs_info,
-			"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d",
-			failrec->logical, failrec->start, failrec->len,
-			failrec->in_validation);
+	"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu",
+			failrec->logical, failrec->start, failrec->len);
 		/*
 		 * when data can be on disk more than twice, add to failrec here
 		 * (e.g. with a list for failed_mirror) to make
@@ -2529,7 +2521,6 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	failrec->len = sectorsize;
 	failrec->this_mirror = 0;
 	failrec->bio_flags = 0;
-	failrec->in_validation = 0;
 
 	read_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, start, failrec->len);
@@ -2580,7 +2571,7 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	return failrec;
 }
 
-static bool btrfs_check_repairable(struct inode *inode, bool needs_validation,
+static bool btrfs_check_repairable(struct inode *inode,
 				   struct io_failure_record *failrec,
 				   int failed_mirror)
 {
@@ -2600,39 +2591,22 @@ static bool btrfs_check_repairable(struct inode *inode, bool needs_validation,
 		return false;
 	}
 
+	/* The failure record should only contain one sector */
+	ASSERT(failrec->len == fs_info->sectorsize);
+
 	/*
-	 * there are two premises:
-	 *	a) deliver good data to the caller
-	 *	b) correct the bad sectors on disk
+	 * There are two premises:
+	 * a) deliver good data to the caller
+	 * b) correct the bad sectors on disk
+	 *
+	 * Since we're only doing repair for one sector, we only need to get
+	 * a good copy of the failed sector and if we succeed, we have setup
+	 * everything for repair_io_failure to do the rest for us.
 	 */
-	if (needs_validation) {
-		/*
-		 * to fulfill b), we need to know the exact failing sectors, as
-		 * we don't want to rewrite any more than the failed ones. thus,
-		 * we need separate read requests for the failed bio
-		 *
-		 * if the following BUG_ON triggers, our validation request got
-		 * merged. we need separate requests for our algorithm to work.
-		 */
-		BUG_ON(failrec->in_validation);
-		failrec->in_validation = 1;
-		failrec->this_mirror = failed_mirror;
-	} else {
-		/*
-		 * we're ready to fulfill a) and b) alongside. get a good copy
-		 * of the failed sector and if we succeed, we have setup
-		 * everything for repair_io_failure to do the rest for us.
-		 */
-		if (failrec->in_validation) {
-			BUG_ON(failrec->this_mirror != failed_mirror);
-			failrec->in_validation = 0;
-			failrec->this_mirror = 0;
-		}
-		failrec->failed_mirror = failed_mirror;
+	failrec->failed_mirror = failed_mirror;
+	failrec->this_mirror++;
+	if (failrec->this_mirror == failed_mirror)
 		failrec->this_mirror++;
-		if (failrec->this_mirror == failed_mirror)
-			failrec->this_mirror++;
-	}
 
 	if (failrec->this_mirror > num_copies) {
 		btrfs_debug(fs_info,
@@ -2644,61 +2618,6 @@ static bool btrfs_check_repairable(struct inode *inode, bool needs_validation,
 	return true;
 }
 
-static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
-{
-	u64 len = 0;
-	const u32 blocksize = inode->i_sb->s_blocksize;
-
-	/*
-	 * If bi_status is BLK_STS_OK, then this was a checksum error, not an
-	 * I/O error. In this case, we already know exactly which sector was
-	 * bad, so we don't need to validate.
-	 */
-	if (bio->bi_status == BLK_STS_OK)
-		return false;
-
-	/*
-	 * For subpage case, read bio are always submitted as multiple-sector
-	 * bio if the range is in the same page.
-	 * For now, let's just skip the validation, and do page sized repair.
-	 *
-	 * This reduce the granularity for repair, meaning if we have two
-	 * copies with different csum mismatch at different location, we're
-	 * unable to repair in subpage case.
-	 *
-	 * TODO: Make validation code to be fully subpage compatible
-	 */
-	if (blocksize < PAGE_SIZE)
-		return false;
-	/*
-	 * We need to validate each sector individually if the failed I/O was
-	 * for multiple sectors.
-	 *
-	 * There are a few possible bios that can end up here:
-	 * 1. A buffered read bio, which is not cloned.
-	 * 2. A direct I/O read bio, which is cloned.
-	 * 3. A (buffered or direct) repair bio, which is not cloned.
-	 *
-	 * For cloned bios (case 2), we can get the size from
-	 * btrfs_io_bio->iter; for non-cloned bios (cases 1 and 3), we can get
-	 * it from the bvecs.
-	 */
-	if (bio_flagged(bio, BIO_CLONED)) {
-		if (btrfs_io_bio(bio)->iter.bi_size > blocksize)
-			return true;
-	} else {
-		struct bio_vec *bvec;
-		int i;
-
-		bio_for_each_bvec_all(bvec, bio, i) {
-			len += bvec->bv_len;
-			if (len > blocksize)
-				return true;
-		}
-	}
-	return false;
-}
-
 int btrfs_repair_one_sector(struct inode *inode,
 			    struct bio *failed_bio, u32 bio_offset,
 			    struct page *page, unsigned int pgoff,
@@ -2711,7 +2630,6 @@ int btrfs_repair_one_sector(struct inode *inode,
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct btrfs_io_bio *failed_io_bio = btrfs_io_bio(failed_bio);
 	const int icsum = bio_offset >> fs_info->sectorsize_bits;
-	bool need_validation;
 	struct bio *repair_bio;
 	struct btrfs_io_bio *repair_io_bio;
 	blk_status_t status;
@@ -2725,16 +2643,7 @@ int btrfs_repair_one_sector(struct inode *inode,
 	if (IS_ERR(failrec))
 		return PTR_ERR(failrec);
 
-	/*
-	 * We will only submit repair for one sector, thus we don't need
-	 * extra validation anymore.
-	 *
-	 * TODO: All those extra validation related code will be cleaned up
-	 * later.
-	 */
-	need_validation = false;
-	if (!btrfs_check_repairable(inode, need_validation, failrec,
-				    failed_mirror)) {
+	if (!btrfs_check_repairable(inode, failrec, failed_mirror)) {
 		free_io_failure(failure_tree, tree, failrec);
 		return -EIO;
 	}
@@ -2742,8 +2651,6 @@ int btrfs_repair_one_sector(struct inode *inode,
 	repair_bio = btrfs_io_bio_alloc(1);
 	repair_io_bio = btrfs_io_bio(repair_bio);
 	repair_bio->bi_opf = REQ_OP_READ;
-	if (need_validation)
-		repair_bio->bi_opf |= REQ_FAILFAST_DEV;
 	repair_bio->bi_end_io = failed_bio->bi_end_io;
 	repair_bio->bi_iter.bi_sector = failrec->logical >> 9;
 	repair_bio->bi_private = failed_bio->bi_private;
@@ -2761,8 +2668,8 @@ int btrfs_repair_one_sector(struct inode *inode,
 	repair_io_bio->iter = repair_bio->bi_iter;
 
 	btrfs_debug(btrfs_sb(inode->i_sb),
-"repair read error: submitting new read to mirror %d, in_validation=%d",
-		    failrec->this_mirror, failrec->in_validation);
+		    "repair read error: submitting new read to mirror %d",
+		    failrec->this_mirror);
 
 	status = submit_bio_hook(inode, repair_bio, failrec->this_mirror,
 				 failrec->bio_flags);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c49ce96c8b5d..d5c324a7a3d8 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -292,7 +292,7 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num);
  * When IO fails, either with EIO or csum verification fails, we
  * try other mirrors that might have a good copy of the data.  This
  * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the page is set up to date
+ * mirrors.  If another mirror has good data, the sector is set up to date
  * and things continue.  If a good mirror can't be found, the original
  * bio end_io callback is called to indicate things have failed.
  */
@@ -304,7 +304,6 @@ struct io_failure_record {
 	unsigned long bio_flags;
 	int this_mirror;
 	int failed_mirror;
-	int in_validation;
 };
 
 
-- 
2.31.1



* Re: [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector
  2021-05-12  4:53 [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
                   ` (2 preceding siblings ...)
  2021-05-12  4:53 ` [PATCH v5 3/3] btrfs: remove io_failure_record::in_validation Qu Wenruo
@ 2021-05-12 13:57 ` David Sterba
  2021-05-13 16:45 ` David Sterba
  4 siblings, 0 replies; 7+ messages in thread
From: David Sterba @ 2021-05-12 13:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, May 12, 2021 at 12:53:27PM +0800, Qu Wenruo wrote:
> v5:
> - Fix a bug where we grab wrong fs_info from DIO page
>   Exposed by btrfs/215.
>   And for DIO case we don't need end_page_read() and extent unrelease
>   call at all.
> 
> - Unexport btrfs_submit_read_repair(), export btrfs_repair_one_sector()
>   Since DIO only needs to repair one sector, unexport
>   btrfs_submit_read_repair() and just export btrfs_repair_one_sector().

Replaced in for-next, thanks.


* Re: [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector
  2021-05-12  4:53 [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
                   ` (3 preceding siblings ...)
  2021-05-12 13:57 ` [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector David Sterba
@ 2021-05-13 16:45 ` David Sterba
  4 siblings, 0 replies; 7+ messages in thread
From: David Sterba @ 2021-05-13 16:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, May 12, 2021 at 12:53:27PM +0800, Qu Wenruo wrote:
> v5:
> - Fix a bug where we grab wrong fs_info from DIO page
>   Exposed by btrfs/215.
>   And for DIO case we don't need end_page_read() and extent unrelease
>   call at all.
> 
> - Unexport btrfs_submit_read_repair(), export btrfs_repair_one_sector()
>   Since DIO only needs to repair one sector, unexport
>   btrfs_submit_read_repair() and just export btrfs_repair_one_sector().

Seems to work, so I'm moving it to misc-next. If you have any fixups,
please send them as incremental changes and I'll fold them in. Thanks.


* Re: [PATCH v5 2/3] btrfs: submit read time repair only for each corrupted sector
  2021-05-12  4:53 ` [PATCH v5 2/3] btrfs: submit read time repair only for each corrupted sector Qu Wenruo
@ 2021-06-07  8:27   ` Qu Wenruo
  0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2021-06-07  8:27 UTC (permalink / raw)
  To: linux-btrfs



On 2021/5/12 12:53 PM, Qu Wenruo wrote:
> Currently btrfs_submit_read_repair() has some extra check on whether the
> failed bio needs extra validation for repair.
> 
> But we can avoid all these extra mechanism if we submit the repair for
> each sector.
> 
> By this, each read repair can be easily handled without the need to
> verify which sector is corrupted.
> 
> This will also benefit subpage, as one subpage bvec can contain several
> sectors, making the extra verification more complex.
> 
> So this patch will:
> - Introduce repair_one_sector()
>    The main code submitting repair, which is more or less the same as old
>    btrfs_submit_read_repair().
>    But this time, it only repairs one sector.
> 
> - Make btrfs_submit_read_repair() to handle sectors differently
>    There are 3 different cases:
>    * Good sector
>      We need to release the page and extent, set the range uptodate.
> 
>    * Bad sector and failed to submit repair bio
>      We need to release the page and extent, but not set the range
>      uptodate.
> 
>    * Bad sector but repair bio submitted
>      The page and extent release will be handled by the submitted repair
>      bio. Nothing needs to be done.
> 
>    Since btrfs_submit_read_repair() will handle the page and extent
>    release now, we need to skip to next bvec even we hit some error.
> 
> - Change the lifespan of @uptodate in end_bio_extent_readpage()
>    Since now btrfs_submit_read_repair() will handle the full bvec
>    which contains any corruption, we don't need to bother updating
>    @uptodate bit anymore.
>    Just let @uptodate to be local variable inside the main loop,
>    so that any error from one bvec won't affect later bvec.
> 
> - Only export btrfs_repair_one_sector(), unexport
>    btrfs_submit_read_repair()
>    The only outside caller for read repair is DIO, which already submit
>    its repair for just one sector.
>    Only export btrfs_repair_one_sector() for DIO.
> 
> This patch will focus on the change on the repair path, the extra
> validation code is still kept as is, and will be cleaned up later.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 233 ++++++++++++++++++++++++++++---------------
>   fs/btrfs/extent_io.h |  10 +-
>   fs/btrfs/inode.c     |   9 +-
>   3 files changed, 164 insertions(+), 88 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 8b861227daef..85719947fa31 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2494,7 +2494,7 @@ void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
>   }
>   
>   static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
> -							     u64 start, u64 end)
> +							     u64 start)
>   {
>   	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   	struct io_failure_record *failrec;
> @@ -2502,6 +2502,7 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
>   	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
>   	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>   	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
> +	const u32 sectorsize = fs_info->sectorsize;
>   	int ret;
>   	u64 logical;
>   
> @@ -2525,7 +2526,7 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
>   		return ERR_PTR(-ENOMEM);
>   
>   	failrec->start = start;
> -	failrec->len = end - start + 1;
> +	failrec->len = sectorsize;
>   	failrec->this_mirror = 0;
>   	failrec->bio_flags = 0;
>   	failrec->in_validation = 0;
> @@ -2564,12 +2565,13 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
>   	free_extent_map(em);
>   
>   	/* Set the bits in the private failure tree */
> -	ret = set_extent_bits(failure_tree, start, end,
> +	ret = set_extent_bits(failure_tree, start, start + sectorsize - 1,
>   			      EXTENT_LOCKED | EXTENT_DIRTY);
>   	if (ret >= 0) {
>   		ret = set_state_failrec(failure_tree, start, failrec);
>   		/* Set the bits in the inode's tree */
> -		ret = set_extent_bits(tree, start, end, EXTENT_DAMAGED);
> +		ret = set_extent_bits(tree, start, start + sectorsize - 1,
> +				      EXTENT_DAMAGED);
>   	} else if (ret < 0) {
>   		kfree(failrec);
>   		return ERR_PTR(ret);
> @@ -2697,11 +2699,11 @@ static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
>   	return false;
>   }
>   
> -blk_status_t btrfs_submit_read_repair(struct inode *inode,
> -				      struct bio *failed_bio, u32 bio_offset,
> -				      struct page *page, unsigned int pgoff,
> -				      u64 start, u64 end, int failed_mirror,
> -				      submit_bio_hook_t *submit_bio_hook)
> +int btrfs_repair_one_sector(struct inode *inode,
> +			    struct bio *failed_bio, u32 bio_offset,
> +			    struct page *page, unsigned int pgoff,
> +			    u64 start, int failed_mirror,
> +			    submit_bio_hook_t *submit_bio_hook)
>   {
>   	struct io_failure_record *failrec;
>   	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> @@ -2719,16 +2721,22 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
>   
>   	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
>   
> -	failrec = btrfs_get_io_failure_record(inode, start, end);
> +	failrec = btrfs_get_io_failure_record(inode, start);
>   	if (IS_ERR(failrec))
> -		return errno_to_blk_status(PTR_ERR(failrec));
> -
> -	need_validation = btrfs_io_needs_validation(inode, failed_bio);
> +		return PTR_ERR(failrec);
>   
> +	/*
> +	 * We will only submit repair for one sector, thus we don't need
> +	 * extra validation anymore.
> +	 *
> +	 * TODO: All those extra validation related code will be cleaned up
> +	 * later.
> +	 */
> +	need_validation = false;
>   	if (!btrfs_check_repairable(inode, need_validation, failrec,
>   				    failed_mirror)) {
>   		free_io_failure(failure_tree, tree, failrec);
> -		return BLK_STS_IOERR;
> +		return -EIO;
>   	}
>   
>   	repair_bio = btrfs_io_bio_alloc(1);
> @@ -2762,7 +2770,120 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
>   		free_io_failure(failure_tree, tree, failrec);
>   		bio_put(repair_bio);
>   	}
> -	return status;
> +	return blk_status_to_errno(status);
> +}
> +
> +static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> +
> +	ASSERT(page_offset(page) <= start &&
> +	       start + len <= page_offset(page) + PAGE_SIZE);
> +
> +	/*
> +	 * For subapge metadata case, all btrfs_page_* helpers needs page to
> +	 * have page::private populated.
> +	 * But we can have rare case where the last eb in the page is only
> +	 * referred by the IO, and it get released immedately after it's
> +	 * read and verified.
> +	 *
> +	 * This can detach the page private completely.
> +	 * In that case, we can just skip the page status update completely,
> +	 * as the page has no eb any more.
> +	 */

Today I hit a very rare crash in btrfs_page_set_uptodate() for metadata,
where page::private had already been cleared.

It looks like extent buffer freeing is still involved.

There is still a race window in which the last eb we read gets freed
between our PagePrivate() check and our call to btrfs_page_set_uptodate().

This means the PagePrivate() check is unsafe, and we need proper
protection for this case.

Thankfully, we have subpage::readers, which can easily be converted into a
common member for both data and metadata.

For data the behavior stays the same, but for metadata we only need to
skip the page unlocking.

Then teach page_range_has_eb() to take subpage::readers into
consideration, so that we won't detach the subpage while end_page_read()
is still pending.

I'll send out a proper fix for this particular, rare case.
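
As a rough illustration of the idea only (not the actual fix), the
reader-count protection can be modelled in user space as below. The
struct and the model_*() helpers are hypothetical stand-ins for the
subpage structure, subpage::readers and page_range_has_eb(). The sketch
also leaves out the locking the kernel code would need to make the
detach check and the eb count consistent with each other.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the per-page subpage state. */
struct subpage_model {
	atomic_int readers;	/* end_page_read() calls still pending */
	int nr_ebs;		/* extent buffers still attached to the page */
};

/* Called when a read is submitted for the page, for data or metadata. */
static void model_start_reader(struct subpage_model *sp)
{
	atomic_fetch_add(&sp->readers, 1);
}

/* Called at the end of the read; returns true for the last reader. */
static bool model_end_reader(struct subpage_model *sp)
{
	return atomic_fetch_sub(&sp->readers, 1) == 1;
}

/*
 * Models the proposed page_range_has_eb() change: a pending reader pins
 * the subpage even after the last extent buffer has been released.
 */
static bool model_can_detach(struct subpage_model *sp)
{
	return sp->nr_ebs == 0 && atomic_load(&sp->readers) == 0;
}
```

With this accounting, freeing the last eb while a read is still in
flight leaves the subpage attached until the reader count drops to zero.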

Thanks,
Qu

> +	if (fs_info->sectorsize < PAGE_SIZE && unlikely(!PagePrivate(page))) {
> +		ASSERT(!is_data_inode(page->mapping->host));
> +		return;
> +	}
> +	if (uptodate) {
> +		btrfs_page_set_uptodate(fs_info, page, start, len);
> +	} else {
> +		btrfs_page_clear_uptodate(fs_info, page, start, len);
> +		btrfs_page_set_error(fs_info, page, start, len);
> +	}
> +
> +	if (fs_info->sectorsize == PAGE_SIZE)
> +		unlock_page(page);
> +	else if (is_data_inode(page->mapping->host))
> +		/*
> +		 * For subpage data, unlock the page if we're the last reader.
> +		 * For subpage metadata, page lock is not utilized for read.
> +		 */
> +		btrfs_subpage_end_reader(fs_info, page, start, len);
> +}
> +
> +static blk_status_t submit_read_repair(struct inode *inode,
> +				      struct bio *failed_bio, u32 bio_offset,
> +				      struct page *page, unsigned int pgoff,
> +				      u64 start, u64 end, int failed_mirror,
> +				      unsigned int error_bitmap,
> +				      submit_bio_hook_t *submit_bio_hook)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	const u32 sectorsize = fs_info->sectorsize;
> +	int error = 0;
> +	const int nr_bits = (end + 1 - start) >> fs_info->sectorsize_bits;
> +	int i;
> +
> +	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
> +
> +	/* We're here because we had some read errors or csum mismatch */
> +	ASSERT(error_bitmap);
> +
> +	/*
> +	 * We only get called on buffered IO, thus page must be mapped and bio
> +	 * must not be cloned.
> +	 */
> +	ASSERT(page->mapping && !bio_flagged(failed_bio, BIO_CLONED));
> +
> +	/* Iterate through all the sectors in the range */
> +	for (i = 0; i < nr_bits; i++) {
> +		const unsigned int offset = i * sectorsize;
> +		struct extent_state *cached = NULL;
> +		bool uptodate = false;
> +		int ret;
> +
> +		if (!(error_bitmap & (1U << i))) {
> +			/*
> +			 * This sector has no error, just end the page read
> +			 * and unlock the range.
> +			 */
> +			uptodate = true;
> +			goto next;
> +		}
> +
> +		ret = btrfs_repair_one_sector(inode, failed_bio,
> +				bio_offset + offset,
> +				page, pgoff + offset, start + offset,
> +				failed_mirror, submit_bio_hook);
> +		if (!ret) {
> +			/*
> +			 * We have submitted the read repair, the page release
> +			 * will be handled by the endio function of the
> +			 * submitted repair bio.
> +			 * Thus we don't need to do any thing here.
> +			 */
> +			continue;
> +		}
> +		/*
> +		 * Repair failed, just record the error but still continue.
> +		 * Or the remaining sectors will not be properly unlocked.
> +		 */
> +		if (!error)
> +			error = ret;
> +next:
> +		end_page_read(page, uptodate, start + offset, sectorsize);
> +		if (uptodate)
> +			set_extent_uptodate(&BTRFS_I(inode)->io_tree,
> +					start + offset,
> +					start + offset + sectorsize - 1,
> +					&cached, GFP_ATOMIC);
> +		unlock_extent_cached_atomic(&BTRFS_I(inode)->io_tree,
> +				start + offset,
> +				start + offset + sectorsize - 1,
> +				&cached);
> +	}
> +	return errno_to_blk_status(error);
>   }
>   
>   /* lots and lots of room for performance fixes in the end_bio funcs */
> @@ -2919,45 +3040,6 @@ static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
>   	btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
>   }
>   
> -static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> -{
> -	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> -
> -	ASSERT(page_offset(page) <= start &&
> -		start + len <= page_offset(page) + PAGE_SIZE);
> -
> -	/*
> -	 * For subapge metadata case, all btrfs_page_* helpers needs page to
> -	 * have page::private populate.
> -	 * But we can have rare case where the last eb in the page is only
> -	 * referred by the IO, and it get released immedately after it's
> -	 * read and verified.
> -	 *
> -	 * This can detach the page private completely.
> -	 * In that case, we can just skip the page status update completely,
> -	 * as the page has no eb any more.
> -	 */
> -	if (fs_info->sectorsize < PAGE_SIZE && unlikely(!PagePrivate(page))) {
> -		ASSERT(!is_data_inode(page->mapping->host));
> -		return;
> -	}
> -	if (uptodate) {
> -		btrfs_page_set_uptodate(fs_info, page, start, len);
> -	} else {
> -		btrfs_page_clear_uptodate(fs_info, page, start, len);
> -		btrfs_page_set_error(fs_info, page, start, len);
> -	}
> -
> -	if (fs_info->sectorsize == PAGE_SIZE)
> -		unlock_page(page);
> -	else if (is_data_inode(page->mapping->host))
> -		/*
> -		 * For subpage data, unlock the page if we're the last reader.
> -		 * For subpage metadata, page lock is not utilized for read.
> -		 */
> -		btrfs_subpage_end_reader(fs_info, page, start, len);
> -}
> -
>   /*
>    * Find extent buffer for a givne bytenr.
>    *
> @@ -3001,7 +3083,6 @@ static struct extent_buffer *find_extent_buffer_readpage(
>   static void end_bio_extent_readpage(struct bio *bio)
>   {
>   	struct bio_vec *bvec;
> -	int uptodate = !bio->bi_status;
>   	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
>   	struct extent_io_tree *tree, *failure_tree;
>   	struct processed_extent processed = { 0 };
> @@ -3016,10 +3097,12 @@ static void end_bio_extent_readpage(struct bio *bio)
>   
>   	ASSERT(!bio_flagged(bio, BIO_CLONED));
>   	bio_for_each_segment_all(bvec, bio, iter_all) {
> +		bool uptodate = !bio->bi_status;
>   		struct page *page = bvec->bv_page;
>   		struct inode *inode = page->mapping->host;
>   		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   		const u32 sectorsize = fs_info->sectorsize;
> +		unsigned int error_bitmap = (unsigned int)-1;
>   		u64 start;
>   		u64 end;
>   		u32 len;
> @@ -3054,14 +3137,16 @@ static void end_bio_extent_readpage(struct bio *bio)
>   
>   		mirror = io_bio->mirror_num;
>   		if (likely(uptodate)) {
> -			if (is_data_inode(inode))
> -				ret = btrfs_verify_data_csum(io_bio,
> +			if (is_data_inode(inode)) {
> +				error_bitmap = btrfs_verify_data_csum(io_bio,
>   						bio_offset, page, start, end);
> -			else
> +				ret = error_bitmap;
> +			} else {
>   				ret = btrfs_validate_metadata_buffer(io_bio,
>   					page, start, end, mirror);
> +			}
>   			if (ret)
> -				uptodate = 0;
> +				uptodate = false;
>   			else
>   				clean_io_failure(BTRFS_I(inode)->root->fs_info,
>   						 failure_tree, tree, start,
> @@ -3073,27 +3158,19 @@ static void end_bio_extent_readpage(struct bio *bio)
>   			goto readpage_ok;
>   
>   		if (is_data_inode(inode)) {
> -
>   			/*
> -			 * The generic bio_readpage_error handles errors the
> -			 * following way: If possible, new read requests are
> -			 * created and submitted and will end up in
> -			 * end_bio_extent_readpage as well (if we're lucky,
> -			 * not in the !uptodate case). In that case it returns
> -			 * 0 and we just go on with the next page in our bio.
> -			 * If it can't handle the error it will return -EIO and
> -			 * we remain responsible for that page.
> +			 * btrfs_submit_read_repair() will handle all the
> +			 * good and bad sectors, we just continue to next
> +			 * bvec.
>   			 */
> -			if (!btrfs_submit_read_repair(inode, bio, bio_offset,
> -						page,
> -						start - page_offset(page),
> -						start, end, mirror,
> -						btrfs_submit_data_bio)) {
> -				uptodate = !bio->bi_status;
> -				ASSERT(bio_offset + len > bio_offset);
> -				bio_offset += len;
> -				continue;
> -			}
> +			submit_read_repair(inode, bio, bio_offset, page,
> +					   start - page_offset(page), start,
> +					   end, mirror, error_bitmap,
> +					   btrfs_submit_data_bio);
> +
> +			ASSERT(bio_offset + len > bio_offset);
> +			bio_offset += len;
> +			continue;
>   		} else {
>   			struct extent_buffer *eb;
>   
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 1d7bc27719da..c49ce96c8b5d 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -308,11 +308,11 @@ struct io_failure_record {
>   };
>   
>   
> -blk_status_t btrfs_submit_read_repair(struct inode *inode,
> -				      struct bio *failed_bio, u32 bio_offset,
> -				      struct page *page, unsigned int pgoff,
> -				      u64 start, u64 end, int failed_mirror,
> -				      submit_bio_hook_t *submit_bio_hook);
> +int btrfs_repair_one_sector(struct inode *inode,
> +			    struct bio *failed_bio, u32 bio_offset,
> +			    struct page *page, unsigned int pgoff,
> +			    u64 start, int failed_mirror,
> +			    submit_bio_hook_t *submit_bio_hook);
>   
>   #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>   bool find_lock_delalloc_range(struct inode *inode,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1c019b1dc114..6e118b855239 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7938,19 +7938,18 @@ static blk_status_t btrfs_check_read_dio_bio(struct inode *inode,
>   						 btrfs_ino(BTRFS_I(inode)),
>   						 pgoff);
>   			} else {
> -				blk_status_t status;
> +				int ret;
>   
>   				ASSERT((start - io_bio->logical) < UINT_MAX);
> -				status = btrfs_submit_read_repair(inode,
> +				ret = btrfs_repair_one_sector(inode,
>   							&io_bio->bio,
>   							start - io_bio->logical,
>   							bvec.bv_page, pgoff,
>   							start,
> -							start + sectorsize - 1,
>   							io_bio->mirror_num,
>   							submit_dio_repair_bio);
> -				if (status)
> -					err = status;
> +				if (ret)
> +					err = errno_to_blk_status(ret);
>   			}
>   			start += sectorsize;
>   			ASSERT(bio_offset + sectorsize > bio_offset);
> 
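
For readers following the quoted commit message, the three per-sector
outcomes can be modelled in user space as below. This is a minimal
sketch, not the kernel code: classify_sectors(), the enum and
demo_repair() are hypothetical names, and the callback merely stands in
for btrfs_repair_one_sector() (0 when a repair bio was submitted, a
negative errno otherwise).

```c
#include <assert.h>
#include <errno.h>

enum sector_action {
	SECTOR_GOOD,			/* release and set the range uptodate */
	SECTOR_REPAIR_FAILED,		/* release, but do not set uptodate */
	SECTOR_REPAIR_SUBMITTED,	/* the repair bio's endio releases it */
};

/* Hypothetical stand-in for btrfs_repair_one_sector(). */
typedef int (*repair_fn)(unsigned int sector_index);

/* Demo stub: pretend only sector 2 has no other mirror to read from. */
static int demo_repair(unsigned int sector_index)
{
	return sector_index == 2 ? -EIO : 0;
}

/*
 * Walk every sector of one bvec (nr_bits must not exceed the bitmap
 * width). Bit i set in error_bitmap means sector i is bad. The first
 * error is recorded, but the loop continues so that every remaining
 * sector still gets handled.
 */
static int classify_sectors(unsigned int error_bitmap, unsigned int nr_bits,
			    repair_fn repair, enum sector_action *out)
{
	int error = 0;

	for (unsigned int i = 0; i < nr_bits; i++) {
		int ret;

		if (!(error_bitmap & (1U << i))) {
			out[i] = SECTOR_GOOD;
			continue;
		}
		ret = repair(i);
		if (!ret) {
			out[i] = SECTOR_REPAIR_SUBMITTED;
			continue;
		}
		if (!error)
			error = ret;
		out[i] = SECTOR_REPAIR_FAILED;
	}
	return error;
}
```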



end of thread, other threads:[~2021-06-07  8:28 UTC | newest]

Thread overview: 7+ messages
2021-05-12  4:53 [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector Qu Wenruo
2021-05-12  4:53 ` [PATCH v5 1/3] btrfs: make btrfs_verify_data_csum() to return a bitmap Qu Wenruo
2021-05-12  4:53 ` [PATCH v5 2/3] btrfs: submit read time repair only for each corrupted sector Qu Wenruo
2021-06-07  8:27   ` Qu Wenruo
2021-05-12  4:53 ` [PATCH v5 3/3] btrfs: remove io_failure_record::in_validation Qu Wenruo
2021-05-12 13:57 ` [PATCH v5 0/3] btrfs: make read time repair to be only submitted for each corrupted sector David Sterba
2021-05-13 16:45 ` David Sterba
