* [PATCH RFC 0/2] btrfs: make read repair work in synchronous mode
@ 2022-03-25  9:42 Qu Wenruo
  2022-03-25  9:42 ` [PATCH RFC 1/2] btrfs: introduce a pure data checksum checking helper Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-03-25  9:42 UTC (permalink / raw)
  To: linux-btrfs

The first patch is just a preparation for the 2nd patch, which is also
the core.

It will make repair_one_sector() wait for the read from the other copies to
finish before returning.

This will make the code easier to read, but it brings a huge drop in
concurrency and performance for read-repair.

My only justification is that read-repair should be a cold path, so we may be
able to afford the change.

Qu Wenruo (2):
  btrfs: introduce a pure data checksum checking helper
  btrfs: do data read repair in synchronous mode

 fs/btrfs/ctree.h     |   2 +
 fs/btrfs/extent_io.c | 111 ++++++++++++++++++++++++++++++-------------
 fs/btrfs/extent_io.h |   3 +-
 fs/btrfs/inode.c     |  73 ++++++++++++++--------------
 4 files changed, 120 insertions(+), 69 deletions(-)

-- 
2.35.1



* [PATCH RFC 1/2] btrfs: introduce a pure data checksum checking helper
  2022-03-25  9:42 [PATCH RFC 0/2] btrfs: make read repair work in synchronous mode Qu Wenruo
@ 2022-03-25  9:42 ` Qu Wenruo
  2022-03-25 10:14   ` Johannes Thumshirn
  2022-03-25  9:42 ` [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode Qu Wenruo
  2022-04-19 19:19 ` [PATCH RFC 0/2] btrfs: make read repair work " David Sterba
  2 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2022-03-25  9:42 UTC (permalink / raw)
  To: linux-btrfs

Although we have several pieces of data csum verification code, we never had a
function that just verifies the checksum of one sector.

Function check_data_csum() does extra work for error reporting, thus it
requires a lot of extra context like the file offset, bio_offset, etc.

Function btrfs_verify_data_csum() is even worse: it uses the page Checked
flag, which means it cannot be used for direct IO pages.

Here we introduce a new helper, btrfs_check_data_sector(), which only accepts
a sector inside a page and a pointer to the expected checksum.

We use this function to implement check_data_csum(), and export it for the
incoming patch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++------------
 2 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 09b7b0b2d016..dd26f6995925 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3254,6 +3254,8 @@ u64 btrfs_file_extent_end(const struct btrfs_path *path);
 /* inode.c */
 blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 				   int mirror_num, unsigned long bio_flags);
+int btrfs_check_data_sector(struct btrfs_fs_info *fs_info, struct page *page,
+			    u32 pgoff, u8 *csum_expected);
 unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
 				    u32 bio_offset, struct page *page,
 				    u64 start, u64 end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 179d479b72e9..3ba1315784f6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3254,6 +3254,37 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 				       finish_ordered_fn, uptodate);
 }
 
+/*
+ * Just verify one sector of csum, without any extra handling.
+ *
+ * Functions like btrfs_verify_data_csum() have extra handling, like setting
+ * page flags, which is not suitable for call sites such as direct IO.
+ *
+ * This pure csum check can be used by call sites which need to handle pages
+ * from both buffered and direct IO.
+ */
+int btrfs_check_data_sector(struct btrfs_fs_info *fs_info, struct page *page,
+			    u32 pgoff, u8 *csum_expected)
+{
+	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+	char *kaddr;
+	const u32 len = fs_info->sectorsize;
+	const u32 csum_size = fs_info->csum_size;
+	u8 csum[BTRFS_CSUM_SIZE];
+
+	ASSERT(pgoff + len <= PAGE_SIZE);
+
+	kaddr = kmap_atomic(page);
+	shash->tfm = fs_info->csum_shash;
+
+	crypto_shash_digest(shash, kaddr + pgoff, len, csum);
+	kunmap_atomic(kaddr);
+
+	if (memcmp(csum, csum_expected, csum_size))
+		return -EIO;
+	return 0;
+}
+
 /*
  * check_data_csum - verify checksum of one sector of uncompressed data
  * @inode:	inode
@@ -3264,14 +3295,15 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
  * @start:	logical offset in the file
  *
  * The length of such check is always one sector size.
+ *
+ * When a csum mismatch is detected, we will also report the error and fill
+ * the corrupted range with zero (thus it needs the extra parameters).
  */
 static int check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
 			   u32 bio_offset, struct page *page, u32 pgoff,
 			   u64 start)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
-	char *kaddr;
 	const u32 len = fs_info->sectorsize;
 	const u32 csum_size = fs_info->csum_size;
 	unsigned int offset_sectors;
@@ -3283,17 +3315,9 @@ static int check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
 	offset_sectors = bio_offset >> fs_info->sectorsize_bits;
 	csum_expected = ((u8 *)bbio->csum) + offset_sectors * csum_size;
 
-	kaddr = kmap_atomic(page);
-	shash->tfm = fs_info->csum_shash;
-
-	crypto_shash_digest(shash, kaddr + pgoff, len, csum);
-	kunmap_atomic(kaddr);
-
-	if (memcmp(csum, csum_expected, csum_size))
-		goto zeroit;
+	if (!btrfs_check_data_sector(fs_info, page, pgoff, csum_expected))
+		return 0;
 
-	return 0;
-zeroit:
 	btrfs_print_data_csum_error(BTRFS_I(inode), start, csum, csum_expected,
 				    bbio->mirror_num);
 	if (bbio->device)
-- 
2.35.1



* [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode
  2022-03-25  9:42 [PATCH RFC 0/2] btrfs: make read repair work in synchronous mode Qu Wenruo
  2022-03-25  9:42 ` [PATCH RFC 1/2] btrfs: introduce a pure data checksum checking helper Qu Wenruo
@ 2022-03-25  9:42 ` Qu Wenruo
  2022-03-25 10:27   ` Christoph Hellwig
  2022-04-19 19:19 ` [PATCH RFC 0/2] btrfs: make read repair work " David Sterba
  2 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2022-03-25  9:42 UTC (permalink / raw)
  To: linux-btrfs

[CRYPTIC READ REPAIR]
Currently if we hit a sector with a mismatched csum, even though we're already
in the endio workqueue, we still submit a new bio to read the same sector from
a different mirror, and let the endio function handle the new bio again.

Such an iterative call chain is not only harder to grasp, but also requires an
extra failure record mechanism to track which mirror has corrupted data.

[SYNCHRONOUS REPAIR ADVANTAGE]
However, in scrub we go another way, by waiting for the reads from the other
mirrors to finish.

Such synchronous repair makes the code much easier to read, and requires no
extra failure record.

Another advantage is that we no longer need the submit_bio_hook for
submit_read_repair(): the new repair bio only needs to go through
btrfs_map_bio(), so we don't need different handling for submitting it.

[SYNCHRONOUS REPAIR DISADVANTAGE]
The biggest problem with synchronous repair is performance.
Scrub can afford it as it runs in the background.

But for regular reads, waiting for synchronous repair can be painfully slow in
the worst case.

If we have a 128K bio and all of its data is corrupted, every sector needs a
read from another copy.  Synchronous repair will then repair the first sector,
then the 2nd, and so on up to the 32nd sector (128K / 4K sectorsize), greatly
slowing down the read.

[REASON FOR RFC]
I'm not sure whether such a huge performance downgrade is acceptable for
end users.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 111 ++++++++++++++++++++++++++++++-------------
 fs/btrfs/extent_io.h |   3 +-
 fs/btrfs/inode.c     |  25 +---------
 3 files changed, 82 insertions(+), 57 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 53b59944013f..3c80695d5627 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2627,11 +2627,47 @@ static bool btrfs_check_repairable(struct inode *inode,
 	return true;
 }
 
+static void submit_bio_wait_endio(struct bio *bio)
+{
+	complete(bio->bi_private);
+}
+
+static int submit_wait_mirror(struct inode *inode, struct bio *repair_bio,
+			      int mirror)
+{
+	blk_status_t ret;
+	DECLARE_COMPLETION_ONSTACK(done);
+
+	repair_bio->bi_opf = REQ_OP_READ;
+	repair_bio->bi_private = &done;
+	repair_bio->bi_end_io = submit_bio_wait_endio;
+	repair_bio->bi_opf |= REQ_SYNC;
+
+	/*
+	 * All users of read-repair are data reads, already in their own
+	 * workqueue, with btrfs_bio(repair_bio)->csum already properly filled.
+	 * So here we just need to map and submit it.
+	 */
+	ret = btrfs_map_bio(BTRFS_I(inode)->root->fs_info, repair_bio, mirror);
+	if (ret < 0) {
+		repair_bio->bi_status = ret;
+		bio_endio(repair_bio);
+		return ret;
+	}
+
+	wait_for_completion_io(&done);
+	return blk_status_to_errno(repair_bio->bi_status);
+}
+
+/*
+ * Repair one corrupted sector.
+ *
+ * This function will only return after the repair has finished.
+ */
 int btrfs_repair_one_sector(struct inode *inode,
 			    struct bio *failed_bio, u32 bio_offset,
 			    struct page *page, unsigned int pgoff,
-			    u64 start, int failed_mirror,
-			    submit_bio_hook_t *submit_bio_hook)
+			    u64 start, int failed_mirror)
 {
 	struct io_failure_record *failrec;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -2641,6 +2677,9 @@ int btrfs_repair_one_sector(struct inode *inode,
 	const int icsum = bio_offset >> fs_info->sectorsize_bits;
 	struct bio *repair_bio;
 	struct btrfs_bio *repair_bbio;
+	int i;
+	int num_copies;
+	bool found_good = false;
 
 	btrfs_debug(fs_info,
 		   "repair read error: read error at %llu", start);
@@ -2657,12 +2696,12 @@ int btrfs_repair_one_sector(struct inode *inode,
 		return -EIO;
 	}
 
+	num_copies = btrfs_num_copies(fs_info, failrec->logical,
+				      fs_info->sectorsize);
+	ASSERT(num_copies > 1);
 	repair_bio = btrfs_bio_alloc(1);
 	repair_bbio = btrfs_bio(repair_bio);
-	repair_bio->bi_opf = REQ_OP_READ;
-	repair_bio->bi_end_io = failed_bio->bi_end_io;
 	repair_bio->bi_iter.bi_sector = failrec->logical >> 9;
-	repair_bio->bi_private = failed_bio->bi_private;
 
 	if (failed_bbio->csum) {
 		const u32 csum_size = fs_info->csum_size;
@@ -2679,13 +2718,36 @@ int btrfs_repair_one_sector(struct inode *inode,
 		    "repair read error: submitting new read to mirror %d",
 		    failrec->this_mirror);
 
-	/*
-	 * At this point we have a bio, so any errors from submit_bio_hook()
-	 * will be handled by the endio on the repair_bio, so we can't return an
-	 * error here.
-	 */
-	submit_bio_hook(inode, repair_bio, failrec->this_mirror, failrec->bio_flags);
-	return BLK_STS_OK;
+	for (i = 1; i < num_copies; i++) {
+		int mirror;
+		int ret;
+
+		mirror = failed_mirror + i;
+		if (mirror > num_copies)
+			mirror -= num_copies;
+		ASSERT(mirror <= num_copies);
+
+		ret = submit_wait_mirror(inode, repair_bio, mirror);
+		/* Failed to submit, try next mirror */
+		if (ret < 0)
+			continue;
+
+		/* Verify the checksum */
+		if (failed_bbio->csum)
+			ret = btrfs_check_data_sector(fs_info, page, pgoff,
+						      repair_bbio->csum);
+		if (ret == 0) {
+			found_good = true;
+			break;
+		}
+	}
+	if (!found_good)
+		return -EIO;
+
+	/* Don't forget to clear the failure record */
+	clean_io_failure(BTRFS_I(inode)->root->fs_info, failure_tree, tree, start,
+			 page, btrfs_ino(BTRFS_I(inode)), pgoff);
+	return 0;
 }
 
 static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
@@ -2720,8 +2782,7 @@ static blk_status_t submit_read_repair(struct inode *inode,
 				      struct bio *failed_bio, u32 bio_offset,
 				      struct page *page, unsigned int pgoff,
 				      u64 start, u64 end, int failed_mirror,
-				      unsigned int error_bitmap,
-				      submit_bio_hook_t *submit_bio_hook)
+				      unsigned int error_bitmap)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	const u32 sectorsize = fs_info->sectorsize;
@@ -2759,15 +2820,10 @@ static blk_status_t submit_read_repair(struct inode *inode,
 		ret = btrfs_repair_one_sector(inode, failed_bio,
 				bio_offset + offset,
 				page, pgoff + offset, start + offset,
-				failed_mirror, submit_bio_hook);
+				failed_mirror);
 		if (!ret) {
-			/*
-			 * We have submitted the read repair, the page release
-			 * will be handled by the endio function of the
-			 * submitted repair bio.
-			 * Thus we don't need to do any thing here.
-			 */
-			continue;
+			uptodate = true;
+			goto next;
 		}
 		/*
 		 * Repair failed, just record the error but still continue.
@@ -2994,7 +3050,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 {
 	struct bio_vec *bvec;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
-	struct extent_io_tree *tree, *failure_tree;
 	struct processed_extent processed = { 0 };
 	/*
 	 * The offset to the beginning of a bio, since one bio can never be
@@ -3021,8 +3076,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 			"end_bio_extent_readpage: bi_sector=%llu, err=%d, mirror=%u",
 			bio->bi_iter.bi_sector, bio->bi_status,
 			bbio->mirror_num);
-		tree = &BTRFS_I(inode)->io_tree;
-		failure_tree = &BTRFS_I(inode)->io_failure_tree;
 
 		/*
 		 * We always issue full-sector reads, but if some block in a
@@ -3057,11 +3110,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 			}
 			if (ret)
 				uptodate = false;
-			else
-				clean_io_failure(BTRFS_I(inode)->root->fs_info,
-						 failure_tree, tree, start,
-						 page,
-						 btrfs_ino(BTRFS_I(inode)), 0);
 		}
 
 		if (likely(uptodate))
@@ -3082,8 +3130,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 			 */
 			submit_read_repair(inode, bio, bio_offset, page,
 					   start - page_offset(page), start,
-					   end, mirror, error_bitmap,
-					   btrfs_submit_data_bio);
+					   end, mirror, error_bitmap);
 
 			ASSERT(bio_offset + len > bio_offset);
 			bio_offset += len;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 151e9da5da2d..a8517ffc92cd 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -305,8 +305,7 @@ struct io_failure_record {
 int btrfs_repair_one_sector(struct inode *inode,
 			    struct bio *failed_bio, u32 bio_offset,
 			    struct page *page, unsigned int pgoff,
-			    u64 start, int failed_mirror,
-			    submit_bio_hook_t *submit_bio_hook);
+			    u64 start, int failed_mirror);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 bool find_lock_delalloc_range(struct inode *inode,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ba1315784f6..b11faafabe01 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7761,27 +7761,6 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 	kfree(dip);
 }
 
-static blk_status_t submit_dio_repair_bio(struct inode *inode, struct bio *bio,
-					  int mirror_num,
-					  unsigned long bio_flags)
-{
-	struct btrfs_dio_private *dip = bio->bi_private;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	blk_status_t ret;
-
-	BUG_ON(bio_op(bio) == REQ_OP_WRITE);
-
-	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
-	if (ret)
-		return ret;
-
-	refcount_inc(&dip->refs);
-	ret = btrfs_map_bio(fs_info, bio, mirror_num);
-	if (ret)
-		refcount_dec(&dip->refs);
-	return ret;
-}
-
 static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 					     struct btrfs_bio *bbio,
 					     const bool uptodate)
@@ -7818,12 +7797,12 @@ static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 				int ret;
 
 				ASSERT((start - orig_file_offset) < UINT_MAX);
+
 				ret = btrfs_repair_one_sector(inode,
 						&bbio->bio,
 						start - orig_file_offset,
 						bvec.bv_page, pgoff,
-						start, bbio->mirror_num,
-						submit_dio_repair_bio);
+						start, bbio->mirror_num);
 				if (ret)
 					err = errno_to_blk_status(ret);
 			}
-- 
2.35.1



* Re: [PATCH RFC 1/2] btrfs: introduce a pure data checksum checking helper
  2022-03-25  9:42 ` [PATCH RFC 1/2] btrfs: introduce a pure data checksum checking helper Qu Wenruo
@ 2022-03-25 10:14   ` Johannes Thumshirn
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Thumshirn @ 2022-03-25 10:14 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good in case 2/2 is ok as well,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode
  2022-03-25  9:42 ` [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode Qu Wenruo
@ 2022-03-25 10:27   ` Christoph Hellwig
  2022-03-25 10:53     ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2022-03-25 10:27 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

I'd suggest to at least submit all I/O in parallel.  Just put
a structure with an atomic_t and completion on the stack.  Start with
a refcount of one, inc for each submit, and then dec after all
submissions and for each completion and only wake on the final dec.
That will make sure there is only one wait instead of one per copy.

Extra benefits for also doing this for all sectors, but that might be
a little more work.
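
Roughly something like this (just a sketch, not a patch; repair_ctrl,
repair_end_io() and submit_one_repair_bio() are made-up names):

struct repair_ctrl {
	atomic_t pending;	/* in-flight bios plus one ref for the submitter */
	atomic_t errors;
	struct completion done;
};

static void repair_end_io(struct bio *bio)
{
	struct repair_ctrl *ctrl = bio->bi_private;

	if (bio->bi_status)
		atomic_inc(&ctrl->errors);
	/* Only the final completion wakes the submitter. */
	if (atomic_dec_and_test(&ctrl->pending))
		complete(&ctrl->done);
	bio_put(bio);
}

static int repair_bad_sectors(struct inode *inode, unsigned long *bad_bitmap,
			      int nr_sectors)
{
	struct repair_ctrl ctrl;
	int bit;

	atomic_set(&ctrl.pending, 1);
	atomic_set(&ctrl.errors, 0);
	init_completion(&ctrl.done);

	for_each_set_bit(bit, bad_bitmap, nr_sectors) {
		atomic_inc(&ctrl.pending);
		/*
		 * Hypothetical helper: build a READ bio for this sector, set
		 * bi_private/bi_end_io to &ctrl/repair_end_io and submit it
		 * through btrfs_map_bio().
		 */
		if (submit_one_repair_bio(inode, bit, &ctrl) < 0) {
			/* Never submitted, drop the ref we just took. */
			atomic_dec(&ctrl.pending);
			atomic_inc(&ctrl.errors);
		}
	}

	/* Drop the submitter's ref, and wait only if something is in flight. */
	if (!atomic_dec_and_test(&ctrl.pending))
		wait_for_completion_io(&ctrl.done);

	return atomic_read(&ctrl.errors) ? -EIO : 0;
}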


* Re: [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode
  2022-03-25 10:27   ` Christoph Hellwig
@ 2022-03-25 10:53     ` Qu Wenruo
  2022-03-25 14:58       ` Sweet Tea Dorminy
  2022-03-25 16:43       ` Christoph Hellwig
  0 siblings, 2 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-03-25 10:53 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/3/25 18:27, Christoph Hellwig wrote:
> I'd suggest to at least submit all I/O in parallel.  Just put
> a structure with an atomic_t and completion on the stack.  Start with
> a recount of one, inc for each submit, and then dec after all
> submissions and for each completion and only wake on the final dec.
> That will make sure there is only one wait instead of one per copy.
>
> Extra benefits for also doing this for all sectors, but that might be
> a little more work.

Exactly the same plan B in my head.

A small problem is related to how to record all these corrupted sectors.

Using an on-stack bitmap would be best, and it's feasible for x86_64 at least
(256 bvecs = 256x4K pages = 256 sectors = 256 bits).

But I'm not that sure for larger page sizes, as the same 256 bvecs can cover
256x64K pages = 256 * 16 sectors, way too large for an on-stack bitmap.
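
For scale (illustrative only, assuming 4K sectorsize, with MAX_BVECS standing
in for the 256 figure above):

#define MAX_BVECS	256

/* 4K pages: one sector per bvec -> 256 bits, i.e. 32 bytes on the stack. */
DECLARE_BITMAP(corrupted_4k, MAX_BVECS);

/* 64K pages: 16 sectors per bvec -> 4096 bits, i.e. 512 bytes on the stack. */
DECLARE_BITMAP(corrupted_64k, MAX_BVECS * (SZ_64K / SZ_4K));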


If we go with a list, we need extra memory allocation, which is feasible but
less simple.

Let's see how the simplest way goes; if it's not good enough, we still have a plan B.

Thanks,
Qu


* Re: [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode
  2022-03-25 10:53     ` Qu Wenruo
@ 2022-03-25 14:58       ` Sweet Tea Dorminy
  2022-03-26  0:13         ` Qu Wenruo
  2022-03-25 16:43       ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Sweet Tea Dorminy @ 2022-03-25 14:58 UTC (permalink / raw)
  To: Qu Wenruo, Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 3/25/22 06:53, Qu Wenruo wrote:
> 
> 
> On 2022/3/25 18:27, Christoph Hellwig wrote:
>> I'd suggest to at least submit all I/O in parallel.  Just put
>> a structure with an atomic_t and completion on the stack.  Start with
>> a recount of one, inc for each submit, and then dec after all
>> submissions and for each completion and only wake on the final dec.
>> That will make sure there is only one wait instead of one per copy.
>>
>> Extra benefits for also doing this for all sectors, but that might be
>> a little more work.
> 
> Exactly the same plan B in my head.
> 
> A small problem is related to how to record all these corrupted sectors.
> 
> Using a on-stack bitmap would be the best, and it's feasible for x86_64
> at least.
> (256 bvecs = 256x4K pages = 256 sectors = 256bits).
> 
> But not that sure for larger page size.
> As the same 256 bvecs can go 256x64K pages = 256 * 16 sectors, way too
> large for on-stack bitmap.
I'm not understanding the reason why each bvec can only have one page... 
as far as I know, a bvec could have any number of contiguous pages, 
making it infeasible for x86_64 too, but maybe I'm missing some 
constraint in their construction to make it one-to-one.
> 
> 
> If we go list, we need extra memory allocation, which can be feasible
> but less simple.
> 
> Let's see how the simplest way go, if not that good, we still have a 
> plan B.
> 
> Thanks,
> Qu


* Re: [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode
  2022-03-25 10:53     ` Qu Wenruo
  2022-03-25 14:58       ` Sweet Tea Dorminy
@ 2022-03-25 16:43       ` Christoph Hellwig
  2022-03-26  0:12         ` Qu Wenruo
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2022-03-25 16:43 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Fri, Mar 25, 2022 at 06:53:31PM +0800, Qu Wenruo wrote:
> > I'd suggest to at least submit all I/O in parallel.  Just put
> > a structure with an atomic_t and completion on the stack.  Start with
> > a recount of one, inc for each submit, and then dec after all
> > submissions and for each completion and only wake on the final dec.
> > That will make sure there is only one wait instead of one per copy.
> > 
> > Extra benefits for also doing this for all sectors, but that might be
> > a little more work.
> 
> Exactly the same plan B in my head.
> 
> A small problem is related to how to record all these corrupted sectors.
> 
> Using a on-stack bitmap would be the best, and it's feasible for x86_64
> at least.
> (256 bvecs = 256x4K pages = 256 sectors = 256bits).
> But not that sure for larger page size.
> As the same 256 bvecs can go 256x64K pages = 256 * 16 sectors, way too
> large for on-stack bitmap.
> 
> 
> If we go list, we need extra memory allocation, which can be feasible
> but less simple.

Just chunk it up so that only up to a reasonable bitmap size worth of
I/O is done at a time.
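
E.g. something like this (sketch only, the chunk size is arbitrary and
repair_chunk() is a made-up helper):

#define REPAIR_CHUNK_SECTORS	256

	u32 cur;

	for (cur = 0; cur < total_sectors; cur += REPAIR_CHUNK_SECTORS) {
		DECLARE_BITMAP(bad, REPAIR_CHUNK_SECTORS);
		u32 nr = min_t(u32, total_sectors - cur, REPAIR_CHUNK_SECTORS);

		bitmap_zero(bad, REPAIR_CHUNK_SECTORS);
		/*
		 * Fill @bad for sectors [cur, cur + nr), then submit the
		 * repair bios for this chunk in parallel and wait once.
		 */
		repair_chunk(inode, failed_bio, cur, nr, bad);
	}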


* Re: [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode
  2022-03-25 16:43       ` Christoph Hellwig
@ 2022-03-26  0:12         ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-03-26  0:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs



On 2022/3/26 00:43, Christoph Hellwig wrote:
> On Fri, Mar 25, 2022 at 06:53:31PM +0800, Qu Wenruo wrote:
>>> I'd suggest to at least submit all I/O in parallel.  Just put
>>> a structure with an atomic_t and completion on the stack.  Start with
>>> a recount of one, inc for each submit, and then dec after all
>>> submissions and for each completion and only wake on the final dec.
>>> That will make sure there is only one wait instead of one per copy.
>>>
>>> Extra benefits for also doing this for all sectors, but that might be
>>> a little more work.
>>
>> Exactly the same plan B in my head.
>>
>> A small problem is related to how to record all these corrupted sectors.
>>
>> Using a on-stack bitmap would be the best, and it's feasible for x86_64
>> at least.
>> (256 bvecs = 256x4K pages = 256 sectors = 256bits).
>> But not that sure for larger page size.
>> As the same 256 bvecs can go 256x64K pages = 256 * 16 sectors, way too
>> large for on-stack bitmap.
>>
>>
>> If we go list, we need extra memory allocation, which can be feasible
>> but less simple.
>
> Just chunk it up so that only up to a reasonable bitmap size worth of
> I/O is done at a time.

But it can be fragmented. Thus even if we chunk them up, we may still need as
many submissions as there are corrupted sectors.

Thanks,
Qu


* Re: [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode
  2022-03-25 14:58       ` Sweet Tea Dorminy
@ 2022-03-26  0:13         ` Qu Wenruo
  2022-03-26  1:10           ` Qu Wenruo
  2022-03-29  6:16           ` Christoph Hellwig
  0 siblings, 2 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-03-26  0:13 UTC (permalink / raw)
  To: Sweet Tea Dorminy, Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/3/25 22:58, Sweet Tea Dorminy wrote:
>
>
> On 3/25/22 06:53, Qu Wenruo wrote:
>>
>>
>> On 2022/3/25 18:27, Christoph Hellwig wrote:
>>> I'd suggest to at least submit all I/O in parallel.  Just put
>>> a structure with an atomic_t and completion on the stack.  Start with
>>> a recount of one, inc for each submit, and then dec after all
>>> submissions and for each completion and only wake on the final dec.
>>> That will make sure there is only one wait instead of one per copy.
>>>
>>> Extra benefits for also doing this for all sectors, but that might be
>>> a little more work.
>>
>> Exactly the same plan B in my head.
>>
>> A small problem is related to how to record all these corrupted sectors.
>>
>> Using a on-stack bitmap would be the best, and it's feasible for x86_64
>> at least.
>> (256 bvecs = 256x4K pages = 256 sectors = 256bits).
>>
>> But not that sure for larger page size.
>> As the same 256 bvecs can go 256x64K pages = 256 * 16 sectors, way too
>> large for on-stack bitmap.
> I'm not understanding the reason why each bvec can only have one page...
> as far as I know, a bvec could have any number of contiguous pages,
> making it infeasible for x86_64 too, but maybe I'm missing some
> constraint in their construction to make it one-to-one.

Btrfs doesn't utilize multi-page bvecs at all.

So it's still one bvec per page.

Thanks,
Qu

>>
>>
>> If we go list, we need extra memory allocation, which can be feasible
>> but less simple.
>>
>> Let's see how the simplest way go, if not that good, we still have a
>> plan B.
>>
>> Thanks,
>> Qu


* Re: [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode
  2022-03-26  0:13         ` Qu Wenruo
@ 2022-03-26  1:10           ` Qu Wenruo
  2022-03-29  6:17             ` Christoph Hellwig
  2022-03-29  6:16           ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2022-03-26  1:10 UTC (permalink / raw)
  To: Sweet Tea Dorminy, Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/3/26 08:13, Qu Wenruo wrote:
>
>
> On 2022/3/25 22:58, Sweet Tea Dorminy wrote:
>>
>>
>> On 3/25/22 06:53, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/3/25 18:27, Christoph Hellwig wrote:
>>>> I'd suggest to at least submit all I/O in parallel.  Just put
>>>> a structure with an atomic_t and completion on the stack.  Start with
>>>> a recount of one, inc for each submit, and then dec after all
>>>> submissions and for each completion and only wake on the final dec.
>>>> That will make sure there is only one wait instead of one per copy.
>>>>
>>>> Extra benefits for also doing this for all sectors, but that might be
>>>> a little more work.
>>>
>>> Exactly the same plan B in my head.
>>>
>>> A small problem is related to how to record all these corrupted sectors.
>>>
>>> Using a on-stack bitmap would be the best, and it's feasible for x86_64
>>> at least.
>>> (256 bvecs = 256x4K pages = 256 sectors = 256bits).
>>>
>>> But not that sure for larger page size.
>>> As the same 256 bvecs can go 256x64K pages = 256 * 16 sectors, way too
>>> large for on-stack bitmap.
>> I'm not understanding the reason why each bvec can only have one page...
>> as far as I know, a bvec could have any number of contiguous pages,
>> making it infeasible for x86_64 too, but maybe I'm missing some
>> constraint in their construction to make it one-to-one.
>
> Btrfs doesn't utilize that multi-page bvec at all.

All my bad.

Thanks for the hint on this; to be more precise, btrfs always iterates the
bio using the single-page helper, bio_for_each_segment_all().

Thus even if a bvec has multiple pages in it, we still treat it as single
pages during the iteration.

And yes, that would make an on-stack bitmap even less feasible on x86_64.
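
For reference, that iteration pattern looks roughly like this:

	struct bio_vec *bvec;
	struct bvec_iter_all iter_all;

	/*
	 * bio_for_each_segment_all() walks the bio one page at a time, even
	 * when the underlying bvecs are multi-page.
	 */
	bio_for_each_segment_all(bvec, bio, iter_all) {
		struct page *page = bvec->bv_page;
		u32 pgoff = bvec->bv_offset;
		u32 len = bvec->bv_len;

		/* Per-page (and per-sector) csum handling goes here. */
	}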

Thanks,
Qu

>
> So still one bvec one page.
>
> Thanks,
> Qu
>
>>>
>>>
>>> If we go list, we need extra memory allocation, which can be feasible
>>> but less simple.
>>>
>>> Let's see how the simplest way go, if not that good, we still have a
>>> plan B.
>>>
>>> Thanks,
>>> Qu


* Re: [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode
  2022-03-26  0:13         ` Qu Wenruo
  2022-03-26  1:10           ` Qu Wenruo
@ 2022-03-29  6:16           ` Christoph Hellwig
  1 sibling, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2022-03-29  6:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Sweet Tea Dorminy, Christoph Hellwig, Qu Wenruo, linux-btrfs

On Sat, Mar 26, 2022 at 08:13:33AM +0800, Qu Wenruo wrote:
> > I'm not understanding the reason why each bvec can only have one page...
> > as far as I know, a bvec could have any number of contiguous pages,
> > making it infeasible for x86_64 too, but maybe I'm missing some
> > constraint in their construction to make it one-to-one.
> 
> Btrfs doesn't utilize that multi-page bvec at all.

That is not true.  Multi-page bvecs are automatically created by
bio_add_page(), which is used by btrfs everywhere.
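
E.g. a bio filled page by page (sketch only):

	/*
	 * bio_add_page() merges physically contiguous pages into the current
	 * bvec, so even a page-by-page loop like this can end up with
	 * multi-page bvecs.
	 */
	for (i = 0; i < nr_pages; i++) {
		if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) != PAGE_SIZE)
			break;	/* bio is full, submit it and use a new one */
	}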


* Re: [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode
  2022-03-26  1:10           ` Qu Wenruo
@ 2022-03-29  6:17             ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2022-03-29  6:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Sweet Tea Dorminy, Christoph Hellwig, Qu Wenruo, linux-btrfs

On Sat, Mar 26, 2022 at 09:10:51AM +0800, Qu Wenruo wrote:
> Thanks for your hint on this, the more correct term is, btrfs always
> iterate the bvec using single page helper, bio_for_each_segment_all().
> 
> Thus even if the bvec has multiple pages in it, we still treat it as
> single page using bvec_iter.

Yes, like most file systems.


* Re: [PATCH RFC 0/2] btrfs: make read repair work in synchronous mode
  2022-03-25  9:42 [PATCH RFC 0/2] btrfs: make read repair work in synchronous mode Qu Wenruo
  2022-03-25  9:42 ` [PATCH RFC 1/2] btrfs: introduce a pure data checksum checking helper Qu Wenruo
  2022-03-25  9:42 ` [PATCH RFC 2/2] btrfs: do data read repair in synchronous mode Qu Wenruo
@ 2022-04-19 19:19 ` David Sterba
  2022-04-19 23:22   ` Qu Wenruo
  2 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2022-04-19 19:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Mar 25, 2022 at 05:42:47PM +0800, Qu Wenruo wrote:
> The first patch is just a preparation for the 2nd patch, which is also
> the core.
> 
> It will make repair_one_sector() to wait for the read from other copies
> finish, before returning.
> 
> This will make the code easier to read, but huge drop in concurrency and
> performance for read-repair.
> 
> My only justification is read-repair should be a cold path, and we may
> be able to afford the change.
> 
> Qu Wenruo (2):
>   btrfs: introduce a pure data checksum checking helper
>   btrfs: do data read repair in synchronous mode

After reading the discussion, is it right that you're going to implement
the repair in another way?


* Re: [PATCH RFC 0/2] btrfs: make read repair work in synchronous mode
  2022-04-19 19:19 ` [PATCH RFC 0/2] btrfs: make read repair work " David Sterba
@ 2022-04-19 23:22   ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-04-19 23:22 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2022/4/20 03:19, David Sterba wrote:
> On Fri, Mar 25, 2022 at 05:42:47PM +0800, Qu Wenruo wrote:
>> The first patch is just a preparation for the 2nd patch, which is also
>> the core.
>>
>> It will make repair_one_sector() to wait for the read from other copies
>> finish, before returning.
>>
>> This will make the code easier to read, but huge drop in concurrency and
>> performance for read-repair.
>>
>> My only justification is read-repair should be a cold path, and we may
>> be able to afford the change.
>>
>> Qu Wenruo (2):
>>    btrfs: introduce a pure data checksum checking helper
>>    btrfs: do data read repair in synchronous mode
>
> After reading the discussion, is it right that you're going to implement
> the repair in another way?

Yes, this is just an RFC and I didn't expect it to get merged anyway, due to
the big performance drop, even though it only affects repair performance.

Thanks,
Qu

