linux-btrfs.vger.kernel.org archive mirror
* [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time
@ 2021-12-06  2:29 Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 01/17] btrfs: update a stale comment on btrfs_submit_bio_hook() Qu Wenruo
                   ` (17 more replies)
  0 siblings, 18 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from this branch:

https://github.com/adam900710/linux/tree/refactor_chunk_map

[BACKGROUND]

Currently btrfs never uses bio_split() to split its bio against RAID
stripe boundaries.

Instead, inside btrfs we check the stripe boundary every time we
allocate a new bio, and ensure the new bio never crosses stripe
boundaries.

[PROBLEMS]

Although this behavior works fine, it's against the common practice
used in stacked drivers, and it makes the effort to convert to iomap
harder.

There is also a hidden burden: every time we allocate a new bio, we use
BIO_MAX_BVECS, but since we know the boundaries, for RAID0/RAID10 we can
only fit at most 16 pages (fixed 64K stripe size, and 4K page size),
wasting the 256 slots we allocated.

[CHALLENGES]

This patchset attempts to improve the situation by moving the bio split
to btrfs_map_bio() time, so the upper layers no longer need to bother
with splitting bios against RAID stripe or even chunk boundaries.

But there are several challenges:

- Conflicts in various endio functions
  We want to keep the existing granularity, instead of chained endios,
  thus we must make the involved endio functions handle split bios.

  Although most endio functions already do their work independently of
  the bio size, they do not yet fully handle split bios.

  These patches will convert them to use the saved bi_iter and only
  iterate the split range instead of the whole bio.
  This change involves 3 types of IO:

  * Buffered IO
    Including both data and metadata
  * Direct IO
  * Compressed IO

  Their endio functions need different levels of updates to handle
  split bios.

  Furthermore, there is another endio, end_workqueue_bio(), which can't
  handle split bios at all, thus we change the timing so that
  btrfs_bio_wq_end_io() is only called after the bio has been split, as
  sketched below.
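
  A rough sketch of the timing change (this previews a helper added by a
  later patch in this series; it is illustrative only, not necessarily
  the exact final code):

  /*
   * Callers only set btrfs_bio(bio)->endio_type instead of calling
   * btrfs_bio_wq_end_io() directly; the workqueue remap is then done at
   * stripe submission time, i.e. after any bio split has happened.
   */
  static int btrfs_bio_final_endio_remap(struct btrfs_fs_info *fs_info,
                                         struct bio *bio)
  {
          /* Write bios don't need their endio deferred to a workqueue */
          if (btrfs_op(bio) == BTRFS_MAP_WRITE)
                  return 0;

          return blk_status_to_errno(btrfs_bio_wq_end_io(fs_info, bio,
                                          btrfs_bio(bio)->endio_type));
  }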

- Checksum verification
  Currently we rely on btrfs_bio::csum to contain the checksum for the
  whole bio.
  If a bio gets split, csum will no longer point to the correct
  location for the split bio.

  This can be solved by introducing btrfs_bio::offset_to_original, and
  using that new member to calculate where we should read the csum from,
  as sketched below.

  For the parent bio, it still has btrfs_bio::csum for the whole bio,
  thus it can still free it correctly.
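
  For illustration, the csum lookup then roughly becomes the following
  (this helper is introduced by a later patch in this series and is
  shown here only as a sketch):

  static u8 *bbio_get_real_csum(struct btrfs_fs_info *fs_info,
                                struct btrfs_bio *bbio)
  {
          /* A split bio grabs the csum array from its parent */
          u8 *csum = bbio->is_split_bio ? btrfs_bio(bbio->parent)->csum :
                                          bbio->csum;

          if (!csum)
                  return NULL;

          /* Skip the csums of the range before this (split) bio */
          return csum + (bbio->offset_to_original >>
                         fs_info->sectorsize_bits) * fs_info->csum_size;
  }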

- Independent endio for each split bio
  Unlike stacked drivers, for RAID10 btrfs needs to try its best to
  read every sector, to handle the following case (X means bad, either
  unable to read or failing checksum verification, V means good):

  Dev 1	(missing) | D1 (X) |
  Dev 2 (OK)	  | D1 (V) |
  Dev 3 (OK)	  | D2 (V) |
  Dev 4 (OK)	  | D2 (X) |

  In the above RAID10 case, dev1 is missing, and although dev4 is fine,
  its D2 sector is corrupted (by bit rot or whatever).

  If we use bio_chain(), the read bios for both D1 and D2 will be
  split, and since D1 is missing, the whole D1 and D2 read will be
  marked as an error, thus we will try to read from dev2 and dev4.

  But D2 in dev4 has a csum mismatch, so we can only rebuild D1 and D2
  correctly from dev2:D1 and dev3:D2.

  This patchset resolves this by saving bi_iter into btrfs_bio::iter,
  and uses it at endio time to iterate only the split part of a bio, as
  in the sketch below.
  Other than this, the existing read/write page endio functions can
  handle split bios properly without problems.
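
  Inside an endio handler, the common conversion then looks roughly like
  this (the pattern used by the data and metadata endio patches later in
  this series):

  struct bio_vec bvec;
  struct bvec_iter iter;

  /* bio->bi_iter is consumed by now, iterate only the saved range */
  ASSERT(btrfs_bio(bio)->iter.bi_size);
  __bio_for_each_segment(bvec, bio, iter, btrfs_bio(bio)->iter) {
          /* handle bvec.bv_page / bvec.bv_offset / bvec.bv_len */
  }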

- Bad RAID56 naming/functionality
  There are quite a few RAID56 call sites that rely on specific
  behavior of __btrfs_map_block(), like returning @map_length as
  stripe_len rather than the real mapped length.

  This is handled by some small cleanups specific for RAID56.

[CHANGELOG]
RFC->v1:
- Better patch split
  Now patches 01~06 are refactors/cleanups/preparations.
  Patches 07~13 do the conversion while still handling both the old and
  new bio split timings.
  Finally patches 14~16 convert the bio split call sites one by one to
  the new facility.
  The final patch is just a small cleanup.

- Various bug fixes
  During the full fstests run, various stupid bugs were exposed and
  fixed.

v2:
- Fix the error paths for allocated but never submitted bios
  There are tons of error paths where we allocate a bio but it goes to
  bio_endio() directly without going through btrfs_map_bio().
  New ASSERT()s in endio functions require a populated btrfs_bio::iter,
  thus for such bios we still need to call btrfs_bio_save_iter() to
  populate btrfs_bio::iter and prevent those ASSERT()s from triggering.

- Fix scrub_stripe_index_and_offset() which abuses stripe_len and
  mapped_length

Qu Wenruo (17):
  btrfs: update a stale comment on btrfs_submit_bio_hook()
  btrfs: save bio::bi_iter into btrfs_bio::iter before any endio
  btrfs: use correct bio size for error message in btrfs_end_dio_bio()
  btrfs: refactor btrfs_map_bio()
  btrfs: move btrfs_bio_wq_end_io() calls into submit_stripe_bio()
  btrfs: replace btrfs_dio_private::refs with
    btrfs_dio_private::pending_bytes
  btrfs: introduce btrfs_bio_split() helper
  btrfs: make data buffered read path to handle split bio properly
  btrfs: make data buffered write endio function to be split bio
    compatible
  btrfs: make metadata write endio functions to be split bio compatible
  btrfs: make dec_and_test_compressed_bio() to be split bio compatible
  btrfs: return proper mapped length for RAID56 profiles in
    __btrfs_map_block()
  btrfs: allow btrfs_map_bio() to split bio according to chunk stripe
    boundaries
  btrfs: remove buffered IO stripe boundary calculation
  btrfs: remove stripe boundary calculation for compressed IO
  btrfs: remove the stripe boundary calculation for direct IO
  btrfs: unexport btrfs_get_io_geometry()

 fs/btrfs/btrfs_inode.h |  10 +-
 fs/btrfs/compression.c |  70 +++-----------
 fs/btrfs/disk-io.c     |  11 +--
 fs/btrfs/extent_io.c   | 196 ++++++++++++++++++++++++++------------
 fs/btrfs/extent_io.h   |   2 +
 fs/btrfs/inode.c       | 210 ++++++++++++++++-------------------------
 fs/btrfs/raid56.c      |  14 ++-
 fs/btrfs/raid56.h      |   2 +-
 fs/btrfs/scrub.c       |  14 +--
 fs/btrfs/volumes.c     | 144 ++++++++++++++++++++--------
 fs/btrfs/volumes.h     |  74 +++++++++++++--
 11 files changed, 436 insertions(+), 311 deletions(-)

-- 
2.34.1



* [PATCH v2 01/17] btrfs: update a stale comment on btrfs_submit_bio_hook()
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 02/17] btrfs: save bio::bi_iter into btrfs_bio::iter before any endio Qu Wenruo
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

This function has been renamed to btrfs_submit_data_bio(), so update
the comment, and add the extra reason why btrfs_submit_dio_bio() doesn't
completely follow the same rules as btrfs_submit_data_bio().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 504cf090fc88..6079d30f83e8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8202,7 +8202,13 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 	bool write = btrfs_op(bio) == BTRFS_MAP_WRITE;
 	blk_status_t ret;
 
-	/* Check btrfs_submit_bio_hook() for rules about async submit. */
+	/*
+	 * Check btrfs_submit_data_bio() for rules about async submit.
+	 *
+	 * The only exception is for RAID56, when there are more than one bios
+	 * to submit, async submit seems to make it harder to collect csums
+	 * for the full stripe.
+	 */
 	if (async_submit)
 		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers);
 
-- 
2.34.1



* [PATCH v2 02/17] btrfs: save bio::bi_iter into btrfs_bio::iter before any endio
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 01/17] btrfs: update a stale comment on btrfs_submit_bio_hook() Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 03/17] btrfs: use correct bio size for error message in btrfs_end_dio_bio() Qu Wenruo
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs_bio::iter is only utilized by direct IO.

But later we will utilize btrfs_bio::iter to record the original
bi_iter, so that all endio functions can iterate the original range.

Thus this patch will introduce a new helper, btrfs_bio_save_iter(), to
save bi_iter into btrfs_bio::iter.

All paths that can lead to a bio_endio() call need such a
btrfs_bio_save_iter() call.

In the most common case, there will be a btrfs_map_bio() call to handle
the submitted bios.

For other error-out paths, we need to call btrfs_bio_save_iter()
manually, or later endio functions will ASSERT() on an empty
btrfs_bio::iter.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c |  3 +++
 fs/btrfs/disk-io.c     |  2 ++
 fs/btrfs/extent_io.c   |  7 +++++++
 fs/btrfs/raid56.c      |  2 ++
 fs/btrfs/volumes.c     |  1 +
 fs/btrfs/volumes.h     | 17 +++++++++++++++++
 6 files changed, 32 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index e776956d5bc9..cc8d13369f53 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -870,6 +870,9 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	/* include any pages we added in add_ra-bio_pages */
 	cb->len = bio->bi_iter.bi_size;
 
+	/* Save bi_iter so that end_bio_extent_readpage() won't freak out. */
+	btrfs_bio_save_iter(btrfs_bio(bio));
+
 	while (cur_disk_byte < disk_bytenr + compressed_len) {
 		u64 offset = cur_disk_byte - disk_bytenr;
 		unsigned int index = offset >> PAGE_SHIFT;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5c598e124c25..76b3fbcb91eb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -817,6 +817,7 @@ static void run_one_async_done(struct btrfs_work *work)
 	/* If an error occurred we just want to clean up the bio and move on */
 	if (async->status) {
 		async->bio->bi_status = async->status;
+		btrfs_bio_save_iter(btrfs_bio(async->bio));
 		bio_endio(async->bio);
 		return;
 	}
@@ -949,6 +950,7 @@ blk_status_t btrfs_submit_metadata_bio(struct inode *inode, struct bio *bio,
 
 out_w_error:
 	bio->bi_status = ret;
+	btrfs_bio_save_iter(btrfs_bio(bio));
 	bio_endio(bio);
 	return ret;
 }
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1a67f4b3986b..efd109caf95b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -175,6 +175,11 @@ int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 
 	/* Caller should ensure the bio has at least some range added */
 	ASSERT(bio->bi_iter.bi_size);
+	/*
+	 * This for later endio on errors, as later endio functions will rely
+	 * on btrfs_bio::iter.
+	 */
+	btrfs_bio_save_iter(btrfs_bio(bio));
 	if (is_data_inode(tree->private_data))
 		ret = btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
 					    bio_flags);
@@ -192,6 +197,7 @@ static void end_write_bio(struct extent_page_data *epd, int ret)
 
 	if (bio) {
 		bio->bi_status = errno_to_blk_status(ret);
+		btrfs_bio_save_iter(btrfs_bio(bio));
 		bio_endio(bio);
 		epd->bio_ctrl.bio = NULL;
 	}
@@ -3355,6 +3361,7 @@ static int alloc_new_bio(struct btrfs_inode *inode,
 error:
 	bio_ctrl->bio = NULL;
 	bio->bi_status = errno_to_blk_status(ret);
+	btrfs_bio_save_iter(btrfs_bio(bio));
 	bio_endio(bio);
 	return ret;
 }
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0e239a4c3b26..13e726c88a81 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1731,6 +1731,7 @@ int raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc,
 		return PTR_ERR(rbio);
 	}
 	bio_list_add(&rbio->bio_list, bio);
+	btrfs_bio_save_iter(btrfs_bio(bio));
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 	rbio->operation = BTRFS_RBIO_WRITE;
 
@@ -2135,6 +2136,7 @@ int raid56_parity_recover(struct bio *bio, struct btrfs_io_context *bioc,
 
 	rbio->operation = BTRFS_RBIO_READ_REBUILD;
 	bio_list_add(&rbio->bio_list, bio);
+	btrfs_bio_save_iter(btrfs_bio(bio));
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
 	rbio->faila = find_logical_bio_stripe(rbio, bio);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f38c230111be..cdf5725f1f32 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6794,6 +6794,7 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	map_length = length;
 
 	btrfs_bio_counter_inc_blocked(fs_info);
+	btrfs_bio_save_iter(btrfs_bio(bio));
 	ret = __btrfs_map_block(fs_info, btrfs_op(bio), logical,
 				&map_length, &bioc, mirror_num, 1);
 	if (ret) {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 3b8130680749..c038fb1e36d5 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -334,6 +334,12 @@ struct btrfs_bio {
 	struct btrfs_device *device;
 	u8 *csum;
 	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
+	/*
+	 * Saved bio::bi_iter before submission.
+	 *
+	 * This allows us to iterate the cloned/split bio properly, as at
+	 * endio time bio::bi_iter is no longer reliable.
+	 */
 	struct bvec_iter iter;
 
 	/*
@@ -356,6 +362,17 @@ static inline void btrfs_bio_free_csum(struct btrfs_bio *bbio)
 	}
 }
 
+/*
+ * Save bbio::bio->bi_iter into bbio::iter so that callers who need the
+ * original bi_iter can access the original part of the bio.
+ * This is especially important for the incoming split btrfs_bio, which needs
+ * to call its endio for and only for the split range.
+ */
+static inline void btrfs_bio_save_iter(struct btrfs_bio *bbio)
+{
+	bbio->iter = bbio->bio.bi_iter;
+}
+
 struct btrfs_io_stripe {
 	struct btrfs_device *dev;
 	u64 physical;
-- 
2.34.1



* [PATCH v2 03/17] btrfs: use correct bio size for error message in btrfs_end_dio_bio()
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 01/17] btrfs: update a stale comment on btrfs_submit_bio_hook() Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 02/17] btrfs: save bio::bi_iter into btrfs_bio::iter before any endio Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 04/17] btrfs: refactor btrfs_map_bio() Qu Wenruo
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

At endio time, bio->bi_iter is no longer valid (there are some cases
where it is still valid, but that is never guaranteed).

Thus if we really want to get the full size of a bio, we have to
iterate its bvecs.

In btrfs_end_dio_bio(), when we hit an error, we grab the bio size from
bi_iter, which can be wrong.

Fix it by iterating the bvecs and calculating the bio size.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6079d30f83e8..126d2117954c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8175,12 +8175,19 @@ static void btrfs_end_dio_bio(struct bio *bio)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	blk_status_t err = bio->bi_status;
 
-	if (err)
+	if (err) {
+		struct bvec_iter_all iter_all;
+		struct bio_vec *bvec;
+		u32 bi_size = 0;
+
+		bio_for_each_segment_all(bvec, bio, iter_all)
+			bi_size += bvec->bv_len;
+
 		btrfs_warn(BTRFS_I(dip->inode)->root->fs_info,
 			   "direct IO failed ino %llu rw %d,%u sector %#Lx len %u err no %d",
 			   btrfs_ino(BTRFS_I(dip->inode)), bio_op(bio),
-			   bio->bi_opf, bio->bi_iter.bi_sector,
-			   bio->bi_iter.bi_size, err);
+			   bio->bi_opf, bio->bi_iter.bi_sector, bi_size, err);
+	}
 
 	if (bio_op(bio) == REQ_OP_READ)
 		err = btrfs_check_read_dio_bio(dip, btrfs_bio(bio), !err);
-- 
2.34.1



* [PATCH v2 04/17] btrfs: refactor btrfs_map_bio()
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (2 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 03/17] btrfs: use correct bio size for error message in btrfs_end_dio_bio() Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 05/17] btrfs: move btrfs_bio_wq_end_io() calls into submit_stripe_bio() Qu Wenruo
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

Currently in btrfs_map_bio() we call __btrfs_map_block(), then use the
returned bioc to submit the real stripes.

This is fine if we're only going to handle one bio at a time.

For the incoming bio split at btrfs_map_bio() time, we want to handle
several different bios, thus we introduce a new helper,
submit_one_mapped_range(), to handle the submission part, making it
much easier to run it in a loop.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 67 ++++++++++++++++++++++++++++------------------
 1 file changed, 41 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index cdf5725f1f32..1630a4d22122 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6777,30 +6777,15 @@ static void bioc_error(struct btrfs_io_context *bioc, struct bio *bio, u64 logic
 	}
 }
 
-blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
-			   int mirror_num)
+static int submit_one_mapped_range(struct btrfs_fs_info *fs_info, struct bio *bio,
+				   struct btrfs_io_context *bioc, u64 map_length,
+				   int mirror_num)
 {
-	struct btrfs_device *dev;
 	struct bio *first_bio = bio;
-	u64 logical = bio->bi_iter.bi_sector << 9;
-	u64 length = 0;
-	u64 map_length;
-	int ret;
-	int dev_nr;
+	u64 logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	int total_devs;
-	struct btrfs_io_context *bioc = NULL;
-
-	length = bio->bi_iter.bi_size;
-	map_length = length;
-
-	btrfs_bio_counter_inc_blocked(fs_info);
-	btrfs_bio_save_iter(btrfs_bio(bio));
-	ret = __btrfs_map_block(fs_info, btrfs_op(bio), logical,
-				&map_length, &bioc, mirror_num, 1);
-	if (ret) {
-		btrfs_bio_counter_dec(fs_info);
-		return errno_to_blk_status(ret);
-	}
+	int dev_nr;
+	int ret;
 
 	total_devs = bioc->num_stripes;
 	bioc->orig_bio = first_bio;
@@ -6819,18 +6804,19 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 						    mirror_num, 1);
 		}
 
-		btrfs_bio_counter_dec(fs_info);
-		return errno_to_blk_status(ret);
+		return ret;
 	}
 
-	if (map_length < length) {
+	if (map_length < bio->bi_iter.bi_size) {
 		btrfs_crit(fs_info,
-			   "mapping failed logical %llu bio len %llu len %llu",
-			   logical, length, map_length);
+			   "mapping failed logical %llu bio len %u len %llu",
+			   logical, bio->bi_iter.bi_size, map_length);
 		BUG();
 	}
 
 	for (dev_nr = 0; dev_nr < total_devs; dev_nr++) {
+		struct btrfs_device *dev;
+
 		dev = bioc->stripes[dev_nr].dev;
 		if (!dev || !dev->bdev || test_bit(BTRFS_DEV_STATE_MISSING,
 						   &dev->dev_state) ||
@@ -6847,6 +6833,35 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 
 		submit_stripe_bio(bioc, bio, bioc->stripes[dev_nr].physical, dev);
 	}
+	return 0;
+}
+
+blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
+			   int mirror_num)
+{
+	u64 logical = bio->bi_iter.bi_sector << 9;
+	u64 length = 0;
+	u64 map_length;
+	int ret;
+	struct btrfs_io_context *bioc = NULL;
+
+	length = bio->bi_iter.bi_size;
+	map_length = length;
+
+	btrfs_bio_counter_inc_blocked(fs_info);
+	btrfs_bio_save_iter(btrfs_bio(bio));
+	ret = __btrfs_map_block(fs_info, btrfs_op(bio), logical,
+				&map_length, &bioc, mirror_num, 1);
+	if (ret) {
+		btrfs_bio_counter_dec(fs_info);
+		return errno_to_blk_status(ret);
+	}
+
+	ret = submit_one_mapped_range(fs_info, bio, bioc, map_length, mirror_num);
+	if (ret < 0) {
+		btrfs_bio_counter_dec(fs_info);
+		return errno_to_blk_status(ret);
+	}
 	btrfs_bio_counter_dec(fs_info);
 	return BLK_STS_OK;
 }
-- 
2.34.1



* [PATCH v2 05/17] btrfs: move btrfs_bio_wq_end_io() calls into submit_stripe_bio()
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (3 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 04/17] btrfs: refactor btrfs_map_bio() Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 06/17] btrfs: replace btrfs_dio_private::refs with btrfs_dio_private::pending_bytes Qu Wenruo
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

This is a preparation patch for the incoming chunk mapping layer bio
split.

Function btrfs_bio_wq_end_io() is going to remap bio::bi_private and
bio::bi_end_io so that the real endio function will be executed in a
workqueue.

The problem is, the remapped bio::bi_private will be newly allocated
memory, and after the original endio is executed, the memory will be
freed.

This will not work well with split bios.

So this patch moves all btrfs_bio_wq_end_io() calls into one helper
function, btrfs_bio_final_endio_remap(), and calls that helper in
submit_stripe_bio().

This refactor also unifies all data bio behaviors.

Before this patch, compressed bios, no matter whether read or write,
were always delayed using a workqueue.

However all data write operations are already delayed using ordered
extents, and metadata writes don't need any delayed execution.

Thus this patch makes compressed bios follow the same data read/write
behavior.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c |  4 +---
 fs/btrfs/disk-io.c     |  9 +--------
 fs/btrfs/inode.c       | 20 +++++---------------
 fs/btrfs/volumes.c     | 41 +++++++++++++++++++++++++++++++++++++----
 fs/btrfs/volumes.h     |  9 ++++++++-
 5 files changed, 52 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index cc8d13369f53..8668c5190805 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -429,10 +429,8 @@ static blk_status_t submit_compressed_bio(struct btrfs_fs_info *fs_info,
 {
 	blk_status_t ret;
 
+	btrfs_bio(bio)->endio_type = BTRFS_WQ_ENDIO_DATA;
 	ASSERT(bio->bi_iter.bi_size);
-	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
-	if (ret)
-		return ret;
 	ret = btrfs_map_bio(fs_info, bio, mirror_num);
 	return ret;
 }
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 76b3fbcb91eb..d6e89822191b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -921,14 +921,7 @@ blk_status_t btrfs_submit_metadata_bio(struct inode *inode, struct bio *bio,
 	blk_status_t ret;
 
 	if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
-		/*
-		 * called for a read, do the setup so that checksum validation
-		 * can happen in the async kernel threads
-		 */
-		ret = btrfs_bio_wq_end_io(fs_info, bio,
-					  BTRFS_WQ_ENDIO_METADATA);
-		if (ret)
-			goto out_w_error;
+		btrfs_bio(bio)->endio_type = BTRFS_WQ_ENDIO_METADATA;
 		ret = btrfs_map_bio(fs_info, bio, mirror_num);
 	} else if (!should_async_write(fs_info, BTRFS_I(inode))) {
 		ret = btree_csum_one_bio(bio);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 126d2117954c..007a20a9b076 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2511,7 +2511,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	enum btrfs_wq_endio_type metadata = BTRFS_WQ_ENDIO_DATA;
+	enum btrfs_wq_endio_type endio_type = BTRFS_WQ_ENDIO_DATA;
 	blk_status_t ret = 0;
 	int skip_sum;
 	int async = !atomic_read(&BTRFS_I(inode)->sync_writers);
@@ -2520,7 +2520,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 		test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state);
 
 	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
-		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
+		endio_type = BTRFS_WQ_ENDIO_FREE_SPACE;
 
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
 		struct page *page = bio_first_bvec_all(bio)->bv_page;
@@ -2532,10 +2532,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 	}
 
 	if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
-		ret = btrfs_bio_wq_end_io(fs_info, bio, metadata);
-		if (ret)
-			goto out;
-
+		btrfs_bio(bio)->endio_type = endio_type;
 		if (bio_flags & EXTENT_BIO_COMPRESSED) {
 			ret = btrfs_submit_compressed_read(inode, bio,
 							   mirror_num,
@@ -8090,10 +8087,6 @@ static blk_status_t submit_dio_repair_bio(struct inode *inode, struct bio *bio,
 
 	BUG_ON(bio_op(bio) == REQ_OP_WRITE);
 
-	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
-	if (ret)
-		return ret;
-
 	refcount_inc(&dip->refs);
 	ret = btrfs_map_bio(fs_info, bio, mirror_num);
 	if (ret)
@@ -8219,11 +8212,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 	if (async_submit)
 		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers);
 
-	if (!write) {
-		ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
-		if (ret)
-			goto err;
-	}
+	if (!write)
+		btrfs_bio(bio)->endio_type = BTRFS_WQ_ENDIO_DATA;
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
 		goto map;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1630a4d22122..fba08cfcbd4e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6725,10 +6725,31 @@ static void btrfs_end_bio(struct bio *bio)
 	}
 }
 
-static void submit_stripe_bio(struct btrfs_io_context *bioc, struct bio *bio,
-			      u64 physical, struct btrfs_device *dev)
+/*
+ * Endio remaps which can't handle cloned bio needs to go here.
+ *
+ * Currently it's only btrfs_bio_wq_end_io().
+ */
+static int btrfs_bio_final_endio_remap(struct btrfs_fs_info *fs_info,
+				       struct bio *bio)
+{
+	blk_status_t sts;
+
+	/* For write bios, we don't need to put their endio into the wq */
+	if (btrfs_op(bio) == BTRFS_MAP_WRITE)
+		return 0;
+
+	sts = btrfs_bio_wq_end_io(fs_info, bio, btrfs_bio(bio)->endio_type);
+	if (sts != BLK_STS_OK)
+		return blk_status_to_errno(sts);
+	return 0;
+}
+
+static int submit_stripe_bio(struct btrfs_io_context *bioc, struct bio *bio,
+			     u64 physical, struct btrfs_device *dev)
 {
 	struct btrfs_fs_info *fs_info = bioc->fs_info;
+	int ret;
 
 	bio->bi_private = bioc;
 	btrfs_bio(bio)->device = dev;
@@ -6755,9 +6776,14 @@ static void submit_stripe_bio(struct btrfs_io_context *bioc, struct bio *bio,
 		dev->devid, bio->bi_iter.bi_size);
 	bio_set_dev(bio, dev->bdev);
 
-	btrfs_bio_counter_inc_noblocked(fs_info);
+	/* Do the final endio remap if needed */
+	ret = btrfs_bio_final_endio_remap(fs_info, bio);
+	if (ret < 0)
+		return ret;
 
+	btrfs_bio_counter_inc_noblocked(fs_info);
 	btrfsic_submit_bio(bio);
+	return ret;
 }
 
 static void bioc_error(struct btrfs_io_context *bioc, struct bio *bio, u64 logical)
@@ -6831,9 +6857,16 @@ static int submit_one_mapped_range(struct btrfs_fs_info *fs_info, struct bio *bi
 		else
 			bio = first_bio;
 
-		submit_stripe_bio(bioc, bio, bioc->stripes[dev_nr].physical, dev);
+		ret = submit_stripe_bio(bioc, bio,
+					bioc->stripes[dev_nr].physical, dev);
+		if (ret < 0)
+			goto error;
 	}
 	return 0;
+error:
+	for (; dev_nr < total_devs; dev_nr++)
+		bioc_error(bioc, first_bio, logical);
+	return ret;
 }
 
 blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index c038fb1e36d5..b2081b03990a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -328,7 +328,14 @@ struct btrfs_fs_devices {
  * Mostly for btrfs specific features like csum and mirror_num.
  */
 struct btrfs_bio {
-	unsigned int mirror_num;
+	u16 mirror_num;
+
+	/*
+	 * To tell which workqueue the bio's endio should be executed in.
+	 *
+	 * Only for read bios.
+	 */
+	u16 endio_type;
 
 	/* @device is for stripe IO submission. */
 	struct btrfs_device *device;
-- 
2.34.1



* [PATCH v2 06/17] btrfs: replace btrfs_dio_private::refs with btrfs_dio_private::pending_bytes
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (4 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 05/17] btrfs: move btrfs_bio_wq_end_io() calls into submit_stripe_bio() Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-09 10:02   ` Johannes Thumshirn
  2021-12-06  2:29 ` [PATCH v2 07/17] btrfs: introduce btrfs_bio_split() helper Qu Wenruo
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

This mostly follows the behavior of compressed_bio::pending_sectors.

The point here is, dip::refs is not split-bio friendly: if a bio with
its bi_private = dip gets split, we can easily underflow dip::refs.

By using the same kind of counting as compressed_bio::pending_sectors,
dio can handle both unsplit and split bios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/btrfs_inode.h | 10 +++----
 fs/btrfs/inode.c       | 67 +++++++++++++++++++++---------------------
 2 files changed, 38 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index b3e46aabc3d8..196f74ee102e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -358,11 +358,11 @@ struct btrfs_dio_private {
 	/* Used for bio::bi_size */
 	u32 bytes;
 
-	/*
-	 * References to this structure. There is one reference per in-flight
-	 * bio plus one while we're still setting up.
-	 */
-	refcount_t refs;
+	/* Hit any error for the whole DIO bio */
+	bool errors;
+
+	/* How many bytes are still under IO or not submitted */
+	atomic_t pending_bytes;
 
 	/* dio_bio came from fs/direct-io.c */
 	struct bio *dio_bio;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 007a20a9b076..1aa060de917c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8053,20 +8053,28 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 	return ret;
 }
 
-static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
+static bool dec_and_test_dio_private(struct btrfs_dio_private *dip, bool error,
+	    			     u32 bytes)
 {
-	/*
-	 * This implies a barrier so that stores to dio_bio->bi_status before
-	 * this and loads of dio_bio->bi_status after this are fully ordered.
-	 */
-	if (!refcount_dec_and_test(&dip->refs))
+	ASSERT(bytes <= dip->bytes);
+	ASSERT(bytes <= atomic_read(&dip->pending_bytes));
+
+	if (error)
+		dip->errors = true;
+	return atomic_sub_and_test(bytes, &dip->pending_bytes);
+}
+
+static void dio_private_finish(struct btrfs_dio_private *dip, bool error,
+			       u32 bytes)
+{
+	if (!dec_and_test_dio_private(dip, error, bytes))
 		return;
 
 	if (btrfs_op(dip->dio_bio) == BTRFS_MAP_WRITE) {
 		__endio_write_update_ordered(BTRFS_I(dip->inode),
 					     dip->file_offset,
 					     dip->bytes,
-					     !dip->dio_bio->bi_status);
+					     !dip->errors);
 	} else {
 		unlock_extent(&BTRFS_I(dip->inode)->io_tree,
 			      dip->file_offset,
@@ -8087,10 +8095,10 @@ static blk_status_t submit_dio_repair_bio(struct inode *inode, struct bio *bio,
 
 	BUG_ON(bio_op(bio) == REQ_OP_WRITE);
 
-	refcount_inc(&dip->refs);
+	atomic_add(bio->bi_iter.bi_size, &dip->pending_bytes);
 	ret = btrfs_map_bio(fs_info, bio, mirror_num);
 	if (ret)
-		refcount_dec(&dip->refs);
+		atomic_sub(bio->bi_iter.bi_size, &dip->pending_bytes);
 	return ret;
 }
 
@@ -8166,20 +8174,20 @@ static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode,
 static void btrfs_end_dio_bio(struct bio *bio)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	u32 bi_size = 0;
 	blk_status_t err = bio->bi_status;
 
-	if (err) {
-		struct bvec_iter_all iter_all;
-		struct bio_vec *bvec;
-		u32 bi_size = 0;
-
-		bio_for_each_segment_all(bvec, bio, iter_all)
-			bi_size += bvec->bv_len;
+	__bio_for_each_segment(bvec, bio, iter, btrfs_bio(bio)->iter)
+		bi_size += bvec.bv_len;
 
+	if (err) {
 		btrfs_warn(BTRFS_I(dip->inode)->root->fs_info,
 			   "direct IO failed ino %llu rw %d,%u sector %#Lx len %u err no %d",
 			   btrfs_ino(BTRFS_I(dip->inode)), bio_op(bio),
 			   bio->bi_opf, bio->bi_iter.bi_sector, bi_size, err);
+		dip->errors = true;
 	}
 
 	if (bio_op(bio) == REQ_OP_READ)
@@ -8191,7 +8199,7 @@ static void btrfs_end_dio_bio(struct bio *bio)
 	btrfs_record_physical_zoned(dip->inode, dip->file_offset, bio);
 
 	bio_put(bio);
-	btrfs_dio_private_put(dip);
+	dio_private_finish(dip, err, bi_size);
 }
 
 static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
@@ -8250,7 +8258,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
  */
 static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio,
 							  struct inode *inode,
-							  loff_t file_offset)
+							  loff_t file_offset,
+							  u32 length)
 {
 	const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE);
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
@@ -8270,12 +8279,12 @@ static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio,
 	if (!dip)
 		return NULL;
 
+	atomic_set(&dip->pending_bytes, length);
 	dip->inode = inode;
 	dip->file_offset = file_offset;
 	dip->bytes = dio_bio->bi_iter.bi_size;
 	dip->disk_bytenr = dio_bio->bi_iter.bi_sector << 9;
 	dip->dio_bio = dio_bio;
-	refcount_set(&dip->refs, 1);
 	return dip;
 }
 
@@ -8289,6 +8298,8 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 			     BTRFS_BLOCK_GROUP_RAID56_MASK);
 	struct btrfs_dio_private *dip;
 	struct bio *bio;
+	const u32 length = dio_bio->bi_iter.bi_size;
+	u32 submitted_bytes = 0;
 	u64 start_sector;
 	int async_submit = 0;
 	u64 submit_len;
@@ -8301,7 +8312,7 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 	struct btrfs_dio_data *dio_data = iter->iomap.private;
 	struct extent_map *em = NULL;
 
-	dip = btrfs_create_dio_private(dio_bio, inode, file_offset);
+	dip = btrfs_create_dio_private(dio_bio, inode, file_offset, length);
 	if (!dip) {
 		if (!write) {
 			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
@@ -8311,7 +8322,6 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 		bio_endio(dio_bio);
 		return;
 	}
-
 	if (!write) {
 		/*
 		 * Load the csums up front to reduce csum tree searches and
@@ -8365,17 +8375,7 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 		ASSERT(submit_len >= clone_len);
 		submit_len -= clone_len;
 
-		/*
-		 * Increase the count before we submit the bio so we know
-		 * the end IO handler won't happen before we increase the
-		 * count. Otherwise, the dip might get freed before we're
-		 * done setting it up.
-		 *
-		 * We transfer the initial reference to the last bio, so we
-		 * don't need to increment the reference count for the last one.
-		 */
 		if (submit_len > 0) {
-			refcount_inc(&dip->refs);
 			/*
 			 * If we are submitting more than one bio, submit them
 			 * all asynchronously. The exception is RAID 5 or 6, as
@@ -8390,11 +8390,10 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 						async_submit);
 		if (status) {
 			bio_put(bio);
-			if (submit_len > 0)
-				refcount_dec(&dip->refs);
 			goto out_err_em;
 		}
 
+		submitted_bytes += clone_len;
 		dio_data->submitted += clone_len;
 		clone_offset += clone_len;
 		start_sector += clone_len >> 9;
@@ -8408,7 +8407,7 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 	free_extent_map(em);
 out_err:
 	dip->dio_bio->bi_status = status;
-	btrfs_dio_private_put(dip);
+	dio_private_finish(dip, status, length - submitted_bytes);
 }
 
 const struct iomap_ops btrfs_dio_iomap_ops = {
-- 
2.34.1



* [PATCH v2 07/17] btrfs: introduce btrfs_bio_split() helper
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (5 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 06/17] btrfs: replace btrfs_dio_private::refs with btrfs_dio_private::pending_bytes Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 08/17] btrfs: make data buffered read path to handle split bio properly Qu Wenruo
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

This new function will handle the splitting of a btrfs bio, to
cooperate with the incoming chunk-mapping-time bio split.

This patch will introduce the following new members and functions:

- btrfs_bio::offset_to_original
  Since btrfs_bio::csum still stores the checksum for the original
  logical bytenr, we need to know the offset between the current
  (advanced) bio and the original logical bytenr.

  Thus we need such a new member.
  The new member fits into the existing hole between
  btrfs_bio::mirror_num and btrfs_bio::device, so it should not increase
  the memory usage of btrfs_bio.

- btrfs_bio::parent and btrfs_bio::orig_endio
  To record where the parent bio is and the original endio function.

- btrfs_bio::is_split_bio
  To distinguish bios created by btrfs_bio_split() from those created
  by btrfs_bio_clone*().

  Cloned bios still have their csum pointing to the correct memory,
  while split bios must rely on their parent bbio to grab the csum
  pointer.

- split_bio_endio()
  Just calls the original endio function, then calls bio_endio() on
  the original bio.
  This ensures the original bio is freed only after all cloned bios.

- btrfs_bio_split()
  Splits the original bio into two; the behavior is pretty much the same
  as bio_split(), just with extra btrfs-specific setup.

Currently no caller utilizes the above new members/functions yet.
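
A rough sketch of the intended mapping-time usage (the actual conversion
happens in a later patch of this series; the snippet below is
illustrative only and not the exact code of that patch):

	/*
	 * If the bio crosses the currently mapped range, split off the
	 * part that fits and submit it, then map and submit the remaining
	 * part of @bio (possibly repeating in a loop).
	 */
	if (bio->bi_iter.bi_size > map_length) {
		struct bio *split;

		split = btrfs_bio_split(fs_info, bio, map_length);
		/* submit @split against the current btrfs_io_context ... */
		/* ... and handle the remaining @bio with a new mapping */
	}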

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 82 +++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/extent_io.h |  2 ++
 fs/btrfs/volumes.h   | 43 +++++++++++++++++++++--
 3 files changed, 123 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index efd109caf95b..095bdc4775e7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3011,7 +3011,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 	int ret;
 	struct bvec_iter_all iter_all;
 
-	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		bool uptodate = !bio->bi_status;
 		struct page *page = bvec->bv_page;
@@ -3190,6 +3189,87 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size)
 	return bio;
 }
 
+/*
+ * A very simple wrapper to call original endio function and then
+ * call bio_endio() on the parent bio to decrease its bi_remaining count.
+ */
+static void split_bio_endio(struct bio *bio)
+{
+	struct btrfs_bio *bbio = btrfs_bio(bio);
+	/* After endio bbio could be freed, thus grab the info before endio */
+	struct bio *parent = bbio->parent;
+
+	/*
+	 * BIO_CLONED can even be set for our parent bio (DIO use clones
+	 * the initial bio, then uses the cloned one for IO).
+	 * So here we don't check BIO_CLONED for parent.
+	 */
+	ASSERT(bio_flagged(bio, BIO_CLONED) && bbio->is_split_bio);
+	ASSERT(parent && !btrfs_bio(parent)->is_split_bio);
+
+	bio->bi_end_io = bbio->orig_endio;
+	bio_endio(bio);
+	bio_endio(parent);
+}
+
+/*
+ * Pretty much like bio_split(), caller needs to ensure @src is not freed
+ * before the newly allocated bio, as the new bio is relying on @src for
+ * its bvecs.
+ */
+struct bio *btrfs_bio_split(struct btrfs_fs_info *fs_info,
+			    struct bio *src, unsigned int bytes)
+{
+	struct bio *new;
+	struct btrfs_bio *src_bbio = btrfs_bio(src);
+	struct btrfs_bio *new_bbio;
+	const unsigned int old_offset = src_bbio->offset_to_original;
+
+	/* Src should not be split */
+	ASSERT(!src_bbio->is_split_bio);
+	ASSERT(IS_ALIGNED(bytes, fs_info->sectorsize));
+	ASSERT(bytes < src->bi_iter.bi_size);
+
+	/*
+	 * We're in fact chaining the new bio to the parent, but we still want
+	 * to have independent bi_private/bi_endio, thus we need to manually
+	 * increase the remaining for the source, just like bio_chain().
+	 */
+	bio_inc_remaining(src);
+
+	/* Bioset backed split should not fail */
+	new = bio_split(src, bytes >> SECTOR_SHIFT, GFP_NOFS, &btrfs_bioset);
+	new_bbio = btrfs_bio(new);
+	new_bbio->offset_to_original = old_offset;
+	new_bbio->iter = new->bi_iter;
+	new_bbio->orig_endio = src->bi_end_io;
+	new_bbio->parent = src;
+	new_bbio->endio_type = src_bbio->endio_type;
+	new_bbio->is_split_bio = 1;
+	new->bi_end_io = split_bio_endio;
+
+	/*
+	 * This is very tricky, as if any endio has extra refcount on
+	 * bi_private, we will be screwed up.
+	 *
+	 * We workaround this hacky behavior by reviewing all the involved
+	 * endio stacks. Making sure only split-safe endio remap are called.
+	 *
+	 * Split-unsafe endio remap like btrfs_bio_wq_end_io() will be called
+	 * after btrfs_bio_split().
+	 */
+	new->bi_private = src->bi_private;
+
+	src_bbio->offset_to_original += bytes;
+
+	/*
+	 * For direct IO, @src is a cloned bio thus bbio::iter still points to
+	 * the full bio. Need to update it too.
+	 */
+	src_bbio->iter = src->bi_iter;
+	return new;
+}
+
 /**
  * Attempt to add a page to bio
  *
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 0399cf8e3c32..cb727b77ecda 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -280,6 +280,8 @@ void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 struct bio *btrfs_bio_alloc(unsigned int nr_iovecs);
 struct bio *btrfs_bio_clone(struct bio *bio);
 struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size);
+struct bio *btrfs_bio_split(struct btrfs_fs_info *fs_info,
+			    struct bio *src, unsigned int bytes);
 
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b2081b03990a..462b32c89abc 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -332,15 +332,52 @@ struct btrfs_bio {
 
 	/*
 	 * To tell which workqueue the bio's endio should be executed in.
+	 * This member is to make sure btrfs_bio_wq_end_io() is the last
+	 * endio remap in the stack.
 	 *
 	 * Only for read bios.
 	 */
-	u16 endio_type;
+	u8 endio_type;
+
+	/*
+	 * To tell if this btrfs bio is split or just cloned.
+	 * Both btrfs_bio_clone*() and btrfs_bio_split() will make bbio->bio
+	 * to have BIO_CLONED flag.
+	 * But cloned bio still has its bbio::csum pointed to correct memory,
+	 * unlike split bio relies on its parent bbio to grab csum.
+	 *
+	 * Thus we need this extra flag to distinguish those cloned bios.
+	 */
+	u8 is_split_bio;
+
+	/*
+	 * Records the offset we're from the original bio.
+	 *
+	 * Since a btrfs_bio can be split but our csum is always for the
+	 * original logical bytenr, we need a way to know the byte offset
+	 * from the original logical bytenr to do proper csum verification.
+	 */
+	unsigned int offset_to_original;
 
 	/* @device is for stripe IO submission. */
 	struct btrfs_device *device;
-	u8 *csum;
-	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
+
+	union {
+		/*
+		 * For the parent bio recording the csum for the original
+		 * logical bytenr
+		 */
+		struct {
+			u8 *csum;
+			u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
+		};
+
+		/* For child (split) bio to record where its parent is */
+		struct {
+			struct bio *parent;
+			bio_end_io_t *orig_endio;
+		};
+	};
 	/*
 	 * Saved bio::bi_iter before submission.
 	 *
-- 
2.34.1



* [PATCH v2 08/17] btrfs: make data buffered read path to handle split bio properly
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (6 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 07/17] btrfs: introduce btrfs_bio_split() helper Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 09/17] btrfs: make data buffered write endio function to be split bio compatible Qu Wenruo
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

This involves the following modifications:

- Use bio_for_each_segment() instead of bio_for_each_segment_all()
  bio_for_each_segment_all() will iterate all bvecs, even if they are
  not referred to by the current bi_iter.

  The *_all() variant can only be used if the bio is never split.

  Change it to a __bio_for_each_segment() call so we won't have endio
  called on the same range by both the split and the parent bio, and so
  it can handle both split and unsplit bios.

- Make check_data_csum() take bbio->offset_to_original into
  consideration
  Since a btrfs bio can now be split, both split and original bios can
  start at some offset from the original logical bytenr.

  Take btrfs_bio::offset_to_original into consideration to get the
  correct checksum offset.

- Remove the BIO_CLONED ASSERT() in submit_read_repair()

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 34 +++++++++++++++++++---------------
 fs/btrfs/inode.c     | 23 +++++++++++++++++++++--
 fs/btrfs/volumes.h   |  3 ++-
 3 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 095bdc4775e7..049da3811bae 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2741,10 +2741,9 @@ static blk_status_t submit_read_repair(struct inode *inode,
 	ASSERT(error_bitmap);
 
 	/*
-	 * We only get called on buffered IO, thus page must be mapped and bio
-	 * must not be cloned.
-	 */
-	ASSERT(page->mapping && !bio_flagged(failed_bio, BIO_CLONED));
+	 * We only get called on buffered IO, thus page must be mapped
+	*/
+	ASSERT(page->mapping);
 
 	/* Iterate through all the sectors in the range */
 	for (i = 0; i < nr_bits; i++) {
@@ -2998,7 +2997,8 @@ static struct extent_buffer *find_extent_buffer_readpage(
  */
 static void end_bio_extent_readpage(struct bio *bio)
 {
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
+	struct bvec_iter iter;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	struct extent_io_tree *tree, *failure_tree;
 	struct processed_extent processed = { 0 };
@@ -3009,11 +3009,15 @@ static void end_bio_extent_readpage(struct bio *bio)
 	u32 bio_offset = 0;
 	int mirror;
 	int ret;
-	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
+	/*
+	 * We should have saved the original bi_iter, and then start iterating
+	 * using that saved iter, as at endio time bi_iter is not reliable.
+	 */
+	ASSERT(bbio->iter.bi_size);
+	__bio_for_each_segment(bvec, bio, iter, bbio->iter) {
 		bool uptodate = !bio->bi_status;
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 		const u32 sectorsize = fs_info->sectorsize;
@@ -3036,19 +3040,19 @@ static void end_bio_extent_readpage(struct bio *bio)
 		 * for unaligned offsets, and an error if they don't add up to
 		 * a full sector.
 		 */
-		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+		if (!IS_ALIGNED(bvec.bv_offset, sectorsize))
 			btrfs_err(fs_info,
 		"partial page read in btrfs with offset %u and length %u",
-				  bvec->bv_offset, bvec->bv_len);
-		else if (!IS_ALIGNED(bvec->bv_offset + bvec->bv_len,
+				  bvec.bv_offset, bvec.bv_len);
+		else if (!IS_ALIGNED(bvec.bv_offset + bvec.bv_len,
 				     sectorsize))
 			btrfs_info(fs_info,
 		"incomplete page read with offset %u and length %u",
-				   bvec->bv_offset, bvec->bv_len);
+				   bvec.bv_offset, bvec.bv_len);
 
-		start = page_offset(page) + bvec->bv_offset;
-		end = start + bvec->bv_len - 1;
-		len = bvec->bv_len;
+		start = page_offset(page) + bvec.bv_offset;
+		end = start + bvec.bv_len - 1;
+		len = bvec.bv_len;
 
 		mirror = bbio->mirror_num;
 		if (likely(uptodate)) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1aa060de917c..186304c69900 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3225,6 +3225,24 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 				       finish_ordered_fn, uptodate);
 }
 
+static u8 *bbio_get_real_csum(struct btrfs_fs_info *fs_info,
+			      struct btrfs_bio *bbio)
+{
+	u8 *ret;
+
+	/* Split bbio needs to grab csum from its parent */
+	if (bbio->is_split_bio)
+		ret = btrfs_bio(bbio->parent)->csum;
+	else
+		ret = bbio->csum;
+
+	if (ret == NULL)
+		return ret;
+
+	return ret + (bbio->offset_to_original >> fs_info->sectorsize_bits) *
+		     fs_info->csum_size;
+}
+
 /*
  * check_data_csum - verify checksum of one sector of uncompressed data
  * @inode:	inode
@@ -3252,7 +3270,8 @@ static int check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
 	ASSERT(pgoff + len <= PAGE_SIZE);
 
 	offset_sectors = bio_offset >> fs_info->sectorsize_bits;
-	csum_expected = ((u8 *)bbio->csum) + offset_sectors * csum_size;
+	csum_expected = bbio_get_real_csum(fs_info, bbio) +
+			offset_sectors * csum_size;
 
 	kaddr = kmap_atomic(page);
 	shash->tfm = fs_info->csum_shash;
@@ -3310,7 +3329,7 @@ unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
 	 * Normally this should be covered by above check for compressed read
 	 * or the next check for NODATASUM.  Just do a quicker exit here.
 	 */
-	if (bbio->csum == NULL)
+	if (bbio_get_real_csum(fs_info, bbio) == NULL)
 		return 0;
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 462b32c89abc..a7f3fd4b4226 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -400,7 +400,8 @@ static inline struct btrfs_bio *btrfs_bio(struct bio *bio)
 
 static inline void btrfs_bio_free_csum(struct btrfs_bio *bbio)
 {
-	if (bbio->csum != bbio->csum_inline) {
+	/* Only free the csum if we're not a split bio */
+	if (!bbio->is_split_bio && bbio->csum != bbio->csum_inline) {
 		kfree(bbio->csum);
 		bbio->csum = NULL;
 	}
-- 
2.34.1



* [PATCH v2 09/17] btrfs: make data buffered write endio function to be split bio compatible
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (7 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 08/17] btrfs: make data buffered read path to handle split bio properly Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 10/17] btrfs: make metadata write endio functions " Qu Wenruo
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

We only need to change the bio_for_each_segment_all() call to a
__bio_for_each_segment() call, using btrfs_bio::iter as the initial
bi_iter.

Now the endio function can handle both split and unsplit bios well.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 049da3811bae..952789ed650d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2833,31 +2833,31 @@ void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
 static void end_bio_extent_writepage(struct bio *bio)
 {
 	int error = blk_status_to_errno(bio->bi_status);
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
+	struct bvec_iter iter;
 	u64 start;
 	u64 end;
-	struct bvec_iter_all iter_all;
 	bool first_bvec = true;
 
-	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+	ASSERT(btrfs_bio(bio)->iter.bi_size);
+	__bio_for_each_segment(bvec, bio, iter, btrfs_bio(bio)->iter) {
+		struct page *page = bvec.bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 		const u32 sectorsize = fs_info->sectorsize;
 
 		/* Our read/write should always be sector aligned. */
-		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+		if (!IS_ALIGNED(bvec.bv_offset, sectorsize))
 			btrfs_err(fs_info,
 		"partial page write in btrfs with offset %u and length %u",
-				  bvec->bv_offset, bvec->bv_len);
-		else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
+				  bvec.bv_offset, bvec.bv_len);
+		else if (!IS_ALIGNED(bvec.bv_len, sectorsize))
 			btrfs_info(fs_info,
 		"incomplete page write with offset %u and length %u",
-				   bvec->bv_offset, bvec->bv_len);
+				   bvec.bv_offset, bvec.bv_len);
 
-		start = page_offset(page) + bvec->bv_offset;
-		end = start + bvec->bv_len - 1;
+		start = page_offset(page) + bvec.bv_offset;
+		end = start + bvec.bv_len - 1;
 
 		if (first_bvec) {
 			btrfs_record_physical_zoned(inode, start, bio);
@@ -2866,7 +2866,7 @@ static void end_bio_extent_writepage(struct bio *bio)
 
 		end_extent_writepage(page, error, start, end);
 
-		btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
+		btrfs_page_clear_writeback(fs_info, page, start, bvec.bv_len);
 	}
 
 	bio_put(bio);
-- 
2.34.1



* [PATCH v2 10/17] btrfs: make metadata write endio functions to be split bio compatible
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (8 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 09/17] btrfs: make data buffered write endio function to be split bio compatible Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 11/17] btrfs: make dec_and_test_compressed_bio() " Qu Wenruo
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

We only need to convert the bio_for_each_segment_all() call into a
__bio_for_each_segment() call, using btrfs_bio::iter as the initial
iterator.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 952789ed650d..cb99b55ccf87 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4502,20 +4502,20 @@ static struct extent_buffer *find_extent_buffer_nolock(
 static void end_bio_subpage_eb_writepage(struct bio *bio)
 {
 	struct btrfs_fs_info *fs_info;
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter iter;
+	struct bio_vec bvec;
 
 	fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
 	ASSERT(fs_info->sectorsize < PAGE_SIZE);
 
-	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
-		u64 bvec_start = page_offset(page) + bvec->bv_offset;
-		u64 bvec_end = bvec_start + bvec->bv_len - 1;
+	ASSERT(btrfs_bio(bio)->iter.bi_size);
+	__bio_for_each_segment(bvec, bio, iter, btrfs_bio(bio)->iter) {
+		struct page *page = bvec.bv_page;
+		u64 bvec_start = page_offset(page) + bvec.bv_offset;
+		u64 bvec_end = bvec_start + bvec.bv_len - 1;
 		u64 cur_bytenr = bvec_start;
 
-		ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
+		ASSERT(IS_ALIGNED(bvec.bv_len, fs_info->nodesize));
 
 		/* Iterate through all extent buffers in the range */
 		while (cur_bytenr <= bvec_end) {
@@ -4558,14 +4558,14 @@ static void end_bio_subpage_eb_writepage(struct bio *bio)
 
 static void end_bio_extent_buffer_writepage(struct bio *bio)
 {
-	struct bio_vec *bvec;
 	struct extent_buffer *eb;
+	struct bvec_iter iter;
+	struct bio_vec bvec;
 	int done;
-	struct bvec_iter_all iter_all;
 
-	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+	ASSERT(btrfs_bio(bio)->iter.bi_size);
+	__bio_for_each_segment(bvec, bio, iter, btrfs_bio(bio)->iter) {
+		struct page *page = bvec.bv_page;
 
 		eb = (struct extent_buffer *)page->private;
 		BUG_ON(!eb);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 11/17] btrfs: make dec_and_test_compressed_bio() to be split bio compatible
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (9 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 10/17] btrfs: make metadata write endio functions " Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 12/17] btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block() Qu Wenruo
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

Compression read and write endio functions all rely on
dec_and_test_compressed_bio() to determine whether they are handling the
last bio.

So here we only need to convert the bio_for_each_segment_all() call into
__bio_for_each_segment(), so that the compression read/write endio
functions handle both split and unsplit bios correctly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 8668c5190805..8b4b84b59b0c 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -205,18 +205,14 @@ static int check_compressed_csum(struct btrfs_inode *inode, struct bio *bio,
 static bool dec_and_test_compressed_bio(struct compressed_bio *cb, struct bio *bio)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
+	struct bio_vec bvec;
+	struct bvec_iter iter;
 	unsigned int bi_size = 0;
 	bool last_io = false;
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
 
-	/*
-	 * At endio time, bi_iter.bi_size doesn't represent the real bio size.
-	 * Thus here we have to iterate through all segments to grab correct
-	 * bio size.
-	 */
-	bio_for_each_segment_all(bvec, bio, iter_all)
-		bi_size += bvec->bv_len;
+	ASSERT(btrfs_bio(bio)->iter.bi_size);
+	__bio_for_each_segment(bvec, bio, iter, btrfs_bio(bio)->iter)
+		bi_size += bvec.bv_len;
 
 	if (bio->bi_status)
 		cb->errors = 1;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 12/17] btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block()
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (10 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 11/17] btrfs: make dec_and_test_compressed_bio() " Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 13/17] btrfs: allow btrfs_map_bio() to split bio according to chunk stripe boundaries Qu Wenruo
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

For profiles other than RAID56, __btrfs_map_block() returns @map_length
as min(stripe_end, logical + *length), which matches the result from
btrfs_get_io_geometry().

But for RAID56, __btrfs_map_block() returns @map_length as stripe_len.

This inconsistency will hurt the incoming bio split at btrfs_map_bio()
time, as we will use @map_length as the bio split size (see the worked
example after the list below).

Fix this behavior by:

- Return @map_length by the same calculation as other profiles

- Save stripe_len into btrfs_io_context

- Pass btrfs_io_context::stripe_len to raid56_*() functions

- Update the raid56_*() functions to make their stripe_len parameter
  more explicit

- Update scrub_stripe_index_and_offset() to properly name its
  parameters

- Add extra ASSERT()s to make sure the passed stripe_len is correct
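
A worked example of the fixed calculation for the full stripe mapping
case (hypothetical numbers, for illustration only):

	/*
	 * Assume RAID5 over 3 devices (2 data stripes), stripe_len = 64K,
	 * the chunk starts at em->start, and we map a 128K range starting
	 * at logical = em->start + 80K.
	 *
	 * raid56_full_stripe_start = round_down(80K, 2 * 64K) = 0
	 * full stripe end          = em->start + 0 + 2 * 64K  = em->start + 128K
	 *
	 * *length = min(em->start + 128K, logical + 128K) - logical
	 *         = (em->start + 128K) - (em->start + 80K)
	 *         = 48K
	 *
	 * The old behavior would return stripe_len (64K) instead, which
	 * crosses the full stripe end once used as a bio split size.
	 */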

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c  | 12 ++++++++++--
 fs/btrfs/raid56.h  |  2 +-
 fs/btrfs/scrub.c   | 14 ++++++++------
 fs/btrfs/volumes.c | 13 ++++++++++---
 fs/btrfs/volumes.h |  1 +
 5 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 13e726c88a81..d35cfd750b76 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -969,6 +969,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 	int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
 	void *p;
 
+	ASSERT(stripe_len == BTRFS_STRIPE_LEN);
+
 	rbio = kzalloc(sizeof(*rbio) +
 		       sizeof(*rbio->stripe_pages) * num_pages +
 		       sizeof(*rbio->bio_pages) * num_pages +
@@ -1725,6 +1727,9 @@ int raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc,
 	struct blk_plug_cb *cb;
 	int ret;
 
+	/* Currently we only support fixed stripe len */
+	ASSERT(stripe_len == BTRFS_STRIPE_LEN);
+
 	rbio = alloc_rbio(fs_info, bioc, stripe_len);
 	if (IS_ERR(rbio)) {
 		btrfs_put_bioc(bioc);
@@ -2122,6 +2127,9 @@ int raid56_parity_recover(struct bio *bio, struct btrfs_io_context *bioc,
 	struct btrfs_raid_bio *rbio;
 	int ret;
 
+	/* Currently we only support fixed stripe len */
+	ASSERT(stripe_len == BTRFS_STRIPE_LEN);
+
 	if (generic_io) {
 		ASSERT(bioc->mirror_num == mirror_num);
 		btrfs_bio(bio)->mirror_num = mirror_num;
@@ -2671,12 +2679,12 @@ void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio)
 
 struct btrfs_raid_bio *
 raid56_alloc_missing_rbio(struct bio *bio, struct btrfs_io_context *bioc,
-			  u64 length)
+			  u64 stripe_len)
 {
 	struct btrfs_fs_info *fs_info = bioc->fs_info;
 	struct btrfs_raid_bio *rbio;
 
-	rbio = alloc_rbio(fs_info, bioc, length);
+	rbio = alloc_rbio(fs_info, bioc, stripe_len);
 	if (IS_ERR(rbio))
 		return NULL;
 
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 72c00fc284b5..7322dcae4498 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -46,7 +46,7 @@ void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio);
 
 struct btrfs_raid_bio *
 raid56_alloc_missing_rbio(struct bio *bio, struct btrfs_io_context *bioc,
-			  u64 length);
+			  u64 stripe_len);
 void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio);
 
 int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 15a123e67108..59bb2d08e697 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1229,13 +1229,15 @@ static inline int scrub_nr_raid_mirrors(struct btrfs_io_context *bioc)
 
 static inline void scrub_stripe_index_and_offset(u64 logical, u64 map_type,
 						 u64 *raid_map,
-						 u64 mapped_length,
+						 u64 stripe_len,
 						 int nstripes, int mirror,
 						 int *stripe_index,
 						 u64 *stripe_offset)
 {
 	int i;
 
+	ASSERT(stripe_len == BTRFS_STRIPE_LEN);
+
 	if (map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
 		/* RAID5/6 */
 		for (i = 0; i < nstripes; i++) {
@@ -1244,7 +1246,7 @@ static inline void scrub_stripe_index_and_offset(u64 logical, u64 map_type,
 				continue;
 
 			if (logical >= raid_map[i] &&
-			    logical < raid_map[i] + mapped_length)
+			    logical < raid_map[i] + stripe_len)
 				break;
 		}
 
@@ -1349,7 +1351,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 			scrub_stripe_index_and_offset(logical,
 						      bioc->map_type,
 						      bioc->raid_map,
-						      mapped_length,
+						      bioc->stripe_len,
 						      bioc->num_stripes -
 						      bioc->num_tgtdevs,
 						      mirror_index,
@@ -1401,7 +1403,7 @@ static int scrub_submit_raid56_bio_wait(struct btrfs_fs_info *fs_info,
 
 	mirror_num = spage->sblock->pagev[0]->mirror_num;
 	ret = raid56_parity_recover(bio, spage->recover->bioc,
-				    spage->recover->map_length,
+				    spage->recover->bioc->stripe_len,
 				    mirror_num, 0);
 	if (ret)
 		return ret;
@@ -2230,7 +2232,7 @@ static void scrub_missing_raid56_pages(struct scrub_block *sblock)
 	bio->bi_private = sblock;
 	bio->bi_end_io = scrub_missing_raid56_end_io;
 
-	rbio = raid56_alloc_missing_rbio(bio, bioc, length);
+	rbio = raid56_alloc_missing_rbio(bio, bioc, bioc->stripe_len);
 	if (!rbio)
 		goto rbio_out;
 
@@ -2846,7 +2848,7 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
 	bio->bi_private = sparity;
 	bio->bi_end_io = scrub_parity_bio_endio;
 
-	rbio = raid56_parity_alloc_scrub_rbio(bio, bioc, length,
+	rbio = raid56_parity_alloc_scrub_rbio(bio, bioc, bioc->stripe_len,
 					      sparity->scrub_dev,
 					      sparity->dbitmap,
 					      sparity->nsectors);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fba08cfcbd4e..6d962450e355 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6051,6 +6051,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
 		ret = -ENOMEM;
 		goto out;
 	}
+	bioc->stripe_len = map->stripe_len;
 
 	for (i = 0; i < num_stripes; i++) {
 		bioc->stripes[i].physical =
@@ -6406,6 +6407,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 {
 	struct extent_map *em;
 	struct map_lookup *map;
+	const u64 orig_length = *length;
 	u64 stripe_offset;
 	u64 stripe_nr;
 	u64 stripe_len;
@@ -6427,6 +6429,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 	ASSERT(bioc_ret);
 	ASSERT(op != BTRFS_MAP_DISCARD);
+	ASSERT(orig_length);
 
 	em = btrfs_get_chunk_map(fs_info, logical, *length);
 	ASSERT(!IS_ERR(em));
@@ -6522,7 +6525,10 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 			num_stripes = map->num_stripes;
 			max_errors = nr_parity_stripes(map);
 
-			*length = map->stripe_len;
+			/* Return the length to the full stripe end */
+			*length = min(raid56_full_stripe_start + em->start +
+				      data_stripes * stripe_len,
+				      logical + orig_length) - logical;
 			stripe_index = 0;
 			stripe_offset = 0;
 		} else {
@@ -6574,6 +6580,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 		ret = -ENOMEM;
 		goto out;
 	}
+	bioc->stripe_len = map->stripe_len;
 
 	for (i = 0; i < num_stripes; i++) {
 		bioc->stripes[i].physical = map->stripes[stripe_index].physical +
@@ -6824,9 +6831,9 @@ static int submit_one_mapped_range(struct btrfs_fs_info *fs_info, struct bio *bi
 		/* In this case, map_length has been set to the length of
 		   a single stripe; not the whole write */
 		if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
-			ret = raid56_parity_write(bio, bioc, map_length);
+			ret = raid56_parity_write(bio, bioc, bioc->stripe_len);
 		} else {
-			ret = raid56_parity_recover(bio, bioc, map_length,
+			ret = raid56_parity_recover(bio, bioc, bioc->stripe_len,
 						    mirror_num, 1);
 		}
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index a7f3fd4b4226..04c016a844f8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -449,6 +449,7 @@ struct btrfs_io_context {
 	struct bio *orig_bio;
 	void *private;
 	atomic_t error;
+	u32 stripe_len;
 	int max_errors;
 	int num_stripes;
 	int mirror_num;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 13/17] btrfs: allow btrfs_map_bio() to split bio according to chunk stripe boundaries
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (11 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 12/17] btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block() Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 14/17] btrfs: remove buffered IO stripe boundary calculation Qu Wenruo
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

With the new btrfs_bio_split() helper, we are able to split bios
according to chunk stripe boundaries at btrfs_map_bio() time.

This ability is not yet utilized though, since bios are still split at
buffered/compressed/direct IO time for now.
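
The helper's contract, as assumed here from its call site in the diff
below (a sketch of the inferred semantics, not the real declaration from
the earlier patch in this series):

	/*
	 * Assumed semantics: return a new bio covering the first
	 * @map_length bytes of @bio, and advance @bio so it starts right
	 * after the returned range.  The caller submits the returned bio,
	 * then keeps looping on the shrunken original until the whole
	 * range has been mapped.
	 */
	struct bio *btrfs_bio_split(struct btrfs_fs_info *fs_info,
				    struct bio *bio, u64 map_length);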

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 50 +++++++++++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6d962450e355..301fc34320ed 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6879,30 +6879,44 @@ static int submit_one_mapped_range(struct btrfs_fs_info *fs_info, struct bio *bi
 blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			   int mirror_num)
 {
-	u64 logical = bio->bi_iter.bi_sector << 9;
-	u64 length = 0;
-	u64 map_length;
+	const u64 orig_logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	const unsigned int orig_length = bio->bi_iter.bi_size;
+	const enum btrfs_map_op op = btrfs_op(bio);
+	u64 cur_logical = orig_logical;
 	int ret;
-	struct btrfs_io_context *bioc = NULL;
 
-	length = bio->bi_iter.bi_size;
-	map_length = length;
+	while (cur_logical < orig_logical + orig_length) {
+		u64 map_length = orig_logical + orig_length - cur_logical;
+		struct btrfs_io_context *bioc = NULL;
+		struct bio *cur_bio;
 
-	btrfs_bio_counter_inc_blocked(fs_info);
-	btrfs_bio_save_iter(btrfs_bio(bio));
-	ret = __btrfs_map_block(fs_info, btrfs_op(bio), logical,
-				&map_length, &bioc, mirror_num, 1);
-	if (ret) {
-		btrfs_bio_counter_dec(fs_info);
-		return errno_to_blk_status(ret);
-	}
+		btrfs_bio_save_iter(btrfs_bio(bio));
+		ret = __btrfs_map_block(fs_info, op, cur_logical, &map_length,
+					&bioc, mirror_num, 1);
+		if (ret)
+			return errno_to_blk_status(ret);
 
-	ret = submit_one_mapped_range(fs_info, bio, bioc, map_length, mirror_num);
-	if (ret < 0) {
+		if (cur_logical + map_length < orig_logical + orig_length) {
+			/*
+			 * For now zoned write should never cross stripe
+			 * boundary
+			 */
+			ASSERT(bio_op(bio) != REQ_OP_ZONE_APPEND);
+
+			/* Split the bio */
+			cur_bio = btrfs_bio_split(fs_info, bio, map_length);
+		} else {
+			/* Use the existing bio directly */
+			cur_bio = bio;
+		}
+		btrfs_bio_counter_inc_blocked(fs_info);
+		ret = submit_one_mapped_range(fs_info, cur_bio, bioc,
+					      map_length, mirror_num);
 		btrfs_bio_counter_dec(fs_info);
-		return errno_to_blk_status(ret);
+		if (ret < 0)
+			return errno_to_blk_status(ret);
+		cur_logical += map_length;
 	}
-	btrfs_bio_counter_dec(fs_info);
 	return BLK_STS_OK;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 14/17] btrfs: remove buffered IO stripe boundary calculation
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (12 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 13/17] btrfs: allow btrfs_map_bio() to split bio according to chunk stripe boundaries Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 15/17] btrfs: remove stripe boundary calculation for compressed IO Qu Wenruo
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

Remove btrfs_bio_ctrl::len_to_stripe_boundary, so that buffered IO no
longer limits its bio size according to the stripe length.

This moves the bio split to btrfs_map_bio() for all buffered IO.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 23 ++---------------------
 1 file changed, 2 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cb99b55ccf87..97045927b763 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3307,7 +3307,7 @@ static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 
 	ASSERT(bio);
 	/* The limit should be calculated when bio_ctrl->bio is allocated */
-	ASSERT(bio_ctrl->len_to_oe_boundary && bio_ctrl->len_to_stripe_boundary);
+	ASSERT(bio_ctrl->len_to_oe_boundary);
 	if (bio_ctrl->bio_flags != bio_flags)
 		return 0;
 
@@ -3318,9 +3318,7 @@ static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	if (!contig)
 		return 0;
 
-	real_size = min(bio_ctrl->len_to_oe_boundary,
-			bio_ctrl->len_to_stripe_boundary) - bio_size;
-	real_size = min(real_size, size);
+	real_size = min(bio_ctrl->len_to_oe_boundary - bio_size, size);
 
 	/*
 	 * If real_size is 0, never call bio_add_*_page(), as even size is 0,
@@ -3341,11 +3339,8 @@ static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
 			       struct btrfs_inode *inode, u64 file_offset)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_io_geometry geom;
 	struct btrfs_ordered_extent *ordered;
-	struct extent_map *em;
 	u64 logical = (bio_ctrl->bio->bi_iter.bi_sector << SECTOR_SHIFT);
-	int ret;
 
 	/*
 	 * Pages for compressed extent are never submitted to disk directly,
@@ -3356,22 +3351,8 @@ static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
 	 */
 	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED) {
 		bio_ctrl->len_to_oe_boundary = U32_MAX;
-		bio_ctrl->len_to_stripe_boundary = U32_MAX;
 		return 0;
 	}
-	em = btrfs_get_chunk_map(fs_info, logical, fs_info->sectorsize);
-	if (IS_ERR(em))
-		return PTR_ERR(em);
-	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio_ctrl->bio),
-				    logical, &geom);
-	free_extent_map(em);
-	if (ret < 0) {
-		return ret;
-	}
-	if (geom.len > U32_MAX)
-		bio_ctrl->len_to_stripe_boundary = U32_MAX;
-	else
-		bio_ctrl->len_to_stripe_boundary = (u32)geom.len;
 
 	if (!btrfs_is_zoned(fs_info) ||
 	    bio_op(bio_ctrl->bio) != REQ_OP_ZONE_APPEND) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 15/17] btrfs: remove stripe boundary calculation for compressed IO
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (13 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 14/17] btrfs: remove buffered IO stripe boundary calculation Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 16/17] btrfs: remove the stripe boundary calculation for direct IO Qu Wenruo
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

For compressed IO, we calculate the next stripe start inside
alloc_compressed_bio().

Since btrfs_map_bio() can now handle the bio split, we no longer need to
calculate that boundary.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c | 49 +++++-------------------------------------
 1 file changed, 5 insertions(+), 44 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 8b4b84b59b0c..70af7d3973b7 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -442,21 +442,15 @@ static blk_status_t submit_compressed_bio(struct btrfs_fs_info *fs_info,
  *                      from or written to.
  * @endio_func:         The endio function to call after the IO for compressed data
  *                      is finished.
- * @next_stripe_start:  Return value of logical bytenr of where next stripe starts.
- *                      Let the caller know to only fill the bio up to the stripe
- *                      boundary.
  */
 
 
 static struct bio *alloc_compressed_bio(struct compressed_bio *cb, u64 disk_bytenr,
-					unsigned int opf, bio_end_io_t endio_func,
-					u64 *next_stripe_start)
+					unsigned int opf, bio_end_io_t endio_func)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
-	struct btrfs_io_geometry geom;
 	struct extent_map *em;
 	struct bio *bio;
-	int ret;
 
 	bio = btrfs_bio_alloc(BIO_MAX_VECS);
 
@@ -473,14 +467,7 @@ static struct bio *alloc_compressed_bio(struct compressed_bio *cb, u64 disk_byte
 
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
 		bio_set_dev(bio, em->map_lookup->stripes[0].dev->bdev);
-
-	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), disk_bytenr, &geom);
 	free_extent_map(em);
-	if (ret < 0) {
-		bio_put(bio);
-		return ERR_PTR(ret);
-	}
-	*next_stripe_start = disk_bytenr + geom.len;
 
 	return bio;
 }
@@ -506,7 +493,6 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	struct bio *bio = NULL;
 	struct compressed_bio *cb;
 	u64 cur_disk_bytenr = disk_start;
-	u64 next_stripe_start;
 	blk_status_t ret;
 	int skip_sum = inode->flags & BTRFS_INODE_NODATASUM;
 	const bool use_append = btrfs_use_zone_append(inode, disk_start);
@@ -539,28 +525,19 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 		/* Allocate new bio if submitted or not yet allocated */
 		if (!bio) {
 			bio = alloc_compressed_bio(cb, cur_disk_bytenr,
-				bio_op | write_flags, end_compressed_bio_write,
-				&next_stripe_start);
+				bio_op | write_flags, end_compressed_bio_write);
 			if (IS_ERR(bio)) {
 				ret = errno_to_blk_status(PTR_ERR(bio));
 				bio = NULL;
 				goto finish_cb;
 			}
 		}
-		/*
-		 * We should never reach next_stripe_start start as we will
-		 * submit comp_bio when reach the boundary immediately.
-		 */
-		ASSERT(cur_disk_bytenr != next_stripe_start);
-
 		/*
 		 * We have various limits on the real read size:
-		 * - stripe boundary
 		 * - page boundary
 		 * - compressed length boundary
 		 */
-		real_size = min_t(u64, U32_MAX, next_stripe_start - cur_disk_bytenr);
-		real_size = min_t(u64, real_size, PAGE_SIZE - offset_in_page(offset));
+		real_size = min_t(u64, U32_MAX, PAGE_SIZE - offset_in_page(offset));
 		real_size = min_t(u64, real_size, compressed_len - offset);
 		ASSERT(IS_ALIGNED(real_size, fs_info->sectorsize));
 
@@ -575,9 +552,6 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			submit = true;
 
 		cur_disk_bytenr += added;
-		/* Reached stripe boundary */
-		if (cur_disk_bytenr == next_stripe_start)
-			submit = true;
 
 		/* Finished the range */
 		if (cur_disk_bytenr == disk_start + compressed_len)
@@ -797,7 +771,6 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	struct bio *comp_bio = NULL;
 	const u64 disk_bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 cur_disk_byte = disk_bytenr;
-	u64 next_stripe_start;
 	u64 file_offset;
 	u64 em_len;
 	u64 em_start;
@@ -878,27 +851,19 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		/* Allocate new bio if submitted or not yet allocated */
 		if (!comp_bio) {
 			comp_bio = alloc_compressed_bio(cb, cur_disk_byte,
-					REQ_OP_READ, end_compressed_bio_read,
-					&next_stripe_start);
+					REQ_OP_READ, end_compressed_bio_read);
 			if (IS_ERR(comp_bio)) {
 				ret = errno_to_blk_status(PTR_ERR(comp_bio));
 				comp_bio = NULL;
 				goto finish_cb;
 			}
 		}
-		/*
-		 * We should never reach next_stripe_start start as we will
-		 * submit comp_bio when reach the boundary immediately.
-		 */
-		ASSERT(cur_disk_byte != next_stripe_start);
 		/*
 		 * We have various limit on the real read size:
-		 * - stripe boundary
 		 * - page boundary
 		 * - compressed length boundary
 		 */
-		real_size = min_t(u64, U32_MAX, next_stripe_start - cur_disk_byte);
-		real_size = min_t(u64, real_size, PAGE_SIZE - offset_in_page(offset));
+		real_size = min_t(u64, U32_MAX, PAGE_SIZE - offset_in_page(offset));
 		real_size = min_t(u64, real_size, compressed_len - offset);
 		ASSERT(IS_ALIGNED(real_size, fs_info->sectorsize));
 
@@ -910,10 +875,6 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		ASSERT(added == real_size);
 		cur_disk_byte += added;
 
-		/* Reached stripe boundary, need to submit */
-		if (cur_disk_byte == next_stripe_start)
-			submit = true;
-
 		/* Has finished the range, need to submit */
 		if (cur_disk_byte == disk_bytenr + compressed_len)
 			submit = true;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 16/17] btrfs: remove the stripe boundary calculation for direct IO
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (14 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 15/17] btrfs: remove stripe boundary calculation for compressed IO Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-06  2:29 ` [PATCH v2 17/17] btrfs: unexport btrfs_get_io_geometry() Qu Wenruo
  2021-12-09 10:06 ` [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Johannes Thumshirn
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

In btrfs_submit_direct() we have a do {} while () loop to handle the bio
split caused by stripe boundaries.

Since btrfs_map_bio() can handle the split for us now, there is no need
to do it manually anymore.

Also, since we no longer split the bio here, the special RAID56 check is
gone too, so make btrfs_submit_dio_bio() follow the same rule as
btrfs_submit_data_bio() for async submit.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 113 ++++++++++-------------------------------------
 1 file changed, 24 insertions(+), 89 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 186304c69900..8ffec0fe6c4e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8222,22 +8222,16 @@ static void btrfs_end_dio_bio(struct bio *bio)
 }
 
 static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
-		struct inode *inode, u64 file_offset, int async_submit)
+		struct inode *inode, u64 file_offset)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_dio_private *dip = bio->bi_private;
 	bool write = btrfs_op(bio) == BTRFS_MAP_WRITE;
+	bool async_submit;
 	blk_status_t ret;
 
-	/*
-	 * Check btrfs_submit_data_bio() for rules about async submit.
-	 *
-	 * The only exception is for RAID56, when there are more than one bios
-	 * to submit, async submit seems to make it harder to collect csums
-	 * for the full stripe.
-	 */
-	if (async_submit)
-		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers);
+	/* Check btrfs_submit_data_bio() for rules about async submit. */
+	async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers);
 
 	if (!write)
 		btrfs_bio(bio)->endio_type = BTRFS_WQ_ENDIO_DATA;
@@ -8311,25 +8305,12 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 		struct bio *dio_bio, loff_t file_offset)
 {
 	struct inode *inode = iter->inode;
+	struct btrfs_dio_data *dio_data = iter->iomap.private;
 	const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE);
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	const bool raid56 = (btrfs_data_alloc_profile(fs_info) &
-			     BTRFS_BLOCK_GROUP_RAID56_MASK);
 	struct btrfs_dio_private *dip;
 	struct bio *bio;
 	const u32 length = dio_bio->bi_iter.bi_size;
-	u32 submitted_bytes = 0;
-	u64 start_sector;
-	int async_submit = 0;
-	u64 submit_len;
-	u64 clone_offset = 0;
-	u64 clone_len;
-	u64 logical;
-	int ret;
 	blk_status_t status;
-	struct btrfs_io_geometry geom;
-	struct btrfs_dio_data *dio_data = iter->iomap.private;
-	struct extent_map *em = NULL;
 
 	dip = btrfs_create_dio_private(dio_bio, inode, file_offset, length);
 	if (!dip) {
@@ -8353,80 +8334,34 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 			goto out_err;
 	}
 
-	start_sector = dio_bio->bi_iter.bi_sector;
-	submit_len = dio_bio->bi_iter.bi_size;
-
-	do {
-		logical = start_sector << 9;
-		em = btrfs_get_chunk_map(fs_info, logical, submit_len);
-		if (IS_ERR(em)) {
-			status = errno_to_blk_status(PTR_ERR(em));
-			em = NULL;
-			goto out_err_em;
-		}
-		ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio),
-					    logical, &geom);
-		if (ret) {
-			status = errno_to_blk_status(ret);
-			goto out_err_em;
-		}
-
-		clone_len = min(submit_len, geom.len);
-		ASSERT(clone_len <= UINT_MAX);
-
-		/*
-		 * This will never fail as it's passing GPF_NOFS and
-		 * the allocation is backed by btrfs_bioset.
-		 */
-		bio = btrfs_bio_clone_partial(dio_bio, clone_offset, clone_len);
-		bio->bi_private = dip;
-		bio->bi_end_io = btrfs_end_dio_bio;
-
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			status = extract_ordered_extent(BTRFS_I(inode), bio,
-							file_offset);
-			if (status) {
-				bio_put(bio);
-				goto out_err;
-			}
-		}
-
-		ASSERT(submit_len >= clone_len);
-		submit_len -= clone_len;
+	/*
+	 * This will never fail as it's passing GPF_NOFS and
+	 * the allocation is backed by btrfs_bioset.
+	 */
+	bio = btrfs_bio_clone(dio_bio);
+	bio->bi_private = dip;
+	bio->bi_end_io = btrfs_end_dio_bio;
 
-		if (submit_len > 0) {
-			/*
-			 * If we are submitting more than one bio, submit them
-			 * all asynchronously. The exception is RAID 5 or 6, as
-			 * asynchronous checksums make it difficult to collect
-			 * full stripe writes.
-			 */
-			if (!raid56)
-				async_submit = 1;
-		}
 
-		status = btrfs_submit_dio_bio(bio, inode, file_offset,
-						async_submit);
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		status = extract_ordered_extent(BTRFS_I(inode), bio,
+						file_offset);
 		if (status) {
 			bio_put(bio);
-			goto out_err_em;
+			goto out_err;
 		}
-
-		submitted_bytes += clone_len;
-		dio_data->submitted += clone_len;
-		clone_offset += clone_len;
-		start_sector += clone_len >> 9;
-		file_offset += clone_len;
-
-		free_extent_map(em);
-	} while (submit_len > 0);
+	}
+	status = btrfs_submit_dio_bio(bio, inode, file_offset);
+	if (status) {
+		bio_put(bio);
+		goto out_err;
+	}
+	dio_data->submitted += length;
 	return;
 
-out_err_em:
-	free_extent_map(em);
 out_err:
 	dip->dio_bio->bi_status = status;
-	dio_private_finish(dip, status, length - submitted_bytes);
+	dio_private_finish(dip, status, length);
 }
 
 const struct iomap_ops btrfs_dio_iomap_ops = {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 17/17] btrfs: unexport btrfs_get_io_geometry()
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (15 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 16/17] btrfs: remove the stripe boundary calculation for direct IO Qu Wenruo
@ 2021-12-06  2:29 ` Qu Wenruo
  2021-12-09 10:06 ` [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Johannes Thumshirn
  17 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-06  2:29 UTC (permalink / raw)
  To: linux-btrfs

This function provides a lighter-weight version of btrfs_map_block(),
just to provide enough info without doing all the work of
btrfs_map_block().

But it is only used for stripe boundary calculation, and now that the
stripe boundary calculation is handled entirely inside btrfs_map_bio(),
there is no need to export it anymore.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 8 ++++----
 fs/btrfs/volumes.h | 3 ---
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 301fc34320ed..61d281892449 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6320,9 +6320,9 @@ static bool need_full_stripe(enum btrfs_map_op op)
  * Returns < 0 in case a chunk for the given logical address cannot be found,
  * usually shouldn't happen unless @logical is corrupted, 0 otherwise.
  */
-int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *em,
-			  enum btrfs_map_op op, u64 logical,
-			  struct btrfs_io_geometry *io_geom)
+static int get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *em,
+			   enum btrfs_map_op op, u64 logical,
+			   struct btrfs_io_geometry *io_geom)
 {
 	struct map_lookup *map;
 	u64 len;
@@ -6434,7 +6434,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 	em = btrfs_get_chunk_map(fs_info, logical, *length);
 	ASSERT(!IS_ERR(em));
 
-	ret = btrfs_get_io_geometry(fs_info, em, op, logical, &geom);
+	ret = get_io_geometry(fs_info, em, op, logical, &geom);
 	if (ret < 0)
 		return ret;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 04c016a844f8..d5dbe7f946e0 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -561,9 +561,6 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		     u64 logical, u64 *length,
 		     struct btrfs_io_context **bioc_ret);
-int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *map,
-			  enum btrfs_map_op op, u64 logical,
-			  struct btrfs_io_geometry *io_geom);
 int btrfs_read_sys_array(struct btrfs_fs_info *fs_info);
 int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info);
 struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 06/17] btrfs: replace btrfs_dio_private::refs with btrfs_dio_private::pending_bytes
  2021-12-06  2:29 ` [PATCH v2 06/17] btrfs: replace btrfs_dio_private::refs with btrfs_dio_private::pending_bytes Qu Wenruo
@ 2021-12-09 10:02   ` Johannes Thumshirn
  2021-12-09 10:35     ` Qu Wenruo
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Thumshirn @ 2021-12-09 10:02 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 06/12/2021 03:30, Qu Wenruo wrote:
> This mostly follows the behavior of compressed_bio::pending_sectors.
> 
> The point here is, dip::refs is not split bio friendly, as if a bio with
> its bi_private = dip, and the bio get split, we can easily underflow
> dip::refs.
> 
> By using the same sector based solution as compressed_bio, dio can
> handle both unsplit and split bios.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>


JFYI, for this patch I get checkpatch complaints:

Applying: btrfs: replace btrfs_dio_private::refs with btrfs_dio_private::pending_bytes
.git/rebase-apply/patch:37: space before tab in indent.
                                     u32 bytes)
warning: 1 line adds whitespace errors.
ERROR:CODE_INDENT: code indent should use tabs where possible
#32: FILE: fs/btrfs/inode.c:7693:
+^I    ^I^I^I     u32 bytes)$

WARNING:SPACE_BEFORE_TAB: please, no space before tabs
#32: FILE: fs/btrfs/inode.c:7693:
+^I    ^I^I^I     u32 bytes)$

total: 1 errors, 1 warnings, 180 lines checked


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time
  2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
                   ` (16 preceding siblings ...)
  2021-12-06  2:29 ` [PATCH v2 17/17] btrfs: unexport btrfs_get_io_geometry() Qu Wenruo
@ 2021-12-09 10:06 ` Johannes Thumshirn
  2021-12-09 10:52   ` Johannes Thumshirn
  17 siblings, 1 reply; 26+ messages in thread
From: Johannes Thumshirn @ 2021-12-09 10:06 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 06/12/2021 03:30, Qu Wenruo wrote:
> This patchset be fetched from this branch:
> 
> https://github.com/adam900710/linux/tree/refactor_chunk_map
> 
> [BACKGROUND]
> 
> Currently btrfs never uses bio_split() to split its bio against RAID
> stripe boundaries.
> 
> Instead inside btrfs we check our stripe boundary everytime we allocate
> a new bio, and ensure the new bio never cross stripe boundaries.
> 
> [PROBLEMS]
> 
> Although this behavior works fine, it's against the common practice used in
> stacked drivers, and is making the effort to convert to iomap harder.
> 
> There is also an hidden burden, every time we allocate a new bio, we uses
> BIO_MAX_BVECS, but since we know the boundaries, for RAID0/RAID10 we can
> only fit at most 16 pages (fixed 64K stripe size, and 4K page size),
> wasting the 256 slots we allocated.
> 
> [CHALLENGES]
> 
> To change the situation, this patchset attempts to improve the situation
> by moving the bio split into btrfs_map_bio() time, so upper layer should
> no longer bother the bio split against RAID stripes or even chunk
> boundaries.
> 
> But there are several challenges:
> 
> - Conflicts in various endio functions
>   We want the existing granularity, instead of chained endio, thus we
>   must make the involved endio functions to handle split bios.
> 
>   Although most endio functions are already doing their works
>   independent of the bio size, they are not yet fully handling split
>   bio.
> 
>   This patch will convert them to use saved bi_iter and only iterate
>   the split range instead of the whole bio.
>   This change involved 3 types of IOs:
> 
>   * Buffered IO
>     Including both data and metadata
>   * Direct IO
>   * Compressed IO
> 
>   Their endio functions needs different level of updates to handle split
>   bios.
> 
>   Furthermore, there is another endio, end_workqueue_bio(), it can't
>   handle split bios at all, thus we change the timing so that
>   btrfs_bio_wq_end_io() is only called after the bio being split.
> 
> - Checksum verification
>   Currently we rely on btrfs_bio::csum to contain the checksum for the
>   whole bio.
>   If one bio get split, csum will no longer points to the correct
>   location for the split bio.
> 
>   This can be solved by introducing btrfs_bio::offset_to_original, and
>   use that new member to calculate where we should read csum from.
> 
>   For the parent bio, it still has btrfs_bio::csum for the whole bio,
>   thus it can still free it correctly.
> 
> - Independent endio for each split bio
>   Unlike stack drivers, for RAID10 btrfs needs to try its best effort to
>   read every sectors, to handle the following case: (X means bad, either
>   unable to read or failed to pass checksum verification, V means good)
> 
>   Dev 1	(missing) | D1 (X) |
>   Dev 2 (OK)	  | D1 (V) |
>   Dev 3 (OK)	  | D2 (V) |
>   Dev 4 (OK)	  | D2 (X) |
> 
>   In the above RAID10 case, dev1 is missing, and although dev4 is fine,
>   its D2 sector is corrupted (by bit rot or whatever).
> 
>   If we use bio_chain(), read bio for both D1 and D2 will be split, and
>   since D1 is missing, the whole D1 and D2 read will be marked as error,
>   thus we will try to read from dev2 and dev4.
> 
>   But D2 in dev4 has csum mismatch, we can only rebuild D1 and D2
>   correctly from dev2:D1 and dev3:D2.
> 
>   This patchset resolve this by saving bi_iter into btrfs_bio::iter, and
>   uses that at endio to iterate only the split part of an bio.
>   Other than this, existing read/write page endio functions can handle
>   them properly without problem.
> 
> - Bad RAID56 naming/functionality
>   There are quite some RAID56 call sites relies on specific behavior on
>   __btrfs_map_block(), like returning @map_length as stripe_len other
>   than real mapped length.
> 
>   This is handled by some small cleanups specific for RAID56.
> 
> [CHANGELOG]
> RFC->v1:
> - Better patch split
>   Now patch 01~06 are refactors/cleanups/preparations.
>   While 07~13 are the patches that doing the conversion while can handle
>   both old and new bio split timings.
>   Finally patch 14~16 convert the bio split call sites one by one to
>   newer facility.
>   The final patch is just a small clean up.
> 
> - Various bug fixes
>   During the full fstests run, various stupid bugs are exposed and
>   fixed.
> 
> v2:
> - Fix the error paths for allocated but never submitted bios
>   There are tons of error path that we allocate a bio but it goes
>   bio_endio() directly without going through btrfs_map_bio().
>   New ASSERTS() in endio functions require a populated btrfs_bio::iter,
>   thus for such bios we still need to call btrfs_bio_save_iter() to
>   populate btrfs_bio::iter to prevent such ASSERT()s get triggered.
> 
> - Fix scrub_stripe_index_and_offset() which abuses stripe_len and
>   mapped_length
> 


FYI the patchset doesn't apply cleanly to misc-next anymore. I've
pulled your branch from github and queued it for testing on zoned
devices.

I'll report any findings.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 06/17] btrfs: replace btrfs_dio_private::refs with btrfs_dio_private::pending_bytes
  2021-12-09 10:02   ` Johannes Thumshirn
@ 2021-12-09 10:35     ` Qu Wenruo
  0 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2021-12-09 10:35 UTC (permalink / raw)
  To: Johannes Thumshirn, Qu Wenruo, linux-btrfs



On 2021/12/9 18:02, Johannes Thumshirn wrote:
> On 06/12/2021 03:30, Qu Wenruo wrote:
>> This mostly follows the behavior of compressed_bio::pending_sectors.
>>
>> The point here is, dip::refs is not split bio friendly, as if a bio with
>> its bi_private = dip, and the bio get split, we can easily underflow
>> dip::refs.
>>
>> By using the same sector based solution as compressed_bio, dio can
>> handle both unsplit and split bios.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
>
> JFYI, for this patch I get checkpatch complaints:
>
> Applying: btrfs: replace btrfs_dio_private::refs with btrfs_dio_private::pending_bytes
> .git/rebase-apply/patch:37: space before tab in indent.
>                                       u32 bytes)
> warning: 1 line adds whitespace errors.
> ERROR:CODE_INDENT: code indent should use tabs where possible
> #32: FILE: fs/btrfs/inode.c:7693:
> +^I    ^I^I^I     u32 bytes)$
>
> WARNING:SPACE_BEFORE_TAB: please, no space before tabs
> #32: FILE: fs/btrfs/inode.c:7693:
> +^I    ^I^I^I     u32 bytes)$
>
> total: 1 errors, 1 warnings, 180 lines checked
>
Oh, I should add checkpatch to my git hooks...

Thanks for catching this, will rebase the branch to misc-next with this
fixed.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time
  2021-12-09 10:06 ` [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Johannes Thumshirn
@ 2021-12-09 10:52   ` Johannes Thumshirn
  2021-12-09 11:08     ` Qu Wenruo
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Thumshirn @ 2021-12-09 10:52 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 09/12/2021 11:07, Johannes Thumshirn wrote:
> 
> 
> FYI the patchset doesn't apply cleanly to misc-next anymore. I've
> pulled your branch from github and queued it for testing on zoned
> devices.
> 
> I'll report any findings.
> 

Unfortunately I do have something to report:

generic/068     [ 2020.934379] BTRFS critical (device nullb1): corrupt leaf: root=5 block=4339220480 slot=64 ino=2431 file_offset=962560, invalid disk_bytenr for file extent, have 5100404224, should be aligned to 4096
[ 2020.938165] BTRFS: error (device nullb1) in btrfs_commit_transaction:2310: errno=-5 IO failure (Error while writing out transaction)
[ 2020.938688] BTRFS: error (device nullb1) in btrfs_finish_ordered_io:3110: errno=-5 IO failure
[ 2020.939982] BTRFS: error (device nullb1) in cleanup_transaction:1913: errno=-5 IO failure                                          
[ 2020.941938] kernel BUG at fs/btrfs/ctree.h:3516!                                        
[ 2020.942344] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI                                                                                                                            
[ 2020.942802] CPU: 1 PID: 26201 Comm: fstest Tainted: G        W         5.16.0-rc3-qu-bio-split #30
[ 2020.943043] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092880384, 5092884480)                          
[ 2020.943576] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
[ 2020.944424] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092884480, 5092888576)
[ 2020.945191] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]                                                                                                                     
[ 2020.946076] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092888576, 5092892672)                          
[ 2020.945833] Code: e8 ea 1a a0 48 c7 c7 20 eb 1a a0 e8 1b a7 43 e1 0f 0b 89 f1 48 c7 c2 9c 9a 1a a0 48 89 fe 48 c7 c7 48 eb 1a a0 e8 01 a7 43 e1 <0f> 0b be 57 00 00 00 48 c7 c7 70 eb 1a a0 e8 d5 ff ff ff be 73 00
[ 2020.947374] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092892672, 5092896768)
[ 2020.945833] RSP: 0018:ffffc90004c8b890 EFLAGS: 00010296                                                                                                                            
[ 2020.949774] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092896768, 5092900864)
[ 2020.945833] RAX: 0000000000000071 RBX: 0000000000001000 RCX: 0000000000000000                                                                                                      
[ 2020.945833] RDX: 0000000000000001 RSI: 00000000ffffffea RDI: 00000000ffffffff                                                                                                      
[ 2020.945833] RBP: ffff888139612700 R08: ffffffff81ca4a40 R09: 00000000ffffff74                                                                                                      
[ 2020.945833] R10: ffffffff81c35760 R11: ffffffff81c35760 R12: 000000013003fe00                                                                                                      
[ 2020.945833] R13: 0000000000000001 R14: ffff888101a8c000 R15: 0000000000000004                                                                                                      
[ 2020.945833] FS:  00007f04d5acf740(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000                                                                                           
[ 2020.945833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                                                                                                                      
[ 2020.945833] CR2: 0000000000d7e008 CR3: 00000001bb7f0000 CR4: 00000000000006a0                                                                                                      
[ 2020.945833] Call Trace:
[ 2020.945833]  <TASK>
[ 2020.945833]  btrfs_lookup_bio_sums.cold+0x3f/0x61 [btrfs]
[ 2020.951120]  ? kmem_cache_alloc+0x100/0x1d0
[ 2020.951120]  ? mempool_alloc+0x4d/0x150
[ 2020.951120]  btrfs_submit_data_bio+0xeb/0x200 [btrfs]
[ 2020.951120]  ? bio_alloc_bioset+0x228/0x300
[ 2020.951120]  submit_one_bio+0x60/0x90 [btrfs]
[ 2020.951120]  submit_extent_page+0x175/0x460 [btrfs]
[ 2020.951120]  btrfs_do_readpage+0x263/0x800 [btrfs]
[ 2020.951120]  ? btrfs_repair_one_sector+0x450/0x450 [btrfs]
[ 2020.951120]  extent_readahead+0x296/0x380 [btrfs]
[ 2020.951120]  ? __mod_node_page_state+0x77/0xb0
[ 2020.951120]  ? __filemap_add_folio+0x115/0x190
[ 2020.951120]  read_pages+0x57/0x1a0
[ 2020.951120]  page_cache_ra_unbounded+0x15c/0x1e0
[ 2020.951120]  filemap_get_pages+0xcf/0x640
[ 2020.951120]  ? terminate_walk+0x5c/0xf0
[ 2020.951120]  filemap_read+0xb9/0x2a0
[ 2020.951120]  ? arch_stack_walk+0x77/0xb0
[ 2020.951120]  ? do_filp_open+0x9a/0x120
[ 2020.951120]  new_sync_read+0x103/0x170
[ 2020.951120]  ? 0xffffffff81000000
[ 2020.951120]  vfs_read+0x121/0x1a0
[ 2020.951120]  __x64_sys_pread64+0x69/0xa0
[ 2020.951120]  do_syscall_64+0x43/0x90
[ 2020.951120]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2020.951120] RIP: 0033:0x7f04d5cce1aa
[ 2020.951120] Code: d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 2020.951120] RSP: 002b:00007fff31411768 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[ 2020.951120] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04d5cce1aa
[ 2020.951120] RDX: 0000000000000400 RSI: 0000000000fab2b0 RDI: 0000000000000003
[ 2020.951120] RBP: 0000000000fab2b0 R08: 0000000000000000 R09: 00007fff31411507
[ 2020.951120] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400
[ 2020.951120] R13: 0000000000000000 R14: 0000000000000400 R15: 0000000000fab6c0
[ 2020.951120]  </TASK>
[ 2020.951120] Modules linked in: dm_flakey loop btrfs blake2b_generic xor lzo_compress zlib_deflate raid6_pq zstd_decompress zstd_compress xxhash null_blk
[ 2020.972836] ---[ end trace ceb9e45abcff5d95 ]---
[ 2020.973244] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
[ 2020.973773] Code: e8 ea 1a a0 48 c7 c7 20 eb 1a a0 e8 1b a7 43 e1 0f 0b 89 f1 48 c7 c2 9c 9a 1a a0 48 89 fe 48 c7 c7 48 eb 1a a0 e8 01 a7 43 e1 <0f> 0b be 57 00 00 00 48 c7 c7 70 eb 1a a0 e8 d5 ff ff ff be 73 00
[ 2020.973854] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 770048 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.975411] RSP: 0018:ffffc90004c8b890 EFLAGS: 00010296
[ 2020.976383] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 1, gen 0
[ 2020.976841] RAX: 0000000000000071 RBX: 0000000000001000 RCX: 0000000000000000
[ 2020.977656] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 774144 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.978272] RDX: 0000000000000001 RSI: 00000000ffffffea RDI: 00000000ffffffff
[ 2020.979247] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 2, gen 0
[ 2020.979853] RBP: ffff888139612700 R08: ffffffff81ca4a40 R09: 00000000ffffff74
[ 2020.980629] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 778240 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.981224] R10: ffffffff81c35760 R11: ffffffff81c35760 R12: 000000013003fe00
[ 2020.982211] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 3, gen 0
[ 2020.982835] R13: 0000000000000001 R14: ffff888101a8c000 R15: 0000000000000004
[ 2020.983624] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 782336 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.984226] FS:  00007f04d5acf740(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
[ 2020.985216] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 4, gen 0
[ 2020.985915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2020.986683] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 786432 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.987157] CR2: 0000000000d7e008 CR3: 00000001bb7f0000 CR4: 00000000000006a0
[ 2020.988150] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 5, gen 0
[ 2020.989019] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092880384, 5092884480)
[ 2020.990401] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 770048 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
[ 2020.991403] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 6, gen 0
[ 2020.992443] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090144256, 5090148352)
[ 2020.993277] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090148352, 5090152448)
[ 2020.994148] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090152448, 5090156544)
[ 2020.994989] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090156544, 5090160640)
[ 2020.996065] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2179072 csum 0xa9788697 expected csum 0x00000000 mirror 1
[ 2020.996156] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2183168 csum 0xa9788697 expected csum 0x00000000 mirror 1
[ 2020.997107] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 7, gen 0
[ 2020.998140] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 8, gen 0
[ 2020.999004] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2187264 csum 0xa9788697 expected csum 0x00000000 mirror 1
[ 2020.999793] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2191360 csum 0xa9788697 expected csum 0x00000000 mirror 1
[ 2021.000815] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 9, gen 0

Resolving btrfs_lookup_bio_sums.cold+0x3f doesn't make any sense though:
(gdb) l *btrfs_lookup_bio_sums+0x3f
0x1767f is in btrfs_lookup_bio_sums (fs/btrfs/file-item.c:372).
367     blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst)
368     {
369             struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
370             struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
371             struct btrfs_path *path;
372             const u32 sectorsize = fs_info->sectorsize;
373             const u32 csum_size = fs_info->csum_size;
374             u32 orig_len = bio->bi_iter.bi_size;
375             u64 orig_disk_bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
376             u64 cur_disk_bytenr;
(gdb)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time
  2021-12-09 10:52   ` Johannes Thumshirn
@ 2021-12-09 11:08     ` Qu Wenruo
  2021-12-09 11:13       ` Johannes Thumshirn
  0 siblings, 1 reply; 26+ messages in thread
From: Qu Wenruo @ 2021-12-09 11:08 UTC (permalink / raw)
  To: Johannes Thumshirn, Qu Wenruo, linux-btrfs



On 2021/12/9 18:52, Johannes Thumshirn wrote:
> On 09/12/2021 11:07, Johannes Thumshirn wrote:
>>
>>
>> FYI the patchset doesn't apply cleanly to misc-next anymore. I've
>> pulled your branch from github and queued it for testing on zoned
>> devices.
>>
>> I'll report any findings.
>>
>
> Unfortunately I do have something to report:
>
> generic/068     [ 2020.934379] BTRFS critical (device nullb1): corrupt leaf: root=5 block=4339220480 slot=64 ino=2431 file_offset=962560, invalid disk_bytenr for file extent, have 5100404224, should be aligned to 4096

No more error messages after this line?

I thought it should be either a write-time or a read-time tree-checker
error, but I can't see the message indicating the timing.

And yes, that disk_bytenr is indeed not aligned.

> [ 2020.938165] BTRFS: error (device nullb1) in btrfs_commit_transaction:2310: errno=-5 IO failure (Error while writing out transaction)
> [ 2020.938688] BTRFS: error (device nullb1) in btrfs_finish_ordered_io:3110: errno=-5 IO failure
> [ 2020.939982] BTRFS: error (device nullb1) in cleanup_transaction:1913: errno=-5 IO failure
> [ 2020.941938] kernel BUG at fs/btrfs/ctree.h:3516!

And this is the weirdest part: it's from assertfail(), but there is no
line showing the line number.

Mind providing the full dmesg?
I guess some important lines are missing.

> [ 2020.942344] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 2020.942802] CPU: 1 PID: 26201 Comm: fstest Tainted: G        W         5.16.0-rc3-qu-bio-split #30
> [ 2020.943043] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092880384, 5092884480)

This is from btrfs_lookup_bio_sums(), meaning we're submitting a read
while the range doesn't have a csum, and it's not a NODATASUM inode.

Not sure if it's related to the transaction commit error.

Thanks,
Qu
> [ 2020.943576] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
> [ 2020.944424] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092884480, 5092888576)
> [ 2020.945191] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
> [ 2020.946076] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092888576, 5092892672)
> [ 2020.945833] Code: e8 ea 1a a0 48 c7 c7 20 eb 1a a0 e8 1b a7 43 e1 0f 0b 89 f1 48 c7 c2 9c 9a 1a a0 48 89 fe 48 c7 c7 48 eb 1a a0 e8 01 a7 43 e1 <0f> 0b be 57 00 00 00 48 c7 c7 70 eb 1a a0 e8 d5 ff ff ff be 73 00
> [ 2020.947374] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092892672, 5092896768)
> [ 2020.945833] RSP: 0018:ffffc90004c8b890 EFLAGS: 00010296
> [ 2020.949774] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092896768, 5092900864)
> [ 2020.945833] RAX: 0000000000000071 RBX: 0000000000001000 RCX: 0000000000000000
> [ 2020.945833] RDX: 0000000000000001 RSI: 00000000ffffffea RDI: 00000000ffffffff
> [ 2020.945833] RBP: ffff888139612700 R08: ffffffff81ca4a40 R09: 00000000ffffff74
> [ 2020.945833] R10: ffffffff81c35760 R11: ffffffff81c35760 R12: 000000013003fe00
> [ 2020.945833] R13: 0000000000000001 R14: ffff888101a8c000 R15: 0000000000000004
> [ 2020.945833] FS:  00007f04d5acf740(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
> [ 2020.945833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2020.945833] CR2: 0000000000d7e008 CR3: 00000001bb7f0000 CR4: 00000000000006a0
> [ 2020.945833] Call Trace:
> [ 2020.945833]  <TASK>
> [ 2020.945833]  btrfs_lookup_bio_sums.cold+0x3f/0x61 [btrfs]
> [ 2020.951120]  ? kmem_cache_alloc+0x100/0x1d0
> [ 2020.951120]  ? mempool_alloc+0x4d/0x150
> [ 2020.951120]  btrfs_submit_data_bio+0xeb/0x200 [btrfs]
> [ 2020.951120]  ? bio_alloc_bioset+0x228/0x300
> [ 2020.951120]  submit_one_bio+0x60/0x90 [btrfs]
> [ 2020.951120]  submit_extent_page+0x175/0x460 [btrfs]
> [ 2020.951120]  btrfs_do_readpage+0x263/0x800 [btrfs]
> [ 2020.951120]  ? btrfs_repair_one_sector+0x450/0x450 [btrfs]
> [ 2020.951120]  extent_readahead+0x296/0x380 [btrfs]
> [ 2020.951120]  ? __mod_node_page_state+0x77/0xb0
> [ 2020.951120]  ? __filemap_add_folio+0x115/0x190
> [ 2020.951120]  read_pages+0x57/0x1a0
> [ 2020.951120]  page_cache_ra_unbounded+0x15c/0x1e0
> [ 2020.951120]  filemap_get_pages+0xcf/0x640
> [ 2020.951120]  ? terminate_walk+0x5c/0xf0
> [ 2020.951120]  filemap_read+0xb9/0x2a0
> [ 2020.951120]  ? arch_stack_walk+0x77/0xb0
> [ 2020.951120]  ? do_filp_open+0x9a/0x120
> [ 2020.951120]  new_sync_read+0x103/0x170
> [ 2020.951120]  ? 0xffffffff81000000
> [ 2020.951120]  vfs_read+0x121/0x1a0
> [ 2020.951120]  __x64_sys_pread64+0x69/0xa0
> [ 2020.951120]  do_syscall_64+0x43/0x90
> [ 2020.951120]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 2020.951120] RIP: 0033:0x7f04d5cce1aa
> [ 2020.951120] Code: d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> [ 2020.951120] RSP: 002b:00007fff31411768 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> [ 2020.951120] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04d5cce1aa
> [ 2020.951120] RDX: 0000000000000400 RSI: 0000000000fab2b0 RDI: 0000000000000003
> [ 2020.951120] RBP: 0000000000fab2b0 R08: 0000000000000000 R09: 00007fff31411507
> [ 2020.951120] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400
> [ 2020.951120] R13: 0000000000000000 R14: 0000000000000400 R15: 0000000000fab6c0
> [ 2020.951120]  </TASK>
> [ 2020.951120] Modules linked in: dm_flakey loop btrfs blake2b_generic xor lzo_compress zlib_deflate raid6_pq zstd_decompress zstd_compress xxhash null_blk
> [ 2020.972836] ---[ end trace ceb9e45abcff5d95 ]---
> [ 2020.973244] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
> [ 2020.973773] Code: e8 ea 1a a0 48 c7 c7 20 eb 1a a0 e8 1b a7 43 e1 0f 0b 89 f1 48 c7 c2 9c 9a 1a a0 48 89 fe 48 c7 c7 48 eb 1a a0 e8 01 a7 43 e1 <0f> 0b be 57 00 00 00 48 c7 c7 70 eb 1a a0 e8 d5 ff ff ff be 73 00
> [ 2020.973854] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 770048 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.975411] RSP: 0018:ffffc90004c8b890 EFLAGS: 00010296
> [ 2020.976383] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 1, gen 0
> [ 2020.976841] RAX: 0000000000000071 RBX: 0000000000001000 RCX: 0000000000000000
> [ 2020.977656] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 774144 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.978272] RDX: 0000000000000001 RSI: 00000000ffffffea RDI: 00000000ffffffff
> [ 2020.979247] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 2, gen 0
> [ 2020.979853] RBP: ffff888139612700 R08: ffffffff81ca4a40 R09: 00000000ffffff74
> [ 2020.980629] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 778240 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.981224] R10: ffffffff81c35760 R11: ffffffff81c35760 R12: 000000013003fe00
> [ 2020.982211] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 3, gen 0
> [ 2020.982835] R13: 0000000000000001 R14: ffff888101a8c000 R15: 0000000000000004
> [ 2020.983624] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 782336 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.984226] FS:  00007f04d5acf740(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
> [ 2020.985216] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 4, gen 0
> [ 2020.985915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2020.986683] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 786432 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.987157] CR2: 0000000000d7e008 CR3: 00000001bb7f0000 CR4: 00000000000006a0
> [ 2020.988150] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 5, gen 0
> [ 2020.989019] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5092880384, 5092884480)
> [ 2020.990401] BTRFS warning (device nullb1): csum failed root 5 ino 2411 off 770048 csum 0x0203b7e3 expected csum 0x00000000 mirror 1
> [ 2020.991403] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 6, gen 0
> [ 2020.992443] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090144256, 5090148352)
> [ 2020.993277] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090148352, 5090152448)
> [ 2020.994148] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090152448, 5090156544)
> [ 2020.994989] BTRFS warning (device nullb1): csum hole found for disk bytenr range [5090156544, 5090160640)
> [ 2020.996065] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2179072 csum 0xa9788697 expected csum 0x00000000 mirror 1
> [ 2020.996156] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2183168 csum 0xa9788697 expected csum 0x00000000 mirror 1
> [ 2020.997107] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 7, gen 0
> [ 2020.998140] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 8, gen 0
> [ 2020.999004] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2187264 csum 0xa9788697 expected csum 0x00000000 mirror 1
> [ 2020.999793] BTRFS warning (device nullb1): csum failed root 5 ino 2427 off 2191360 csum 0xa9788697 expected csum 0x00000000 mirror 1
> [ 2021.000815] BTRFS error (device nullb1): bdev /dev/nullb1 errs: wr 1, rd 0, flush 0, corrupt 9, gen 0
>
> Resolving btrfs_lookup_bio_sums.cold+0x3f doesn't make any sense though:
> (gdb) l *btrfs_lookup_bio_sums+0x3f
> 0x1767f is in btrfs_lookup_bio_sums (fs/btrfs/file-item.c:372).
> 367     blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst)
> 368     {
> 369             struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> 370             struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> 371             struct btrfs_path *path;
> 372             const u32 sectorsize = fs_info->sectorsize;
> 373             const u32 csum_size = fs_info->csum_size;
> 374             u32 orig_len = bio->bi_iter.bi_size;
> 375             u64 orig_disk_bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> 376             u64 cur_disk_bytenr;
> (gdb)
>


* Re: [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time
  2021-12-09 11:08     ` Qu Wenruo
@ 2021-12-09 11:13       ` Johannes Thumshirn
  2022-01-12  0:33         ` Qu Wenruo
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Thumshirn @ 2021-12-09 11:13 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 09/12/2021 12:08, Qu Wenruo wrote:
> 
> 
> On 2021/12/9 18:52, Johannes Thumshirn wrote:
>> On 09/12/2021 11:07, Johannes Thumshirn wrote:
>>>
>>>
>>> FYI the patchset doesn't apply cleanly to misc-next anymore. I've
>>> pulled your branch from github and queued it for testing on zoned
>>> devices.
>>>
>>> I'll report any findings.
>>>
>>
>> Unfortunately I do have something to report:
>>
>> generic/068     [ 2020.934379] BTRFS critical (device nullb1): corrupt leaf: root=5 block=4339220480 slot=64 ino=2431 file_offset=962560, invalid disk_bytenr for file extent, have 5100404224, should be aligned to 4096
> 
> No more error message after this line?
> 
> I thought it should be either a write-time or a read-time tree-checker
> error, but I can't see the message indicating the timing.
> 
> And yes, that disk_bytenr is indeed not aligned.
> 
>> [ 2020.938165] BTRFS: error (device nullb1) in btrfs_commit_transaction:2310: errno=-5 IO failure (Error while writing out transaction)
>> [ 2020.938688] BTRFS: error (device nullb1) in btrfs_finish_ordered_io:3110: errno=-5 IO failure
>> [ 2020.939982] BTRFS: error (device nullb1) in cleanup_transaction:1913: errno=-5 IO failure
>> [ 2020.941938] kernel BUG at fs/btrfs/ctree.h:3516!
> 
> And this is the weirdest part: it's from assertfail(), but there is no
> line showing the line number.
> 
> Mind providing the full dmesg?
> I guess some important lines are not included.
> 

Let me see if I can reproduce it with a higher logging level


* Re: [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time
  2021-12-09 11:13       ` Johannes Thumshirn
@ 2022-01-12  0:33         ` Qu Wenruo
  2022-01-12  9:00           ` Johannes Thumshirn
  0 siblings, 1 reply; 26+ messages in thread
From: Qu Wenruo @ 2022-01-12  0:33 UTC (permalink / raw)
  To: Johannes Thumshirn, Qu Wenruo, linux-btrfs



On 2021/12/9 19:13, Johannes Thumshirn wrote:
> On 09/12/2021 12:08, Qu Wenruo wrote:
>>
>>
>> On 2021/12/9 18:52, Johannes Thumshirn wrote:
>>> On 09/12/2021 11:07, Johannes Thumshirn wrote:
>>>>
>>>>
>>>> FYI the patchset doesn't apply cleanly to misc-next anymore. I've
>>>> pulled your branch from github and queued it for testing on zoned
>>>> devices.
>>>>
>>>> I'll report any findings.
>>>>
>>>
>>> Unfortunately I do have something to report:
>>>
>>> generic/068     [ 2020.934379] BTRFS critical (device nullb1): corrupt leaf: root=5 block=4339220480 slot=64 ino=2431 file_offset=962560, invalid disk_bytenr for file extent, have 5100404224, should be aligned to 4096
>>
>> No more error message after this line?
>>
>> I thought it should be either a write-time or a read-time tree-checker
>> error, but I can't see the message indicating the timing.
>>
>> And yes, that disk_bytenr is indeed not aligned.
>>
>>> [ 2020.938165] BTRFS: error (device nullb1) in btrfs_commit_transaction:2310: errno=-5 IO failure (Error while writing out transaction)
>>> [ 2020.938688] BTRFS: error (device nullb1) in btrfs_finish_ordered_io:3110: errno=-5 IO failure
>>> [ 2020.939982] BTRFS: error (device nullb1) in cleanup_transaction:1913: errno=-5 IO failure
>>> [ 2020.941938] kernel BUG at fs/btrfs/ctree.h:3516!
>>
>> And this is the weirdest part: it's from assertfail(), but there is no
>> line showing the line number.
>>
>> Mind providing the full dmesg?
>> I guess some important lines are not included.
>>
>
> Let me see if I can reproduce it with a higher logging level

Any follow-up on the crash?

I doubt it's related to the patchset, but I'm also curious why the
tree-checker is triggered for a test without any dm-flakey/dm-dust/etc.

Thanks,
Qu


* Re: [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time
  2022-01-12  0:33         ` Qu Wenruo
@ 2022-01-12  9:00           ` Johannes Thumshirn
  0 siblings, 0 replies; 26+ messages in thread
From: Johannes Thumshirn @ 2022-01-12  9:00 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs




end of thread, other threads:[~2022-01-12  9:00 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-06  2:29 [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 01/17] btrfs: update an stale comment on btrfs_submit_bio_hook() Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 02/17] btrfs: save bio::bi_iter into btrfs_bio::iter before any endio Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 03/17] btrfs: use correct bio size for error message in btrfs_end_dio_bio() Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 04/17] btrfs: refactor btrfs_map_bio() Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 05/17] btrfs: move btrfs_bio_wq_end_io() calls into submit_stripe_bio() Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 06/17] btrfs: replace btrfs_dio_private::refs with btrfs_dio_private::pending_bytes Qu Wenruo
2021-12-09 10:02   ` Johannes Thumshirn
2021-12-09 10:35     ` Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 07/17] btrfs: introduce btrfs_bio_split() helper Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 08/17] btrfs: make data buffered read path to handle split bio properly Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 09/17] btrfs: make data buffered write endio function to be split bio compatible Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 10/17] btrfs: make metadata write endio functions " Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 11/17] btrfs: make dec_and_test_compressed_bio() " Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 12/17] btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block() Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 13/17] btrfs: allow btrfs_map_bio() to split bio according to chunk stripe boundaries Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 14/17] btrfs: remove buffered IO stripe boundary calculation Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 15/17] btrfs: remove stripe boundary calculation for compressed IO Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 16/17] btrfs: remove the stripe boundary calculation for direct IO Qu Wenruo
2021-12-06  2:29 ` [PATCH v2 17/17] btrfs: unexport btrfs_get_io_geometry() Qu Wenruo
2021-12-09 10:06 ` [PATCH v2 00/17] btrfs: split bio at btrfs_map_bio() time Johannes Thumshirn
2021-12-09 10:52   ` Johannes Thumshirn
2021-12-09 11:08     ` Qu Wenruo
2021-12-09 11:13       ` Johannes Thumshirn
2022-01-12  0:33         ` Qu Wenruo
2022-01-12  9:00           ` Johannes Thumshirn
