* consolidate btrfs checksumming, repair and bio splitting
@ 2022-09-01  7:41 Christoph Hellwig
  2022-09-01  7:42 ` [PATCH 01/17] block: export bio_split_rw Christoph Hellwig
                   ` (19 more replies)
  0 siblings, 20 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:41 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

Hi all,

this series moves a large amount of duplicate code below btrfs_submit_bio
into what I call the 'storage' layer.  Instead of duplicating the code to
checksum, verify checksums, repair and split bios in all the callers
of btrfs_submit_bio (buffered I/O, direct I/O, compressed I/O, encoded
I/O), the work is done once in a central place, often more optimally and
without subtle differences in behavior.  Once that is done the upper
layers also no longer need to split bios at extent boundaries, as the
storage layer can do that itself, including splitting the bios to the
zone append limits of zoned I/O.
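
As a caller-side illustration of that consolidation (a hedged sketch,
not code from this series -- submit_uncompressed_read is a made-up
name, only btrfs_submit_bio matches the real interface), a read
submitter shrinks to:

    static void submit_uncompressed_read(struct btrfs_fs_info *fs_info,
                                         struct bio *bio, int mirror_num)
    {
            /*
             * No per-caller csum lookup, read repair bookkeeping or
             * splitting at extent / zone append boundaries anymore;
             * the storage layer below btrfs_submit_bio does all of it.
             */
            btrfs_submit_bio(fs_info, bio, mirror_num);
    }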

The split work is inspired by an earlier series from Qu, from which it
also reuses a few patches.

Note: this adds a fair amount of code to volumes.c, which already is
quite large.  It might make sense to add a prep patch to move
btrfs_submit_bio into a new bio.c file, but I only want to do that
if we have agreement on the move as the conflicts will be painful
when rebasing.

A git tree is also available:

    git://git.infradead.org/users/hch/misc.git btrfs-bio-split

Gitweb:

    http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-bio-split

Diffstat:
 block/blk-merge.c                |    3 
 fs/btrfs/btrfs_inode.h           |    5 
 fs/btrfs/compression.c           |  273 ++------------
 fs/btrfs/compression.h           |    3 
 fs/btrfs/ctree.h                 |   24 -
 fs/btrfs/disk-io.c               |  198 +---------
 fs/btrfs/disk-io.h               |    6 
 fs/btrfs/extent-io-tree.h        |   19 
 fs/btrfs/extent_io.c             |  739 ++------------------------------------
 fs/btrfs/extent_io.h             |   32 -
 fs/btrfs/file-item.c             |   67 +--
 fs/btrfs/inode.c                 |  664 ++++------------------------------
 fs/btrfs/ordered-data.h          |    1 
 fs/btrfs/tests/extent-io-tests.c |    1 
 fs/btrfs/volumes.c               |  753 +++++++++++++++++++++++++++++++--------
 fs/btrfs/volumes.h               |   83 +---
 fs/btrfs/zoned.c                 |   69 +--
 fs/btrfs/zoned.h                 |   16 
 fs/iomap/direct-io.c             |   10 
 include/linux/bio.h              |    4 
 include/linux/iomap.h            |    1 
 include/trace/events/btrfs.h     |    2 
 22 files changed, 924 insertions(+), 2049 deletions(-)


* [PATCH 01/17] block: export bio_split_rw
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01  8:02   ` Johannes Thumshirn
                     ` (2 more replies)
  2022-09-01  7:42 ` [PATCH 02/17] btrfs: stop tracking failed reads in the I/O tree Christoph Hellwig
                   ` (18 subsequent siblings)
  19 siblings, 3 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

bio_split_rw can be used by file systems to split an incoming write
bio into multiple bios that fit the hardware limits for use as
ZONE_APPEND bios.  Export it for initial use in btrfs.
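
As a hedged sketch of the intended use (not code from this series;
submit_zone_append is a made-up helper, and real zone append submission
needs more setup than shown here), a file system could split an
incoming write bio like this:

    static void submit_zone_append(struct bio *bio, struct bio_set *bs)
    {
            struct request_queue *q = bdev_get_queue(bio->bi_bdev);
            unsigned int max_bytes =
                    queue_max_zone_append_sectors(q) << SECTOR_SHIFT;
            unsigned int nr_segs;
            struct bio *split;

            /*
             * bio_split_rw returns the front part that fits within
             * max_bytes and the queue limits, or NULL once the
             * remainder fits as a whole.
             */
            while ((split = bio_split_rw(bio, &q->limits, &nr_segs, bs,
                                         max_bytes)))
                    submit_bio(split);
            submit_bio(bio);
    }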

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c   | 3 ++-
 include/linux/bio.h | 4 ++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index ff04e9290715a..e68295462977b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -267,7 +267,7 @@ static bool bvec_split_segs(struct queue_limits *lim, const struct bio_vec *bv,
  * responsible for ensuring that @bs is only destroyed after processing of the
  * split bio has finished.
  */
-static struct bio *bio_split_rw(struct bio *bio, struct queue_limits *lim,
+struct bio *bio_split_rw(struct bio *bio, struct queue_limits *lim,
 		unsigned *segs, struct bio_set *bs, unsigned max_bytes)
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
@@ -317,6 +317,7 @@ static struct bio *bio_split_rw(struct bio *bio, struct queue_limits *lim,
 	bio_clear_polled(bio);
 	return bio_split(bio, bytes >> SECTOR_SHIFT, GFP_NOIO, bs);
 }
+EXPORT_SYMBOL_GPL(bio_split_rw);
 
 /**
  * __bio_split_to_limits - split a bio to fit the queue limits
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ca22b06700a94..46890f8235401 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -12,6 +12,8 @@
 
 #define BIO_MAX_VECS		256U
 
+struct queue_limits;
+
 static inline unsigned int bio_max_segs(unsigned int nr_segs)
 {
 	return min(nr_segs, BIO_MAX_VECS);
@@ -375,6 +377,8 @@ static inline void bip_set_seed(struct bio_integrity_payload *bip,
 void bio_trim(struct bio *bio, sector_t offset, sector_t size);
 extern struct bio *bio_split(struct bio *bio, int sectors,
 			     gfp_t gfp, struct bio_set *bs);
+struct bio *bio_split_rw(struct bio *bio, struct queue_limits *lim,
+		unsigned *segs, struct bio_set *bs, unsigned max_bytes);
 
 /**
  * bio_next_split - get next @sectors from a bio, splitting if necessary
-- 
2.30.2



* [PATCH 02/17] btrfs: stop tracking failed reads in the I/O tree
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
  2022-09-01  7:42 ` [PATCH 01/17] block: export bio_split_rw Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01  8:55   ` Qu Wenruo
  2022-09-07 17:52   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 03/17] btrfs: move repair_io_failure to volumes.c Christoph Hellwig
                   ` (17 subsequent siblings)
  19 siblings, 2 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

There is a separate I/O failure tree to track failed reads, so remove
the extra EXTENT_DAMAGED bit in the I/O tree.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent-io-tree.h        |  1 -
 fs/btrfs/extent_io.c             | 16 +---------------
 fs/btrfs/tests/extent-io-tests.c |  1 -
 include/trace/events/btrfs.h     |  1 -
 4 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index ec2f8b8e6faa7..e218bb56d86ac 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -17,7 +17,6 @@ struct io_failure_record;
 #define EXTENT_NODATASUM	(1U << 7)
 #define EXTENT_CLEAR_META_RESV	(1U << 8)
 #define EXTENT_NEED_WAIT	(1U << 9)
-#define EXTENT_DAMAGED		(1U << 10)
 #define EXTENT_NORESERVE	(1U << 11)
 #define EXTENT_QGROUP_RESERVED	(1U << 12)
 #define EXTENT_CLEAR_DATA_RESV	(1U << 13)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 591c191a58bc9..6ac76534d2c9e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2280,23 +2280,13 @@ int free_io_failure(struct extent_io_tree *failure_tree,
 		    struct io_failure_record *rec)
 {
 	int ret;
-	int err = 0;
 
 	set_state_failrec(failure_tree, rec->start, NULL);
 	ret = clear_extent_bits(failure_tree, rec->start,
 				rec->start + rec->len - 1,
 				EXTENT_LOCKED | EXTENT_DIRTY);
-	if (ret)
-		err = ret;
-
-	ret = clear_extent_bits(io_tree, rec->start,
-				rec->start + rec->len - 1,
-				EXTENT_DAMAGED);
-	if (ret && !err)
-		err = ret;
-
 	kfree(rec);
-	return err;
+	return ret;
 }
 
 /*
@@ -2521,7 +2511,6 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	u64 start = bbio->file_offset + bio_offset;
 	struct io_failure_record *failrec;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	const u32 sectorsize = fs_info->sectorsize;
 	int ret;
 
@@ -2573,9 +2562,6 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 			      EXTENT_LOCKED | EXTENT_DIRTY);
 	if (ret >= 0) {
 		ret = set_state_failrec(failure_tree, start, failrec);
-		/* Set the bits in the inode's tree */
-		ret = set_extent_bits(tree, start, start + sectorsize - 1,
-				      EXTENT_DAMAGED);
 	} else if (ret < 0) {
 		kfree(failrec);
 		return ERR_PTR(ret);
diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index a232b15b8021f..ba4b7601e8c0a 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -80,7 +80,6 @@ static void extent_flag_to_str(const struct extent_state *state, char *dest)
 	PRINT_ONE_FLAG(state, dest, cur, NODATASUM);
 	PRINT_ONE_FLAG(state, dest, cur, CLEAR_META_RESV);
 	PRINT_ONE_FLAG(state, dest, cur, NEED_WAIT);
-	PRINT_ONE_FLAG(state, dest, cur, DAMAGED);
 	PRINT_ONE_FLAG(state, dest, cur, NORESERVE);
 	PRINT_ONE_FLAG(state, dest, cur, QGROUP_RESERVED);
 	PRINT_ONE_FLAG(state, dest, cur, CLEAR_DATA_RESV);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 73df80d462dc8..f8a4118b16574 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -154,7 +154,6 @@ FLUSH_STATES
 	{ EXTENT_NODATASUM,		"NODATASUM"},		\
 	{ EXTENT_CLEAR_META_RESV,	"CLEAR_META_RESV"},	\
 	{ EXTENT_NEED_WAIT,		"NEED_WAIT"},		\
-	{ EXTENT_DAMAGED,		"DAMAGED"},		\
 	{ EXTENT_NORESERVE,		"NORESERVE"},		\
 	{ EXTENT_QGROUP_RESERVED,	"QGROUP_RESERVED"},	\
 	{ EXTENT_CLEAR_DATA_RESV,	"CLEAR_DATA_RESV"},	\
-- 
2.30.2



* [PATCH 03/17] btrfs: move repair_io_failure to volumes.c
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
  2022-09-01  7:42 ` [PATCH 01/17] block: export bio_split_rw Christoph Hellwig
  2022-09-01  7:42 ` [PATCH 02/17] btrfs: stop tracking failed reads in the I/O tree Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-07 17:54   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer Christoph Hellwig
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

repair_io_failure ties directly into all the gory low-level details of
mapping a bio with a logical address to the actual physical location.
Move it right below btrfs_submit_bio to keep all the related logic
together.

Also move btrfs_repair_eb_io_failure to its caller in disk-io.c now that
repair_io_failure is available in a header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/disk-io.c   |  24 +++++++++
 fs/btrfs/extent_io.c | 118 +------------------------------------------
 fs/btrfs/extent_io.h |   1 -
 fs/btrfs/volumes.c   |  91 +++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h   |   3 ++
 5 files changed, 120 insertions(+), 117 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 912e0b2bd0c5f..a88d6c3b59042 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -249,6 +249,30 @@ int btrfs_verify_level_key(struct extent_buffer *eb, int level,
 	return ret;
 }
 
+static int btrfs_repair_eb_io_failure(const struct extent_buffer *eb,
+				      int mirror_num)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	u64 start = eb->start;
+	int i, num_pages = num_extent_pages(eb);
+	int ret = 0;
+
+	if (sb_rdonly(fs_info->sb))
+		return -EROFS;
+
+	for (i = 0; i < num_pages; i++) {
+		struct page *p = eb->pages[i];
+
+		ret = btrfs_repair_io_failure(fs_info, 0, start, PAGE_SIZE,
+				start, p, start - page_offset(p), mirror_num);
+		if (ret)
+			break;
+		start += PAGE_SIZE;
+	}
+
+	return ret;
+}
+
 /*
  * helper to read a given tree block, doing retries as required when
  * the checksums don't match and we have alternate mirrors to try.
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6ac76534d2c9e..c83cc5677a08a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2289,120 +2289,6 @@ int free_io_failure(struct extent_io_tree *failure_tree,
 	return ret;
 }
 
-/*
- * this bypasses the standard btrfs submit functions deliberately, as
- * the standard behavior is to write all copies in a raid setup. here we only
- * want to write the one bad copy. so we do the mapping for ourselves and issue
- * submit_bio directly.
- * to avoid any synchronization issues, wait for the data after writing, which
- * actually prevents the read that triggered the error from finishing.
- * currently, there can be no more than two copies of every data bit. thus,
- * exactly one rewrite is required.
- */
-static int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
-			     u64 length, u64 logical, struct page *page,
-			     unsigned int pg_offset, int mirror_num)
-{
-	struct btrfs_device *dev;
-	struct bio_vec bvec;
-	struct bio bio;
-	u64 map_length = 0;
-	u64 sector;
-	struct btrfs_io_context *bioc = NULL;
-	int ret = 0;
-
-	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
-	BUG_ON(!mirror_num);
-
-	if (btrfs_repair_one_zone(fs_info, logical))
-		return 0;
-
-	map_length = length;
-
-	/*
-	 * Avoid races with device replace and make sure our bioc has devices
-	 * associated to its stripes that don't go away while we are doing the
-	 * read repair operation.
-	 */
-	btrfs_bio_counter_inc_blocked(fs_info);
-	if (btrfs_is_parity_mirror(fs_info, logical, length)) {
-		/*
-		 * Note that we don't use BTRFS_MAP_WRITE because it's supposed
-		 * to update all raid stripes, but here we just want to correct
-		 * bad stripe, thus BTRFS_MAP_READ is abused to only get the bad
-		 * stripe's dev and sector.
-		 */
-		ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
-				      &map_length, &bioc, 0);
-		if (ret)
-			goto out_counter_dec;
-		ASSERT(bioc->mirror_num == 1);
-	} else {
-		ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical,
-				      &map_length, &bioc, mirror_num);
-		if (ret)
-			goto out_counter_dec;
-		BUG_ON(mirror_num != bioc->mirror_num);
-	}
-
-	sector = bioc->stripes[bioc->mirror_num - 1].physical >> 9;
-	dev = bioc->stripes[bioc->mirror_num - 1].dev;
-	btrfs_put_bioc(bioc);
-
-	if (!dev || !dev->bdev ||
-	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
-		ret = -EIO;
-		goto out_counter_dec;
-	}
-
-	bio_init(&bio, dev->bdev, &bvec, 1, REQ_OP_WRITE | REQ_SYNC);
-	bio.bi_iter.bi_sector = sector;
-	__bio_add_page(&bio, page, length, pg_offset);
-
-	btrfsic_check_bio(&bio);
-	ret = submit_bio_wait(&bio);
-	if (ret) {
-		/* try to remap that extent elsewhere? */
-		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS);
-		goto out_bio_uninit;
-	}
-
-	btrfs_info_rl_in_rcu(fs_info,
-		"read error corrected: ino %llu off %llu (dev %s sector %llu)",
-				  ino, start,
-				  rcu_str_deref(dev->name), sector);
-	ret = 0;
-
-out_bio_uninit:
-	bio_uninit(&bio);
-out_counter_dec:
-	btrfs_bio_counter_dec(fs_info);
-	return ret;
-}
-
-int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
-{
-	struct btrfs_fs_info *fs_info = eb->fs_info;
-	u64 start = eb->start;
-	int i, num_pages = num_extent_pages(eb);
-	int ret = 0;
-
-	if (sb_rdonly(fs_info->sb))
-		return -EROFS;
-
-	for (i = 0; i < num_pages; i++) {
-		struct page *p = eb->pages[i];
-
-		ret = repair_io_failure(fs_info, 0, start, PAGE_SIZE, start, p,
-					start - page_offset(p), mirror_num);
-		if (ret)
-			break;
-		start += PAGE_SIZE;
-	}
-
-	return ret;
-}
-
 static int next_mirror(const struct io_failure_record *failrec, int cur_mirror)
 {
 	if (cur_mirror == failrec->num_copies)
@@ -2460,7 +2346,7 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
 	mirror = failrec->this_mirror;
 	do {
 		mirror = prev_mirror(failrec, mirror);
-		repair_io_failure(fs_info, ino, start, failrec->len,
+		btrfs_repair_io_failure(fs_info, ino, start, failrec->len,
 				  failrec->logical, page, pg_offset, mirror);
 	} while (mirror != failrec->failed_mirror);
 
@@ -2600,7 +2486,7 @@ int btrfs_repair_one_sector(struct inode *inode, struct btrfs_bio *failed_bbio,
 	 *
 	 * Since we're only doing repair for one sector, we only need to get
 	 * a good copy of the failed sector and if we succeed, we have setup
-	 * everything for repair_io_failure to do the rest for us.
+	 * everything for btrfs_repair_io_failure to do the rest for us.
 	 */
 	failrec->this_mirror = next_mirror(failrec, failrec->this_mirror);
 	if (failrec->this_mirror == failrec->failed_mirror) {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 69a86ae6fd508..e653e64598bf7 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -243,7 +243,6 @@ void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array);
 
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
-int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num);
 
 /*
  * When IO fails, either with EIO or csum verification fails, we
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 19f7858aa2b91..dff735e36da96 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6902,6 +6902,97 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror
 	}
 }
 
+/*
+ * Submit a repair write.
+ *
+ * This bypasses btrfs_submit_bio deliberately, as that writes all copies in a
+ * RAID setup.  Here we only want to write the one bad copy, so we do the
+ * mapping ourselves and submit the bio directly.
+ *
+ * The I/O is issued synchronously to block the repair read completion from
+ * freeing the bio.
+ */
+int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
+			    u64 length, u64 logical, struct page *page,
+			    unsigned int pg_offset, int mirror_num)
+{
+	struct btrfs_device *dev;
+	struct bio_vec bvec;
+	struct bio bio;
+	u64 map_length = 0;
+	u64 sector;
+	struct btrfs_io_context *bioc = NULL;
+	int ret = 0;
+
+	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
+	BUG_ON(!mirror_num);
+
+	if (btrfs_repair_one_zone(fs_info, logical))
+		return 0;
+
+	map_length = length;
+
+	/*
+	 * Avoid races with device replace and make sure our bioc has devices
+	 * associated to its stripes that don't go away while we are doing the
+	 * read repair operation.
+	 */
+	btrfs_bio_counter_inc_blocked(fs_info);
+	if (btrfs_is_parity_mirror(fs_info, logical, length)) {
+		/*
+		 * Note that we don't use BTRFS_MAP_WRITE because it's supposed
+		 * to update all raid stripes, but here we just want to correct
+		 * bad stripe, thus BTRFS_MAP_READ is abused to only get the bad
+		 * stripe's dev and sector.
+		 */
+		ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
+				      &map_length, &bioc, 0);
+		if (ret)
+			goto out_counter_dec;
+		ASSERT(bioc->mirror_num == 1);
+	} else {
+		ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical,
+				      &map_length, &bioc, mirror_num);
+		if (ret)
+			goto out_counter_dec;
+		BUG_ON(mirror_num != bioc->mirror_num);
+	}
+
+	sector = bioc->stripes[bioc->mirror_num - 1].physical >> 9;
+	dev = bioc->stripes[bioc->mirror_num - 1].dev;
+	btrfs_put_bioc(bioc);
+
+	if (!dev || !dev->bdev ||
+	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
+		ret = -EIO;
+		goto out_counter_dec;
+	}
+
+	bio_init(&bio, dev->bdev, &bvec, 1, REQ_OP_WRITE | REQ_SYNC);
+	bio.bi_iter.bi_sector = sector;
+	__bio_add_page(&bio, page, length, pg_offset);
+
+	btrfsic_check_bio(&bio);
+	ret = submit_bio_wait(&bio);
+	if (ret) {
+		/* try to remap that extent elsewhere? */
+		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS);
+		goto out_bio_uninit;
+	}
+
+	btrfs_info_rl_in_rcu(fs_info,
+		"read error corrected: ino %llu off %llu (dev %s sector %llu)",
+				  ino, start,
+				  rcu_str_deref(dev->name), sector);
+	ret = 0;
+
+out_bio_uninit:
+	bio_uninit(&bio);
+out_counter_dec:
+	btrfs_bio_counter_dec(fs_info);
+	return ret;
+}
+
 static bool dev_args_match_fs_devices(const struct btrfs_dev_lookup_args *args,
 				      const struct btrfs_fs_devices *fs_devices)
 {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f19a1cd1bfcf2..b368356fa78a1 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -598,6 +598,9 @@ struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
 					    u64 type);
 void btrfs_mapping_tree_free(struct extent_map_tree *tree);
 void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror_num);
+int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
+			    u64 length, u64 logical, struct page *page,
+			    unsigned int pg_offset, int mirror_num);
 int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 		       fmode_t flags, void *holder);
 struct btrfs_device *btrfs_scan_one_device(const char *path,
-- 
2.30.2



* [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (2 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 03/17] btrfs: move repair_io_failure to volumes.c Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01  9:04   ` Qu Wenruo
  2022-09-07 18:15   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 05/17] btrfs: handle checksum generation in " Christoph Hellwig
                   ` (15 subsequent siblings)
  19 siblings, 2 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

Currently btrfs handles checksum validation and repair in the end I/O
handler for the btrfs_bio.  This leads to a lot of duplicate code
plus issues with varying semantics or bugs, e.g.

 - the until recently completely broken repair for compressed extents
 - the fact that encoded reads validate the checksums but do not kick
   off read repair
 - the inconsistent checking of the BTRFS_FS_STATE_NO_CSUMS flag

This commit revamps the checksum validation and repair code to instead
work below the btrfs_submit_bio interfaces.  For this to work we need
to make sure an inode is available, so that is added as a parameter
to btrfs_bio_alloc.  With that btrfs_submit_bio can preload
btrfs_bio.csum from the csum tree without help from the upper layers,
and the low-level I/O completion can iterate over the bio and verify
the checksums.

In case of a checksum failure (or a plain old I/O error), the repair
is now kicked off before the upper level ->end_io handler is invoked.
Tracking of the repair status is massively simplified by just keeping
a small failed_bio structure per bio with failed sectors and otherwise
using the information in the repair bio.  The per-inode I/O failure
tree can be entirely removed.

The saved bvec_iter in the btrfs_bio is now completely managed by
btrfs_submit_bio and must not be accessed by the callers.

There is one significant behavior change here:  If repair fails, or is
impossible to begin with, the whole bio will be failed to the upper
layer.  This is the behavior that all I/O submitters except for
buffered I/O already emulated in their end_io handlers.  For buffered
I/O this now means that a large readahead request can fail due to a
single bad sector, but as readahead errors are ignored, a subsequent
readpage of that sector will still be able to read it if it is
actually accessed.  This also matches the I/O failure handling in
other file systems.
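
As a hedged sketch of the low-level read completion loop described
above (btrfs_check_read_bio and btrfs_queue_repair are made-up names;
only btrfs_data_csum_ok matches the helper added in this patch):

    static void btrfs_check_read_bio(struct btrfs_bio *bbio,
                                     struct btrfs_device *dev)
    {
            struct btrfs_fs_info *fs_info = btrfs_sb(bbio->inode->i_sb);
            struct bvec_iter iter = bbio->iter;
            u32 offset = 0;

            /* Walk the saved iter sector by sector and verify csums. */
            while (iter.bi_size) {
                    struct bio_vec bv = bio_iter_iovec(&bbio->bio, iter);

                    bv.bv_len = min(bv.bv_len, fs_info->sectorsize);
                    if (!btrfs_data_csum_ok(bbio, dev, offset, &bv))
                            btrfs_queue_repair(bbio, offset, &bv);

                    bio_advance_iter_single(&bbio->bio, &iter,
                                            fs_info->sectorsize);
                    offset += fs_info->sectorsize;
            }
    }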

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/btrfs_inode.h       |   5 -
 fs/btrfs/compression.c       |  54 +----
 fs/btrfs/ctree.h             |  13 +-
 fs/btrfs/extent-io-tree.h    |  18 --
 fs/btrfs/extent_io.c         | 451 +----------------------------------
 fs/btrfs/extent_io.h         |  28 ---
 fs/btrfs/file-item.c         |  42 ++--
 fs/btrfs/inode.c             | 320 ++++---------------------
 fs/btrfs/volumes.c           | 238 ++++++++++++++++--
 fs/btrfs/volumes.h           |  49 ++--
 include/trace/events/btrfs.h |   1 -
 11 files changed, 320 insertions(+), 899 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index b160b8e124e01..4cb9898869019 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -91,11 +91,6 @@ struct btrfs_inode {
 	/* the io_tree does range state (DIRTY, LOCKED etc) */
 	struct extent_io_tree io_tree;
 
-	/* special utility tree used to record which mirrors have already been
-	 * tried when checksums fail for a given block
-	 */
-	struct extent_io_tree io_failure_tree;
-
 	/*
 	 * Keep track of where the inode has extent items mapped in order to
 	 * make sure the i_size adjustments are accurate
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 1c77de3239bc4..f932415a4f1df 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -159,53 +159,15 @@ static void finish_compressed_bio_read(struct compressed_bio *cb)
 	kfree(cb);
 }
 
-/*
- * Verify the checksums and kick off repair if needed on the uncompressed data
- * before decompressing it into the original bio and freeing the uncompressed
- * pages.
- */
 static void end_compressed_bio_read(struct btrfs_bio *bbio)
 {
 	struct compressed_bio *cb = bbio->private;
-	struct inode *inode = cb->inode;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_inode *bi = BTRFS_I(inode);
-	bool csum = !(bi->flags & BTRFS_INODE_NODATASUM) &&
-		    !test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state);
-	blk_status_t status = bbio->bio.bi_status;
-	struct bvec_iter iter;
-	struct bio_vec bv;
-	u32 offset;
-
-	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
-		u64 start = bbio->file_offset + offset;
-
-		if (!status &&
-		    (!csum || !btrfs_check_data_csum(inode, bbio, offset,
-						     bv.bv_page, bv.bv_offset))) {
-			clean_io_failure(fs_info, &bi->io_failure_tree,
-					 &bi->io_tree, start, bv.bv_page,
-					 btrfs_ino(bi), bv.bv_offset);
-		} else {
-			int ret;
-
-			refcount_inc(&cb->pending_ios);
-			ret = btrfs_repair_one_sector(inode, bbio, offset,
-						      bv.bv_page, bv.bv_offset,
-						      btrfs_submit_data_read_bio);
-			if (ret) {
-				refcount_dec(&cb->pending_ios);
-				status = errno_to_blk_status(ret);
-			}
-		}
-	}
 
-	if (status)
-		cb->status = status;
+	if (bbio->bio.bi_status)
+		cb->status = bbio->bio.bi_status;
 
 	if (refcount_dec_and_test(&cb->pending_ios))
 		finish_compressed_bio_read(cb);
-	btrfs_bio_free_csum(bbio);
 	bio_put(&bbio->bio);
 }
 
@@ -342,7 +304,7 @@ static struct bio *alloc_compressed_bio(struct compressed_bio *cb, u64 disk_byte
 	struct bio *bio;
 	int ret;
 
-	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, endio_func, cb);
+	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, cb->inode, endio_func, cb);
 	bio->bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
 
 	em = btrfs_get_chunk_map(fs_info, disk_bytenr, fs_info->sectorsize);
@@ -778,10 +740,6 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 			submit = true;
 
 		if (submit) {
-			/* Save the original iter for read repair */
-			if (bio_op(comp_bio) == REQ_OP_READ)
-				btrfs_bio(comp_bio)->iter = comp_bio->bi_iter;
-
 			/*
 			 * Save the initial offset of this chunk, as there
 			 * is no direct correlation between compressed pages and
@@ -790,12 +748,6 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 			 */
 			btrfs_bio(comp_bio)->file_offset = file_offset;
 
-			ret = btrfs_lookup_bio_sums(inode, comp_bio, NULL);
-			if (ret) {
-				btrfs_bio_end_io(btrfs_bio(comp_bio), ret);
-				break;
-			}
-
 			ASSERT(comp_bio->bi_iter.bi_size);
 			btrfs_submit_bio(fs_info, comp_bio, mirror_num);
 			comp_bio = NULL;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0069bc86c04f1..3dcb0d5f8faa0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3344,7 +3344,7 @@ int btrfs_find_orphan_item(struct btrfs_root *root, u64 offset);
 /* file-item.c */
 int btrfs_del_csums(struct btrfs_trans_handle *trans,
 		    struct btrfs_root *root, u64 bytenr, u64 len);
-blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst);
+int btrfs_lookup_bio_sums(struct btrfs_bio *bbio);
 int btrfs_insert_hole_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root, u64 objectid, u64 pos,
 			     u64 num_bytes);
@@ -3375,15 +3375,8 @@ u64 btrfs_file_extent_end(const struct btrfs_path *path);
 void btrfs_submit_data_write_bio(struct inode *inode, struct bio *bio, int mirror_num);
 void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
 			int mirror_num, enum btrfs_compression_type compress_type);
-int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
-			    u32 pgoff, u8 *csum, const u8 * const csum_expected);
-int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
-			  u32 bio_offset, struct page *page, u32 pgoff);
-unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
-				    u32 bio_offset, struct page *page,
-				    u64 start, u64 end);
-int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
-			  u32 bio_offset, struct page *page, u32 pgoff);
+bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
+			u32 bio_offset, struct bio_vec *bv);
 struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
 					   u64 start, u64 len);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index e218bb56d86ac..a1afe6e15943e 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -4,7 +4,6 @@
 #define BTRFS_EXTENT_IO_TREE_H
 
 struct extent_changeset;
-struct io_failure_record;
 
 /* Bits for the extent state */
 #define EXTENT_DIRTY		(1U << 0)
@@ -55,7 +54,6 @@ enum {
 	IO_TREE_FS_EXCLUDED_EXTENTS,
 	IO_TREE_BTREE_INODE_IO,
 	IO_TREE_INODE_IO,
-	IO_TREE_INODE_IO_FAILURE,
 	IO_TREE_RELOC_BLOCKS,
 	IO_TREE_TRANS_DIRTY_PAGES,
 	IO_TREE_ROOT_DIRTY_LOG_PAGES,
@@ -88,8 +86,6 @@ struct extent_state {
 	refcount_t refs;
 	u32 state;
 
-	struct io_failure_record *failrec;
-
 #ifdef CONFIG_BTRFS_DEBUG
 	struct list_head leak_list;
 #endif
@@ -246,18 +242,4 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 			       u64 *end, u64 max_bytes,
 			       struct extent_state **cached_state);
 
-/* This should be reworked in the future and put elsewhere. */
-struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start);
-int set_state_failrec(struct extent_io_tree *tree, u64 start,
-		      struct io_failure_record *failrec);
-void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start,
-		u64 end);
-int free_io_failure(struct extent_io_tree *failure_tree,
-		    struct extent_io_tree *io_tree,
-		    struct io_failure_record *rec);
-int clean_io_failure(struct btrfs_fs_info *fs_info,
-		     struct extent_io_tree *failure_tree,
-		     struct extent_io_tree *io_tree, u64 start,
-		     struct page *page, u64 ino, unsigned int pg_offset);
-
 #endif /* BTRFS_EXTENT_IO_TREE_H */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c83cc5677a08a..d8c43e2111a99 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -326,7 +326,6 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
 	if (!state)
 		return state;
 	state->state = 0;
-	state->failrec = NULL;
 	RB_CLEAR_NODE(&state->rb_node);
 	btrfs_leak_debug_add(&leak_lock, &state->leak_list, &states);
 	refcount_set(&state->refs, 1);
@@ -2159,66 +2158,6 @@ u64 count_range_bits(struct extent_io_tree *tree,
 	return total_bytes;
 }
 
-/*
- * set the private field for a given byte offset in the tree.  If there isn't
- * an extent_state there already, this does nothing.
- */
-int set_state_failrec(struct extent_io_tree *tree, u64 start,
-		      struct io_failure_record *failrec)
-{
-	struct rb_node *node;
-	struct extent_state *state;
-	int ret = 0;
-
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state->failrec = failrec;
-out:
-	spin_unlock(&tree->lock);
-	return ret;
-}
-
-struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start)
-{
-	struct rb_node *node;
-	struct extent_state *state;
-	struct io_failure_record *failrec;
-
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		failrec = ERR_PTR(-ENOENT);
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		failrec = ERR_PTR(-ENOENT);
-		goto out;
-	}
-
-	failrec = state->failrec;
-out:
-	spin_unlock(&tree->lock);
-	return failrec;
-}
-
 /*
  * searches a range in the state tree for a given mask.
  * If 'filled' == 1, this returns 1 only if every extent in the tree
@@ -2275,258 +2214,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	return bitset;
 }
 
-int free_io_failure(struct extent_io_tree *failure_tree,
-		    struct extent_io_tree *io_tree,
-		    struct io_failure_record *rec)
-{
-	int ret;
-
-	set_state_failrec(failure_tree, rec->start, NULL);
-	ret = clear_extent_bits(failure_tree, rec->start,
-				rec->start + rec->len - 1,
-				EXTENT_LOCKED | EXTENT_DIRTY);
-	kfree(rec);
-	return ret;
-}
-
-static int next_mirror(const struct io_failure_record *failrec, int cur_mirror)
-{
-	if (cur_mirror == failrec->num_copies)
-		return cur_mirror + 1 - failrec->num_copies;
-	return cur_mirror + 1;
-}
-
-static int prev_mirror(const struct io_failure_record *failrec, int cur_mirror)
-{
-	if (cur_mirror == 1)
-		return failrec->num_copies;
-	return cur_mirror - 1;
-}
-
-/*
- * each time an IO finishes, we do a fast check in the IO failure tree
- * to see if we need to process or clean up an io_failure_record
- */
-int clean_io_failure(struct btrfs_fs_info *fs_info,
-		     struct extent_io_tree *failure_tree,
-		     struct extent_io_tree *io_tree, u64 start,
-		     struct page *page, u64 ino, unsigned int pg_offset)
-{
-	u64 private;
-	struct io_failure_record *failrec;
-	struct extent_state *state;
-	int mirror;
-	int ret;
-
-	private = 0;
-	ret = count_range_bits(failure_tree, &private, (u64)-1, 1,
-			       EXTENT_DIRTY, 0);
-	if (!ret)
-		return 0;
-
-	failrec = get_state_failrec(failure_tree, start);
-	if (IS_ERR(failrec))
-		return 0;
-
-	BUG_ON(!failrec->this_mirror);
-
-	if (sb_rdonly(fs_info->sb))
-		goto out;
-
-	spin_lock(&io_tree->lock);
-	state = find_first_extent_bit_state(io_tree,
-					    failrec->start,
-					    EXTENT_LOCKED);
-	spin_unlock(&io_tree->lock);
-
-	if (!state || state->start > failrec->start ||
-	    state->end < failrec->start + failrec->len - 1)
-		goto out;
-
-	mirror = failrec->this_mirror;
-	do {
-		mirror = prev_mirror(failrec, mirror);
-		btrfs_repair_io_failure(fs_info, ino, start, failrec->len,
-				  failrec->logical, page, pg_offset, mirror);
-	} while (mirror != failrec->failed_mirror);
-
-out:
-	free_io_failure(failure_tree, io_tree, failrec);
-	return 0;
-}
-
-/*
- * Can be called when
- * - hold extent lock
- * - under ordered extent
- * - the inode is freeing
- */
-void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
-{
-	struct extent_io_tree *failure_tree = &inode->io_failure_tree;
-	struct io_failure_record *failrec;
-	struct extent_state *state, *next;
-
-	if (RB_EMPTY_ROOT(&failure_tree->state))
-		return;
-
-	spin_lock(&failure_tree->lock);
-	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
-	while (state) {
-		if (state->start > end)
-			break;
-
-		ASSERT(state->end <= end);
-
-		next = next_state(state);
-
-		failrec = state->failrec;
-		free_extent_state(state);
-		kfree(failrec);
-
-		state = next;
-	}
-	spin_unlock(&failure_tree->lock);
-}
-
-static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
-							     struct btrfs_bio *bbio,
-							     unsigned int bio_offset)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	u64 start = bbio->file_offset + bio_offset;
-	struct io_failure_record *failrec;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	const u32 sectorsize = fs_info->sectorsize;
-	int ret;
-
-	failrec = get_state_failrec(failure_tree, start);
-	if (!IS_ERR(failrec)) {
-		btrfs_debug(fs_info,
-	"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu",
-			failrec->logical, failrec->start, failrec->len);
-		/*
-		 * when data can be on disk more than twice, add to failrec here
-		 * (e.g. with a list for failed_mirror) to make
-		 * clean_io_failure() clean all those errors at once.
-		 */
-		ASSERT(failrec->this_mirror == bbio->mirror_num);
-		ASSERT(failrec->len == fs_info->sectorsize);
-		return failrec;
-	}
-
-	failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
-	if (!failrec)
-		return ERR_PTR(-ENOMEM);
-
-	failrec->start = start;
-	failrec->len = sectorsize;
-	failrec->failed_mirror = bbio->mirror_num;
-	failrec->this_mirror = bbio->mirror_num;
-	failrec->logical = (bbio->iter.bi_sector << SECTOR_SHIFT) + bio_offset;
-
-	btrfs_debug(fs_info,
-		    "new io failure record logical %llu start %llu",
-		    failrec->logical, start);
-
-	failrec->num_copies = btrfs_num_copies(fs_info, failrec->logical, sectorsize);
-	if (failrec->num_copies == 1) {
-		/*
-		 * We only have a single copy of the data, so don't bother with
-		 * all the retry and error correction code that follows. No
-		 * matter what the error is, it is very likely to persist.
-		 */
-		btrfs_debug(fs_info,
-			"cannot repair logical %llu num_copies %d",
-			failrec->logical, failrec->num_copies);
-		kfree(failrec);
-		return ERR_PTR(-EIO);
-	}
-
-	/* Set the bits in the private failure tree */
-	ret = set_extent_bits(failure_tree, start, start + sectorsize - 1,
-			      EXTENT_LOCKED | EXTENT_DIRTY);
-	if (ret >= 0) {
-		ret = set_state_failrec(failure_tree, start, failrec);
-	} else if (ret < 0) {
-		kfree(failrec);
-		return ERR_PTR(ret);
-	}
-
-	return failrec;
-}
-
-int btrfs_repair_one_sector(struct inode *inode, struct btrfs_bio *failed_bbio,
-			    u32 bio_offset, struct page *page, unsigned int pgoff,
-			    submit_bio_hook_t *submit_bio_hook)
-{
-	u64 start = failed_bbio->file_offset + bio_offset;
-	struct io_failure_record *failrec;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct bio *failed_bio = &failed_bbio->bio;
-	const int icsum = bio_offset >> fs_info->sectorsize_bits;
-	struct bio *repair_bio;
-	struct btrfs_bio *repair_bbio;
-
-	btrfs_debug(fs_info,
-		   "repair read error: read error at %llu", start);
-
-	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
-
-	failrec = btrfs_get_io_failure_record(inode, failed_bbio, bio_offset);
-	if (IS_ERR(failrec))
-		return PTR_ERR(failrec);
-
-	/*
-	 * There are two premises:
-	 * a) deliver good data to the caller
-	 * b) correct the bad sectors on disk
-	 *
-	 * Since we're only doing repair for one sector, we only need to get
-	 * a good copy of the failed sector and if we succeed, we have setup
-	 * everything for btrfs_repair_io_failure to do the rest for us.
-	 */
-	failrec->this_mirror = next_mirror(failrec, failrec->this_mirror);
-	if (failrec->this_mirror == failrec->failed_mirror) {
-		btrfs_debug(fs_info,
-			"failed to repair num_copies %d this_mirror %d failed_mirror %d",
-			failrec->num_copies, failrec->this_mirror, failrec->failed_mirror);
-		free_io_failure(failure_tree, tree, failrec);
-		return -EIO;
-	}
-
-	repair_bio = btrfs_bio_alloc(1, REQ_OP_READ, failed_bbio->end_io,
-				     failed_bbio->private);
-	repair_bbio = btrfs_bio(repair_bio);
-	repair_bbio->file_offset = start;
-	repair_bio->bi_iter.bi_sector = failrec->logical >> 9;
-
-	if (failed_bbio->csum) {
-		const u32 csum_size = fs_info->csum_size;
-
-		repair_bbio->csum = repair_bbio->csum_inline;
-		memcpy(repair_bbio->csum,
-		       failed_bbio->csum + csum_size * icsum, csum_size);
-	}
-
-	bio_add_page(repair_bio, page, failrec->len, pgoff);
-	repair_bbio->iter = repair_bio->bi_iter;
-
-	btrfs_debug(btrfs_sb(inode->i_sb),
-		    "repair read error: submitting new read to mirror %d",
-		    failrec->this_mirror);
-
-	/*
-	 * At this point we have a bio, so any errors from submit_bio_hook()
-	 * will be handled by the endio on the repair_bio, so we can't return an
-	 * error here.
-	 */
-	submit_bio_hook(inode, repair_bio, failrec->this_mirror, 0);
-	return BLK_STS_OK;
-}
-
 static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
@@ -2555,84 +2242,6 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 		btrfs_subpage_end_reader(fs_info, page, start, len);
 }
 
-static void end_sector_io(struct page *page, u64 offset, bool uptodate)
-{
-	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
-	const u32 sectorsize = inode->root->fs_info->sectorsize;
-	struct extent_state *cached = NULL;
-
-	end_page_read(page, uptodate, offset, sectorsize);
-	if (uptodate)
-		set_extent_uptodate(&inode->io_tree, offset,
-				    offset + sectorsize - 1, &cached, GFP_ATOMIC);
-	unlock_extent_cached_atomic(&inode->io_tree, offset,
-				    offset + sectorsize - 1, &cached);
-}
-
-static void submit_data_read_repair(struct inode *inode,
-				    struct btrfs_bio *failed_bbio,
-				    u32 bio_offset, const struct bio_vec *bvec,
-				    unsigned int error_bitmap)
-{
-	const unsigned int pgoff = bvec->bv_offset;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct page *page = bvec->bv_page;
-	const u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
-	const u64 end = start + bvec->bv_len - 1;
-	const u32 sectorsize = fs_info->sectorsize;
-	const int nr_bits = (end + 1 - start) >> fs_info->sectorsize_bits;
-	int i;
-
-	BUG_ON(bio_op(&failed_bbio->bio) == REQ_OP_WRITE);
-
-	/* This repair is only for data */
-	ASSERT(is_data_inode(inode));
-
-	/* We're here because we had some read errors or csum mismatch */
-	ASSERT(error_bitmap);
-
-	/*
-	 * We only get called on buffered IO, thus page must be mapped and bio
-	 * must not be cloned.
-	 */
-	ASSERT(page->mapping && !bio_flagged(&failed_bbio->bio, BIO_CLONED));
-
-	/* Iterate through all the sectors in the range */
-	for (i = 0; i < nr_bits; i++) {
-		const unsigned int offset = i * sectorsize;
-		bool uptodate = false;
-		int ret;
-
-		if (!(error_bitmap & (1U << i))) {
-			/*
-			 * This sector has no error, just end the page read
-			 * and unlock the range.
-			 */
-			uptodate = true;
-			goto next;
-		}
-
-		ret = btrfs_repair_one_sector(inode, failed_bbio,
-				bio_offset + offset, page, pgoff + offset,
-				btrfs_submit_data_read_bio);
-		if (!ret) {
-			/*
-			 * We have submitted the read repair, the page release
-			 * will be handled by the endio function of the
-			 * submitted repair bio.
-			 * Thus we don't need to do any thing here.
-			 */
-			continue;
-		}
-		/*
-		 * Continue on failed repair, otherwise the remaining sectors
-		 * will not be properly unlocked.
-		 */
-next:
-		end_sector_io(page, start + offset, uptodate);
-	}
-}
-
 /* lots and lots of room for performance fixes in the end_bio funcs */
 
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
@@ -2835,7 +2444,6 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 {
 	struct bio *bio = &bbio->bio;
 	struct bio_vec *bvec;
-	struct extent_io_tree *tree, *failure_tree;
 	struct processed_extent processed = { 0 };
 	/*
 	 * The offset to the beginning of a bio, since one bio can never be
@@ -2852,8 +2460,6 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 		const u32 sectorsize = fs_info->sectorsize;
-		unsigned int error_bitmap = (unsigned int)-1;
-		bool repair = false;
 		u64 start;
 		u64 end;
 		u32 len;
@@ -2862,8 +2468,6 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 			"end_bio_extent_readpage: bi_sector=%llu, err=%d, mirror=%u",
 			bio->bi_iter.bi_sector, bio->bi_status,
 			bbio->mirror_num);
-		tree = &BTRFS_I(inode)->io_tree;
-		failure_tree = &BTRFS_I(inode)->io_failure_tree;
 
 		/*
 		 * We always issue full-sector reads, but if some block in a
@@ -2887,27 +2491,15 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 		len = bvec->bv_len;
 
 		mirror = bbio->mirror_num;
-		if (likely(uptodate)) {
-			if (is_data_inode(inode)) {
-				error_bitmap = btrfs_verify_data_csum(bbio,
-						bio_offset, page, start, end);
-				if (error_bitmap)
-					uptodate = false;
-			} else {
-				if (btrfs_validate_metadata_buffer(bbio,
-						page, start, end, mirror))
-					uptodate = false;
-			}
-		}
+		if (uptodate && !is_data_inode(inode) &&
+		    btrfs_validate_metadata_buffer(bbio, page, start, end,
+						   mirror))
+			uptodate = false;
 
 		if (likely(uptodate)) {
 			loff_t i_size = i_size_read(inode);
 			pgoff_t end_index = i_size >> PAGE_SHIFT;
 
-			clean_io_failure(BTRFS_I(inode)->root->fs_info,
-					 failure_tree, tree, start, page,
-					 btrfs_ino(BTRFS_I(inode)), 0);
-
 			/*
 			 * Zero out the remaining part if this range straddles
 			 * i_size.
@@ -2924,19 +2516,7 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 				zero_user_segment(page, zero_start,
 						  offset_in_page(end) + 1);
 			}
-		} else if (is_data_inode(inode)) {
-			/*
-			 * Only try to repair bios that actually made it to a
-			 * device.  If the bio failed to be submitted mirror
-			 * is 0 and we need to fail it without retrying.
-			 *
-			 * This also includes the high level bios for compressed
-			 * extents - these never make it to a device and repair
-			 * is already handled on the lower compressed bio.
-			 */
-			if (mirror > 0)
-				repair = true;
-		} else {
+		} else if (!is_data_inode(inode)) {
 			struct extent_buffer *eb;
 
 			eb = find_extent_buffer_readpage(fs_info, page, start);
@@ -2945,19 +2525,10 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 			atomic_dec(&eb->io_pages);
 		}
 
-		if (repair) {
-			/*
-			 * submit_data_read_repair() will handle all the good
-			 * and bad sectors, we just continue to the next bvec.
-			 */
-			submit_data_read_repair(inode, bbio, bio_offset, bvec,
-						error_bitmap);
-		} else {
-			/* Update page status and unlock */
-			end_page_read(page, uptodate, start, len);
-			endio_readpage_release_extent(&processed, BTRFS_I(inode),
-					start, end, PageUptodate(page));
-		}
+		/* Update page status and unlock */
+		end_page_read(page, uptodate, start, len);
+		endio_readpage_release_extent(&processed, BTRFS_I(inode),
+				start, end, PageUptodate(page));
 
 		ASSERT(bio_offset + len > bio_offset);
 		bio_offset += len;
@@ -2965,7 +2536,6 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 	}
 	/* Release the last extent */
 	endio_readpage_release_extent(&processed, NULL, 0, 0, false);
-	btrfs_bio_free_csum(bbio);
 	bio_put(bio);
 }
 
@@ -3158,7 +2728,8 @@ static int alloc_new_bio(struct btrfs_inode *inode,
 	struct bio *bio;
 	int ret;
 
-	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, end_io_func, NULL);
+	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, &inode->vfs_inode, end_io_func,
+			      NULL);
 	/*
 	 * For compressed page range, its disk_bytenr is always @disk_bytenr
 	 * passed in, no matter if we have added any range into previous bio.
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index e653e64598bf7..caf3343d1a36c 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -57,17 +57,11 @@ enum {
 #define BITMAP_LAST_BYTE_MASK(nbits) \
 	(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
 
-struct btrfs_bio;
 struct btrfs_root;
 struct btrfs_inode;
 struct btrfs_fs_info;
-struct io_failure_record;
 struct extent_io_tree;
 
-typedef void (submit_bio_hook_t)(struct inode *inode, struct bio *bio,
-					 int mirror_num,
-					 enum btrfs_compression_type compress_type);
-
 typedef blk_status_t (extent_submit_bio_start_t)(struct inode *inode,
 		struct bio *bio, u64 dio_file_offset);
 
@@ -244,28 +238,6 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array);
 
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the sector is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-	struct page *page;
-	u64 start;
-	u64 len;
-	u64 logical;
-	int this_mirror;
-	int failed_mirror;
-	int num_copies;
-};
-
-int btrfs_repair_one_sector(struct inode *inode, struct btrfs_bio *failed_bbio,
-			    u32 bio_offset, struct page *page, unsigned int pgoff,
-			    submit_bio_hook_t *submit_bio_hook);
-
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 bool find_lock_delalloc_range(struct inode *inode,
 			     struct page *locked_page, u64 *start,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 29999686d234c..ffbac8f257908 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -359,27 +359,27 @@ static int search_file_offset_in_bio(struct bio *bio, struct inode *inode,
  *       NULL, the checksum buffer is allocated and returned in
  *       btrfs_bio(bio)->csum instead.
  *
- * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
+ * Return: -ENOMEM if allocating memory fails, 0 otherwise.
  */
-blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst)
+int btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
 {
+	struct inode *inode = bbio->inode;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-	struct btrfs_bio *bbio = NULL;
+	struct bio *bio = &bbio->bio;
 	struct btrfs_path *path;
 	const u32 sectorsize = fs_info->sectorsize;
 	const u32 csum_size = fs_info->csum_size;
 	u32 orig_len = bio->bi_iter.bi_size;
 	u64 orig_disk_bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 cur_disk_bytenr;
-	u8 *csum;
 	const unsigned int nblocks = orig_len >> fs_info->sectorsize_bits;
 	int count = 0;
-	blk_status_t ret = BLK_STS_OK;
+	int ret = 0;
 
 	if ((BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) ||
 	    test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state))
-		return BLK_STS_OK;
+		return 0;
 
 	/*
 	 * This function is only called for read bio.
@@ -396,23 +396,16 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 	ASSERT(bio_op(bio) == REQ_OP_READ);
 	path = btrfs_alloc_path();
 	if (!path)
-		return BLK_STS_RESOURCE;
-
-	if (!dst) {
-		bbio = btrfs_bio(bio);
+		return -ENOMEM;
 
-		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
-			bbio->csum = kmalloc_array(nblocks, csum_size, GFP_NOFS);
-			if (!bbio->csum) {
-				btrfs_free_path(path);
-				return BLK_STS_RESOURCE;
-			}
-		} else {
-			bbio->csum = bbio->csum_inline;
+	if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
+		bbio->csum = kmalloc_array(nblocks, csum_size, GFP_NOFS);
+		if (!bbio->csum) {
+			btrfs_free_path(path);
+			return -ENOMEM;
 		}
-		csum = bbio->csum;
 	} else {
-		csum = dst;
+		bbio->csum = bbio->csum_inline;
 	}
 
 	/*
@@ -451,14 +444,15 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 		ASSERT(cur_disk_bytenr - orig_disk_bytenr < UINT_MAX);
 		sector_offset = (cur_disk_bytenr - orig_disk_bytenr) >>
 				fs_info->sectorsize_bits;
-		csum_dst = csum + sector_offset * csum_size;
+		csum_dst = bbio->csum + sector_offset * csum_size;
 
 		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
 					 search_len, csum_dst);
 		if (count < 0) {
-			ret = errno_to_blk_status(count);
-			if (bbio)
-				btrfs_bio_free_csum(bbio);
+			ret = count;
+			if (bbio->csum != bbio->csum_inline)
+				kfree(bbio->csum);
+			bbio->csum = NULL;
 			break;
 		}
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b9d40e25d978c..b3466015008c7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -85,9 +85,6 @@ struct btrfs_dio_private {
 	 */
 	refcount_t refs;
 
-	/* Array of checksums */
-	u8 *csums;
-
 	/* This must be last */
 	struct bio bio;
 };
@@ -2735,9 +2732,6 @@ void btrfs_submit_data_write_bio(struct inode *inode, struct bio *bio, int mirro
 void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
 			int mirror_num, enum btrfs_compression_type compress_type)
 {
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	blk_status_t ret;
-
 	if (compress_type != BTRFS_COMPRESS_NONE) {
 		/*
 		 * btrfs_submit_compressed_read will handle completing the bio
@@ -2747,20 +2741,7 @@ void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
 		return;
 	}
 
-	/* Save the original iter for read repair */
-	btrfs_bio(bio)->iter = bio->bi_iter;
-
-	/*
-	 * Lookup bio sums does extra checks around whether we need to csum or
-	 * not, which is why we ignore skip_sum here.
-	 */
-	ret = btrfs_lookup_bio_sums(inode, bio, NULL);
-	if (ret) {
-		btrfs_bio_end_io(btrfs_bio(bio), ret);
-		return;
-	}
-
-	btrfs_submit_bio(fs_info, bio, mirror_num);
+	btrfs_submit_bio(btrfs_sb(inode->i_sb), bio, mirror_num);
 }
 
 /*
@@ -3238,8 +3219,6 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 					ordered_extent->disk_num_bytes);
 	}
 
-	btrfs_free_io_failure_record(inode, start, end);
-
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
 		logical_len = ordered_extent->truncated_len;
@@ -3417,133 +3396,64 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 }
 
 /*
- * Verify the checksum for a single sector without any extra action that depend
- * on the type of I/O.
+ * btrfs_data_csum_ok - verify the checksum of a single data sector
+ * @bbio:	btrfs_bio which contains the csum
+ * @dev:	device the sector is on
+ * @bio_offset:	offset to the beginning of the bio (in bytes)
+ * @bv:		bio_vec to check
+ *
+ * Check if the checksum on a data block is valid.  When a checksum mismatch is
+ * detected, report the error and fill the corrupted range with zero.
+ *
+ * Return %true if the sector is ok or had no checksum to start with, else
+ * %false.
  */
-int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
-			    u32 pgoff, u8 *csum, const u8 * const csum_expected)
+bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
+			u32 bio_offset, struct bio_vec *bv)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(bbio->inode->i_sb);
+	struct btrfs_inode *bi = BTRFS_I(bbio->inode);
+	u64 file_offset = bbio->file_offset + bio_offset;
+	u64 end = file_offset + bv->bv_len - 1;
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+	u8 *csum_expected;
+	u8 csum[BTRFS_CSUM_SIZE];
 	char *kaddr;
 
-	ASSERT(pgoff + fs_info->sectorsize <= PAGE_SIZE);
+	ASSERT(bv->bv_len == fs_info->sectorsize);
+
+	if (!bbio->csum)
+		return true;
+
+	if (btrfs_is_data_reloc_root(bi->root) &&
+	    test_range_bit(&bi->io_tree, file_offset, end, EXTENT_NODATASUM,
+			1, NULL)) {
+		/* Skip the range without csum for data reloc inode */
+		clear_extent_bits(&bi->io_tree, file_offset, end,
+				  EXTENT_NODATASUM);
+		return true;
+	}
+
+	csum_expected = btrfs_csum_ptr(fs_info, bbio->csum, bio_offset);
 
 	shash->tfm = fs_info->csum_shash;
 
-	kaddr = kmap_local_page(page) + pgoff;
+	kaddr = bvec_kmap_local(bv);
 	crypto_shash_digest(shash, kaddr, fs_info->sectorsize, csum);
 	kunmap_local(kaddr);
 
 	if (memcmp(csum, csum_expected, fs_info->csum_size))
-		return -EIO;
-	return 0;
-}
-
-/*
- * check_data_csum - verify checksum of one sector of uncompressed data
- * @inode:	inode
- * @bbio:	btrfs_bio which contains the csum
- * @bio_offset:	offset to the beginning of the bio (in bytes)
- * @page:	page where is the data to be verified
- * @pgoff:	offset inside the page
- *
- * The length of such check is always one sector size.
- *
- * When csum mismatch is detected, we will also report the error and fill the
- * corrupted range with zero. (Thus it needs the extra parameters)
- */
-int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
-			  u32 bio_offset, struct page *page, u32 pgoff)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	u32 len = fs_info->sectorsize;
-	u8 *csum_expected;
-	u8 csum[BTRFS_CSUM_SIZE];
-
-	ASSERT(pgoff + len <= PAGE_SIZE);
-
-	csum_expected = btrfs_csum_ptr(fs_info, bbio->csum, bio_offset);
-
-	if (btrfs_check_sector_csum(fs_info, page, pgoff, csum, csum_expected))
 		goto zeroit;
-	return 0;
+	return true;
 
 zeroit:
-	btrfs_print_data_csum_error(BTRFS_I(inode),
-				    bbio->file_offset + bio_offset,
-				    csum, csum_expected, bbio->mirror_num);
-	if (bbio->device)
-		btrfs_dev_stat_inc_and_print(bbio->device,
+	btrfs_print_data_csum_error(BTRFS_I(bbio->inode), file_offset, csum,
+				    csum_expected, bbio->mirror_num);
+	if (dev)
+		btrfs_dev_stat_inc_and_print(dev,
 					     BTRFS_DEV_STAT_CORRUPTION_ERRS);
-	memzero_page(page, pgoff, len);
-	return -EIO;
-}
-
-/*
- * When reads are done, we need to check csums to verify the data is correct.
- * if there's a match, we allow the bio to finish.  If not, the code in
- * extent_io.c will try to find good copies for us.
- *
- * @bio_offset:	offset to the beginning of the bio (in bytes)
- * @start:	file offset of the range start
- * @end:	file offset of the range end (inclusive)
- *
- * Return a bitmap where bit set means a csum mismatch, and bit not set means
- * csum match.
- */
-unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
-				    u32 bio_offset, struct page *page,
-				    u64 start, u64 end)
-{
-	struct inode *inode = page->mapping->host;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-	struct btrfs_root *root = BTRFS_I(inode)->root;
-	const u32 sectorsize = root->fs_info->sectorsize;
-	u32 pg_off;
-	unsigned int result = 0;
-
-	/*
-	 * This only happens for NODATASUM or compressed read.
-	 * Normally this should be covered by above check for compressed read
-	 * or the next check for NODATASUM.  Just do a quicker exit here.
-	 */
-	if (bbio->csum == NULL)
-		return 0;
-
-	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-		return 0;
-
-	if (unlikely(test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state)))
-		return 0;
-
-	ASSERT(page_offset(page) <= start &&
-	       end <= page_offset(page) + PAGE_SIZE - 1);
-	for (pg_off = offset_in_page(start);
-	     pg_off < offset_in_page(end);
-	     pg_off += sectorsize, bio_offset += sectorsize) {
-		u64 file_offset = pg_off + page_offset(page);
-		int ret;
-
-		if (btrfs_is_data_reloc_root(root) &&
-		    test_range_bit(io_tree, file_offset,
-				   file_offset + sectorsize - 1,
-				   EXTENT_NODATASUM, 1, NULL)) {
-			/* Skip the range without csum for data reloc inode */
-			clear_extent_bits(io_tree, file_offset,
-					  file_offset + sectorsize - 1,
-					  EXTENT_NODATASUM);
-			continue;
-		}
-		ret = btrfs_check_data_csum(inode, bbio, bio_offset, page, pg_off);
-		if (ret < 0) {
-			const int nr_bit = (pg_off - offset_in_page(start)) >>
-				     root->fs_info->sectorsize_bits;
-
-			result |= (1U << nr_bit);
-		}
-	}
-	return result;
+	memzero_bvec(bv);
+	return false;
 }
 
 /*
@@ -5437,8 +5347,6 @@ void btrfs_evict_inode(struct inode *inode)
 	if (is_bad_inode(inode))
 		goto no_delete;
 
-	btrfs_free_io_failure_record(BTRFS_I(inode), 0, (u64)-1);
-
 	if (test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags))
 		goto no_delete;
 
@@ -7974,60 +7882,9 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 			      dip->file_offset + dip->bytes - 1);
 	}
 
-	kfree(dip->csums);
 	bio_endio(&dip->bio);
 }
 
-static void submit_dio_repair_bio(struct inode *inode, struct bio *bio,
-				  int mirror_num,
-				  enum btrfs_compression_type compress_type)
-{
-	struct btrfs_dio_private *dip = btrfs_bio(bio)->private;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-
-	BUG_ON(bio_op(bio) == REQ_OP_WRITE);
-
-	refcount_inc(&dip->refs);
-	btrfs_submit_bio(fs_info, bio, mirror_num);
-}
-
-static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
-					     struct btrfs_bio *bbio,
-					     const bool uptodate)
-{
-	struct inode *inode = dip->inode;
-	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
-	blk_status_t err = BLK_STS_OK;
-	struct bvec_iter iter;
-	struct bio_vec bv;
-	u32 offset;
-
-	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
-		u64 start = bbio->file_offset + offset;
-
-		if (uptodate &&
-		    (!csum || !btrfs_check_data_csum(inode, bbio, offset, bv.bv_page,
-					       bv.bv_offset))) {
-			clean_io_failure(fs_info, failure_tree, io_tree, start,
-					 bv.bv_page, btrfs_ino(BTRFS_I(inode)),
-					 bv.bv_offset);
-		} else {
-			int ret;
-
-			ret = btrfs_repair_one_sector(inode, bbio, offset,
-					bv.bv_page, bv.bv_offset,
-					submit_dio_repair_bio);
-			if (ret)
-				err = errno_to_blk_status(ret);
-		}
-	}
-
-	return err;
-}
-
 static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode,
 						     struct bio *bio,
 						     u64 dio_file_offset)
@@ -8041,18 +7898,14 @@ static void btrfs_end_dio_bio(struct btrfs_bio *bbio)
 	struct bio *bio = &bbio->bio;
 	blk_status_t err = bio->bi_status;
 
-	if (err)
+	if (err) {
 		btrfs_warn(BTRFS_I(dip->inode)->root->fs_info,
 			   "direct IO failed ino %llu rw %d,%u sector %#Lx len %u err no %d",
 			   btrfs_ino(BTRFS_I(dip->inode)), bio_op(bio),
 			   bio->bi_opf, bio->bi_iter.bi_sector,
 			   bio->bi_iter.bi_size, err);
-
-	if (bio_op(bio) == REQ_OP_READ)
-		err = btrfs_check_read_dio_bio(dip, bbio, !err);
-
-	if (err)
 		dip->bio.bi_status = err;
+	}
 
 	btrfs_record_physical_zoned(dip->inode, bbio->file_offset, bio);
 
@@ -8064,13 +7917,8 @@ static void btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 				 u64 file_offset, int async_submit)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_dio_private *dip = btrfs_bio(bio)->private;
 	blk_status_t ret;
-
-	/* Save the original iter for read repair */
-	if (btrfs_op(bio) == BTRFS_MAP_READ)
-		btrfs_bio(bio)->iter = bio->bi_iter;
-
+
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
 		goto map;
 
@@ -8090,9 +7938,6 @@ static void btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 			btrfs_bio_end_io(btrfs_bio(bio), ret);
 			return;
 		}
-	} else {
-		btrfs_bio(bio)->csum = btrfs_csum_ptr(fs_info, dip->csums,
-						      file_offset - dip->file_offset);
 	}
 map:
 	btrfs_submit_bio(fs_info, bio, 0);
@@ -8104,7 +7949,6 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 	struct btrfs_dio_private *dip =
 		container_of(dio_bio, struct btrfs_dio_private, bio);
 	struct inode *inode = iter->inode;
-	const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	const bool raid56 = (btrfs_data_alloc_profile(fs_info) &
 			     BTRFS_BLOCK_GROUP_RAID56_MASK);
@@ -8125,25 +7969,6 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 	dip->file_offset = file_offset;
 	dip->bytes = dio_bio->bi_iter.bi_size;
 	refcount_set(&dip->refs, 1);
-	dip->csums = NULL;
-
-	if (!write && !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-		unsigned int nr_sectors =
-			(dio_bio->bi_iter.bi_size >> fs_info->sectorsize_bits);
-
-		/*
-		 * Load the csums up front to reduce csum tree searches and
-		 * contention when submitting bios.
-		 */
-		status = BLK_STS_RESOURCE;
-		dip->csums = kcalloc(nr_sectors, fs_info->csum_size, GFP_NOFS);
-		if (!dip)
-			goto out_err;
-
-		status = btrfs_lookup_bio_sums(inode, dio_bio, dip->csums);
-		if (status != BLK_STS_OK)
-			goto out_err;
-	}
 
 	start_sector = dio_bio->bi_iter.bi_sector;
 	submit_len = dio_bio->bi_iter.bi_size;
@@ -8171,7 +7996,7 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 		 * the allocation is backed by btrfs_bioset.
 		 */
 		bio = btrfs_bio_clone_partial(dio_bio, clone_offset, clone_len,
-					      btrfs_end_dio_bio, dip);
+					      inode, btrfs_end_dio_bio, dip);
 		btrfs_bio(bio)->file_offset = file_offset;
 
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
@@ -8918,12 +8743,9 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	inode = &ei->vfs_inode;
 	extent_map_tree_init(&ei->extent_tree);
 	extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO, inode);
-	extent_io_tree_init(fs_info, &ei->io_failure_tree,
-			    IO_TREE_INODE_IO_FAILURE, inode);
 	extent_io_tree_init(fs_info, &ei->file_extent_tree,
 			    IO_TREE_INODE_FILE_EXTENT, inode);
 	ei->io_tree.track_uptodate = true;
-	ei->io_failure_tree.track_uptodate = true;
 	atomic_set(&ei->sync_writers, 0);
 	mutex_init(&ei->log_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
@@ -10370,7 +10192,6 @@ struct btrfs_encoded_read_private {
 	wait_queue_head_t wait;
 	atomic_t pending;
 	blk_status_t status;
-	bool skip_csum;
 };
 
 static blk_status_t submit_encoded_read_bio(struct btrfs_inode *inode,
@@ -10378,57 +10199,17 @@ static blk_status_t submit_encoded_read_bio(struct btrfs_inode *inode,
 {
 	struct btrfs_encoded_read_private *priv = btrfs_bio(bio)->private;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	blk_status_t ret;
-
-	if (!priv->skip_csum) {
-		ret = btrfs_lookup_bio_sums(&inode->vfs_inode, bio, NULL);
-		if (ret)
-			return ret;
-	}
 
 	atomic_inc(&priv->pending);
 	btrfs_submit_bio(fs_info, bio, mirror_num);
 	return BLK_STS_OK;
 }
 
-static blk_status_t btrfs_encoded_read_verify_csum(struct btrfs_bio *bbio)
-{
-	const bool uptodate = (bbio->bio.bi_status == BLK_STS_OK);
-	struct btrfs_encoded_read_private *priv = bbio->private;
-	struct btrfs_inode *inode = priv->inode;
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	u32 sectorsize = fs_info->sectorsize;
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
-	u32 bio_offset = 0;
-
-	if (priv->skip_csum || !uptodate)
-		return bbio->bio.bi_status;
-
-	bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
-		unsigned int i, nr_sectors, pgoff;
-
-		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
-		pgoff = bvec->bv_offset;
-		for (i = 0; i < nr_sectors; i++) {
-			ASSERT(pgoff < PAGE_SIZE);
-			if (btrfs_check_data_csum(&inode->vfs_inode, bbio, bio_offset,
-					    bvec->bv_page, pgoff))
-				return BLK_STS_IOERR;
-			bio_offset += sectorsize;
-			pgoff += sectorsize;
-		}
-	}
-	return BLK_STS_OK;
-}
-
 static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
 {
 	struct btrfs_encoded_read_private *priv = bbio->private;
-	blk_status_t status;
 
-	status = btrfs_encoded_read_verify_csum(bbio);
-	if (status) {
+	if (bbio->bio.bi_status) {
 		/*
 		 * The memory barrier implied by the atomic_dec_return() here
 		 * pairs with the memory barrier implied by the
@@ -10437,11 +10218,10 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
 		 * write is observed before the load of status in
 		 * btrfs_encoded_read_regular_fill_pages().
 		 */
-		WRITE_ONCE(priv->status, status);
+		WRITE_ONCE(priv->status, bbio->bio.bi_status);
 	}
 	if (!atomic_dec_return(&priv->pending))
 		wake_up(&priv->wait);
-	btrfs_bio_free_csum(bbio);
 	bio_put(&bbio->bio);
 }
 
@@ -10454,7 +10234,6 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 		.inode = inode,
 		.file_offset = file_offset,
 		.pending = ATOMIC_INIT(1),
-		.skip_csum = (inode->flags & BTRFS_INODE_NODATASUM),
 	};
 	unsigned long i = 0;
 	u64 cur = 0;
@@ -10490,6 +10269,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 
 			if (!bio) {
 				bio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ,
+						      &inode->vfs_inode,
 						      btrfs_encoded_read_endio,
 						      &priv);
 				bio->bi_iter.bi_sector =
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dff735e36da96..b8472ab466abe 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -35,6 +35,14 @@
 #include "zoned.h"
 
 static struct bio_set btrfs_bioset;
+static struct bio_set btrfs_repair_bioset;
+static mempool_t btrfs_failed_bio_pool;
+
+struct btrfs_failed_bio {
+	struct btrfs_bio *bbio;
+	int num_copies;
+	atomic_t repair_count;
+};
 
 #define BTRFS_BLOCK_GROUP_STRIPE_MASK	(BTRFS_BLOCK_GROUP_RAID0 | \
 					 BTRFS_BLOCK_GROUP_RAID10 | \
@@ -6646,10 +6654,11 @@ int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
  * Initialize a btrfs_bio structure.  This skips the embedded bio itself as it
  * is already initialized by the block layer.
  */
-static inline void btrfs_bio_init(struct btrfs_bio *bbio,
-				  btrfs_bio_end_io_t end_io, void *private)
+static void btrfs_bio_init(struct btrfs_bio *bbio, struct inode *inode,
+			   btrfs_bio_end_io_t end_io, void *private)
 {
 	memset(bbio, 0, offsetof(struct btrfs_bio, bio));
+	bbio->inode = inode;
 	bbio->end_io = end_io;
 	bbio->private = private;
 }
@@ -6662,16 +6671,18 @@ static inline void btrfs_bio_init(struct btrfs_bio *bbio,
  * a mempool.
  */
 struct bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
-			    btrfs_bio_end_io_t end_io, void *private)
+			    struct inode *inode, btrfs_bio_end_io_t end_io,
+			    void *private)
 {
 	struct bio *bio;
 
 	bio = bio_alloc_bioset(NULL, nr_vecs, opf, GFP_NOFS, &btrfs_bioset);
-	btrfs_bio_init(btrfs_bio(bio), end_io, private);
+	btrfs_bio_init(btrfs_bio(bio), inode, end_io, private);
 	return bio;
 }
 
 struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
+				    struct inode *inode,
 				    btrfs_bio_end_io_t end_io, void *private)
 {
 	struct bio *bio;
@@ -6681,13 +6692,174 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
 
 	bio = bio_alloc_clone(orig->bi_bdev, orig, GFP_NOFS, &btrfs_bioset);
 	bbio = btrfs_bio(bio);
-	btrfs_bio_init(bbio, end_io, private);
+	btrfs_bio_init(bbio, inode, end_io, private);
 
 	bio_trim(bio, offset >> 9, size >> 9);
-	bbio->iter = bio->bi_iter;
 	return bio;
 }
 
+static int next_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
+{
+	if (cur_mirror == fbio->num_copies)
+		return cur_mirror + 1 - fbio->num_copies;
+	return cur_mirror + 1;
+}
+
+static int prev_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
+{
+	if (cur_mirror == 1)
+		return fbio->num_copies;
+	return cur_mirror - 1;
+}
+
+static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
+{
+	if (atomic_dec_and_test(&fbio->repair_count)) {
+		fbio->bbio->end_io(fbio->bbio);
+		mempool_free(fbio, &btrfs_failed_bio_pool);
+	}
+}
+
+static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
+				 struct btrfs_device *dev)
+{
+	struct btrfs_failed_bio *fbio = repair_bbio->private;
+	struct inode *inode = repair_bbio->inode;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct bio_vec *bv = bio_first_bvec_all(&repair_bbio->bio);
+	int mirror = repair_bbio->mirror_num;
+
+	if (repair_bbio->bio.bi_status ||
+	    !btrfs_data_csum_ok(repair_bbio, dev, 0, bv)) {
+		bio_reset(&repair_bbio->bio, NULL, REQ_OP_READ);
+		repair_bbio->bio.bi_iter = repair_bbio->saved_iter;
+
+		mirror = next_repair_mirror(fbio, mirror);
+		if (mirror == fbio->bbio->mirror_num) {
+			btrfs_debug(fs_info, "no mirror left");
+			fbio->bbio->bio.bi_status = BLK_STS_IOERR;
+			goto done;
+		}
+
+		btrfs_submit_bio(fs_info, &repair_bbio->bio, mirror);
+		return;
+	}
+
+	do {
+		mirror = prev_repair_mirror(fbio, mirror);
+		btrfs_repair_io_failure(fs_info, btrfs_ino(BTRFS_I(inode)),
+				  repair_bbio->file_offset, fs_info->sectorsize,
+				  repair_bbio->saved_iter.bi_sector <<
+					SECTOR_SHIFT,
+				  bv->bv_page, bv->bv_offset, mirror);
+	} while (mirror != fbio->bbio->mirror_num);
+
+done:
+	btrfs_repair_done(fbio);
+	bio_put(&repair_bbio->bio);
+}
+
+/*
+ * Try to kick off a repair read to the next available mirror for a bad
+ * sector.
+ *
+ * This primarily tries to recover good data to serve the actual read request,
+ * but also writes the good data back to the bad mirror(s) when a read
+ * succeeds, in order to restore the redundancy.
+ */
+static void repair_one_sector(struct btrfs_bio *failed_bbio, u32 bio_offset,
+			      struct bio_vec *bv,
+			      struct btrfs_failed_bio **fbio)
+{
+	struct inode *inode = failed_bbio->inode;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	const u32 sectorsize = fs_info->sectorsize;
+	const u64 logical = failed_bbio->saved_iter.bi_sector << SECTOR_SHIFT;
+	struct btrfs_bio *repair_bbio;
+	struct bio *repair_bio;
+	int num_copies;
+	int mirror;
+
+	btrfs_debug(fs_info, "repair read error: read error at %llu",
+		    failed_bbio->file_offset + bio_offset);
+
+	num_copies = btrfs_num_copies(fs_info, logical, sectorsize);
+	if (num_copies == 1) {
+		btrfs_debug(fs_info, "no copy to repair from");
+		failed_bbio->bio.bi_status = BLK_STS_IOERR;
+		return;
+	}
+
+	if (!*fbio) {
+		*fbio = mempool_alloc(&btrfs_failed_bio_pool, GFP_NOFS);
+		(*fbio)->bbio = failed_bbio;
+		(*fbio)->num_copies = num_copies;
+		atomic_set(&(*fbio)->repair_count, 1);
+	}
+
+	atomic_inc(&(*fbio)->repair_count);
+
+	repair_bio = bio_alloc_bioset(NULL, 1, REQ_OP_READ, GFP_NOFS,
+				      &btrfs_repair_bioset);
+	repair_bio->bi_iter.bi_sector = failed_bbio->saved_iter.bi_sector;
+	bio_add_page(repair_bio, bv->bv_page, bv->bv_len, bv->bv_offset);
+
+	repair_bbio = btrfs_bio(repair_bio);
+	btrfs_bio_init(repair_bbio, failed_bbio->inode, NULL, *fbio);
+	repair_bbio->file_offset = failed_bbio->file_offset + bio_offset;
+
+	mirror = next_repair_mirror(*fbio, failed_bbio->mirror_num);
+	btrfs_debug(fs_info, "submitting repair read to mirror %d", mirror);
+	btrfs_submit_bio(fs_info, repair_bio, mirror);
+}
+
+static void btrfs_check_read_bio(struct btrfs_bio *bbio,
+				 struct btrfs_device *dev)
+{
+	struct inode *inode = bbio->inode;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	unsigned int sectorsize = fs_info->sectorsize;
+	struct bvec_iter *iter = &bbio->saved_iter;
+	blk_status_t status = bbio->bio.bi_status;
+	struct btrfs_failed_bio *fbio = NULL;
+	u32 offset = 0;
+
+	/*
+	 * Hand off repair bios to the repair code as there is no upper level
+	 * submitter for them.
+	 */
+	if (unlikely(bbio->bio.bi_pool == &btrfs_repair_bioset)) {
+		btrfs_end_repair_bio(bbio, dev);
+		return;
+	}
+
+	/* Metadata reads are checked and repaired by the submitter */
+	if (bbio->bio.bi_opf & REQ_META)
+		goto done;
+
+	/* Clear the I/O error.  A failed repair will reset it */
+	bbio->bio.bi_status = BLK_STS_OK;
+
+	while (iter->bi_size) {
+		struct bio_vec bv = bio_iter_iovec(&bbio->bio, *iter);
+
+		bv.bv_len = min(bv.bv_len, sectorsize);
+		if (status || !btrfs_data_csum_ok(bbio, dev, offset, &bv))
+			repair_one_sector(bbio, offset, &bv, &fbio);
+
+		bio_advance_iter_single(&bbio->bio, iter, sectorsize);
+		offset += sectorsize;
+	}
+
+	if (bbio->csum != bbio->csum_inline)
+		kfree(bbio->csum);
+done:
+	if (unlikely(fbio))
+		btrfs_repair_done(fbio);
+	else
+		bbio->end_io(bbio);
+}
+
 static void btrfs_log_dev_io_error(struct bio *bio, struct btrfs_device *dev)
 {
 	if (!dev || !dev->bdev)
@@ -6716,18 +6888,19 @@ static void btrfs_end_bio_work(struct work_struct *work)
 	struct btrfs_bio *bbio =
 		container_of(work, struct btrfs_bio, end_io_work);
 
-	bbio->end_io(bbio);
+	btrfs_check_read_bio(bbio, bbio->bio.bi_private);
 }
 
 static void btrfs_simple_end_io(struct bio *bio)
 {
-	struct btrfs_fs_info *fs_info = bio->bi_private;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
+	struct btrfs_device *dev = bio->bi_private;
+	struct btrfs_fs_info *fs_info = btrfs_sb(bbio->inode->i_sb);
 
 	btrfs_bio_counter_dec(fs_info);
 
 	if (bio->bi_status)
-		btrfs_log_dev_io_error(bio, bbio->device);
+		btrfs_log_dev_io_error(bio, dev);
 
 	if (bio_op(bio) == REQ_OP_READ) {
 		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
@@ -6744,7 +6917,10 @@ static void btrfs_raid56_end_io(struct bio *bio)
 
 	btrfs_bio_counter_dec(bioc->fs_info);
 	bbio->mirror_num = bioc->mirror_num;
-	bbio->end_io(bbio);
+	if (bio_op(bio) == REQ_OP_READ)
+		btrfs_check_read_bio(bbio, NULL);
+	else
+		bbio->end_io(bbio);
 
 	btrfs_put_bioc(bioc);
 }
@@ -6852,6 +7028,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 
 void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror_num)
 {
+	struct btrfs_bio *bbio = btrfs_bio(bio);
 	u64 logical = bio->bi_iter.bi_sector << 9;
 	u64 length = bio->bi_iter.bi_size;
 	u64 map_length = length;
@@ -6862,11 +7039,8 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror
 	btrfs_bio_counter_inc_blocked(fs_info);
 	ret = __btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
 				&bioc, &smap, &mirror_num, 1);
-	if (ret) {
-		btrfs_bio_counter_dec(fs_info);
-		btrfs_bio_end_io(btrfs_bio(bio), errno_to_blk_status(ret));
-		return;
-	}
+	if (ret)
+		goto fail;
 
 	if (map_length < length) {
 		btrfs_crit(fs_info,
@@ -6875,12 +7049,22 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror
 		BUG();
 	}
 
+	/*
+	 * Save the iter for the end_io handler and preload the checksums for
+	 * data reads.
+	 */
+	if (bio_op(bio) == REQ_OP_READ && !(bio->bi_opf & REQ_META)) {
+		bbio->saved_iter = bio->bi_iter;
+		ret = btrfs_lookup_bio_sums(bbio);
+		if (ret)
+			goto fail;
+	}
+
 	if (!bioc) {
 		/* Single mirror read/write fast path */
 		btrfs_bio(bio)->mirror_num = mirror_num;
-		btrfs_bio(bio)->device = smap.dev;
 		bio->bi_iter.bi_sector = smap.physical >> SECTOR_SHIFT;
-		bio->bi_private = fs_info;
+		bio->bi_private = smap.dev;
 		bio->bi_end_io = btrfs_simple_end_io;
 		btrfs_submit_dev_bio(smap.dev, bio);
 	} else if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
@@ -6900,6 +7084,11 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror
 		for (dev_nr = 0; dev_nr < total_devs; dev_nr++)
 			btrfs_submit_mirrored_bio(bioc, dev_nr);
 	}
+
+	return;
+fail:
+	btrfs_bio_counter_dec(fs_info);
+	btrfs_bio_end_io(bbio, errno_to_blk_status(ret));
 }
 
 /*
@@ -8499,10 +8688,25 @@ int __init btrfs_bioset_init(void)
 			offsetof(struct btrfs_bio, bio),
 			BIOSET_NEED_BVECS))
 		return -ENOMEM;
+	if (bioset_init(&btrfs_repair_bioset, BIO_POOL_SIZE,
+			offsetof(struct btrfs_bio, bio),
+			BIOSET_NEED_BVECS))
+		goto out_free_bioset;
+	if (mempool_init_kmalloc_pool(&btrfs_failed_bio_pool, BIO_POOL_SIZE,
+				      sizeof(struct btrfs_failed_bio)))
+		goto out_free_repair_bioset;
 	return 0;
+
+out_free_repair_bioset:
+	bioset_exit(&btrfs_repair_bioset);
+out_free_bioset:
+	bioset_exit(&btrfs_bioset);
+	return -ENOMEM;
 }
 
 void __cold btrfs_bioset_exit(void)
 {
+	mempool_exit(&btrfs_failed_bio_pool);
+	bioset_exit(&btrfs_repair_bioset);
 	bioset_exit(&btrfs_bioset);
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b368356fa78a1..58c4156caa736 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -364,27 +364,28 @@ struct btrfs_fs_devices {
 typedef void (*btrfs_bio_end_io_t)(struct btrfs_bio *bbio);
 
 /*
- * Additional info to pass along bio.
- *
- * Mostly for btrfs specific features like csum and mirror_num.
+ * High-level btrfs I/O structure.  It is allocated by btrfs_bio_alloc and
+ * passed to btrfs_submit_bio for mapping to the physical devices.
  */
 struct btrfs_bio {
-	unsigned int mirror_num;
-
-	/* for direct I/O */
+	/* Inode and offset into it that this I/O operates on. */
+	struct inode *inode;
 	u64 file_offset;
 
-	/* @device is for stripe IO submission. */
-	struct btrfs_device *device;
+	/*
+	 * Checksumming and original I/O information for internal use in the
+	 * btrfs_submit_bio machinery.
+	 */
 	u8 *csum;
 	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
-	struct bvec_iter iter;
+	struct bvec_iter saved_iter;
 
 	/* End I/O information supplied to btrfs_bio_alloc */
 	btrfs_bio_end_io_t end_io;
 	void *private;
 
-	/* For read end I/O handling */
+	/* For internal use in read end I/O handling */
+	unsigned int mirror_num;
 	struct work_struct end_io_work;
 
 	/*
@@ -403,8 +404,10 @@ int __init btrfs_bioset_init(void);
 void __cold btrfs_bioset_exit(void);
 
 struct bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
-			    btrfs_bio_end_io_t end_io, void *private);
+			    struct inode *inode, btrfs_bio_end_io_t end_io,
+			    void *private);
 struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
+				    struct inode *inode,
 				    btrfs_bio_end_io_t end_io, void *private);
 
 static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
@@ -413,30 +416,6 @@ static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
 	bbio->end_io(bbio);
 }
 
-static inline void btrfs_bio_free_csum(struct btrfs_bio *bbio)
-{
-	if (bbio->csum != bbio->csum_inline) {
-		kfree(bbio->csum);
-		bbio->csum = NULL;
-	}
-}
-
-/*
- * Iterate through a btrfs_bio (@bbio) on a per-sector basis.
- *
- * bvl        - struct bio_vec
- * bbio       - struct btrfs_bio
- * iters      - struct bvec_iter
- * bio_offset - unsigned int
- */
-#define btrfs_bio_for_each_sector(fs_info, bvl, bbio, iter, bio_offset)	\
-	for ((iter) = (bbio)->iter, (bio_offset) = 0;			\
-	     (iter).bi_size &&					\
-	     (((bvl) = bio_iter_iovec((&(bbio)->bio), (iter))), 1);	\
-	     (bio_offset) += fs_info->sectorsize,			\
-	     bio_advance_iter_single(&(bbio)->bio, &(iter),		\
-	     (fs_info)->sectorsize))
-
 struct btrfs_io_stripe {
 	struct btrfs_device *dev;
 	union {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index f8a4118b16574..ed50e81174bf4 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -84,7 +84,6 @@ struct raid56_bio_trace_info;
 	EM( IO_TREE_FS_EXCLUDED_EXTENTS,  "EXCLUDED_EXTENTS")	    \
 	EM( IO_TREE_BTREE_INODE_IO,	  "BTREE_INODE_IO")	    \
 	EM( IO_TREE_INODE_IO,		  "INODE_IO")		    \
-	EM( IO_TREE_INODE_IO_FAILURE,	  "INODE_IO_FAILURE")	    \
 	EM( IO_TREE_RELOC_BLOCKS,	  "RELOC_BLOCKS")	    \
 	EM( IO_TREE_TRANS_DIRTY_PAGES,	  "TRANS_DIRTY_PAGES")      \
 	EM( IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES")   \
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 05/17] btrfs: handle checksum generation in the storage layer
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (3 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-07 20:33   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 06/17] btrfs: handle recording of zoned writes " Christoph Hellwig
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

Instead of letting the callers of btrfs_submit_bio deal with checksumming
the (meta)data in the bio and making decisions on when to offload the
checksumming to a workqueue, leave that to btrfs_submit_bio.  To do so,
the existing btrfs_submit_bio function is split into an upper and a lower
half, so that the lower half can be offloaded to a workqueue.
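
Roughly, the new shape is as follows (a condensed sketch of the code
added below; error handling, the NODATASUM checks, the metadata
(btree_csum_one_bio) path and the RAID56 special cases are elided):

  void btrfs_submit_bio(struct btrfs_fs_info *fs_info,
  			struct bio *bio, int mirror_num)
  {
  	/* Upper half: map the bio onto the physical device(s). */
  	__btrfs_map_block(fs_info, btrfs_op(bio), ...);

  	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
  		/* Defer the expensive checksumming when useful ... */
  		if (should_async_write(bbio) &&
  		    btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num))
  			return;

  		/* ... otherwise checksum inline before submission. */
  		btrfs_csum_one_bio(bbio);
  	}

  	/* Lower half: issue the now fully prepared bio. */
  	__btrfs_submit_bio(bio, bioc, &smap, mirror_num);
  }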

The driver-private REQ_DRV flag is used to indicate the special 'bio must
be contained in a single ordered extent' case needed by compressed
writes, instead of passing a new flag all the way down the stack.
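
For reference, the flag aliasing amounts to no more than this (taken
from the volumes.h and __btrfs_submit_bio hunks below):

  /* bio only refers to one ordered extent */
  #define REQ_BTRFS_ONE_ORDERED	REQ_DRV

  /* Do not leak our private flag into the block layer */
  bio->bi_opf &= ~REQ_BTRFS_ONE_ORDERED;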

Note that this changes the behavior for direct writes to raid56 volumes,
so that async checksum offloading is no longer skipped when more I/O is
expected.  This runs counter to the argument for why the skip was added
in the first place, although I can't measure any effect of the change.
Commits later in this series will make sure each direct write is
offloaded to the workqueue as a whole, and is thus sent to the raid56
code from a single thread.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/compression.c |  13 +--
 fs/btrfs/ctree.h       |   4 +-
 fs/btrfs/disk-io.c     | 170 ++-------------------------------
 fs/btrfs/disk-io.h     |   5 -
 fs/btrfs/extent_io.h   |   3 -
 fs/btrfs/file-item.c   |  25 ++---
 fs/btrfs/inode.c       |  89 +-----------------
 fs/btrfs/volumes.c     | 208 ++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h     |   7 +-
 9 files changed, 215 insertions(+), 309 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index f932415a4f1df..53f9e123712b0 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -351,9 +351,9 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	u64 cur_disk_bytenr = disk_start;
 	u64 next_stripe_start;
 	blk_status_t ret = BLK_STS_OK;
-	int skip_sum = inode->flags & BTRFS_INODE_NODATASUM;
 	const bool use_append = btrfs_use_zone_append(inode, disk_start);
-	const enum req_op bio_op = use_append ? REQ_OP_ZONE_APPEND : REQ_OP_WRITE;
+	const blk_opf_t bio_op = REQ_BTRFS_ONE_ORDERED |
+		(use_append ? REQ_OP_ZONE_APPEND : REQ_OP_WRITE);
 
 	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
 	       IS_ALIGNED(len, fs_info->sectorsize));
@@ -431,15 +431,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			submit = true;
 
 		if (submit) {
-			if (!skip_sum) {
-				ret = btrfs_csum_one_bio(inode, bio, start, true);
-				if (ret) {
-					btrfs_bio_end_io(btrfs_bio(bio), ret);
-					break;
-				}
-			}
-
 			ASSERT(bio->bi_iter.bi_size);
+			btrfs_bio(bio)->file_offset = start;
 			btrfs_submit_bio(fs_info, bio, 0);
 			bio = NULL;
 		}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3dcb0d5f8faa0..33c3c394e43e3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3355,8 +3355,8 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root,
 			   struct btrfs_ordered_sum *sums);
-blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
-				u64 offset, bool one_ordered);
+int btrfs_csum_one_bio(struct btrfs_bio *bbio);
+int btree_csum_one_bio(struct btrfs_bio *bbio);
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 			     struct list_head *list, int search_commit);
 void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a88d6c3b59042..ceee039b65ea0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -69,23 +69,6 @@ static void btrfs_free_csum_hash(struct btrfs_fs_info *fs_info)
 		crypto_free_shash(fs_info->csum_shash);
 }
 
-/*
- * async submit bios are used to offload expensive checksumming
- * onto the worker threads.  They checksum file and metadata bios
- * just before they are sent down the IO stack.
- */
-struct async_submit_bio {
-	struct inode *inode;
-	struct bio *bio;
-	extent_submit_bio_start_t *submit_bio_start;
-	int mirror_num;
-
-	/* Optional parameter for submit_bio_start used by direct io */
-	u64 dio_file_offset;
-	struct btrfs_work work;
-	blk_status_t status;
-};
-
 /*
  * Compute the csum of a btree block and store the result to provided buffer.
  */
@@ -649,161 +632,26 @@ int btrfs_validate_metadata_buffer(struct btrfs_bio *bbio,
 	return ret;
 }
 
-static void run_one_async_start(struct btrfs_work *work)
-{
-	struct async_submit_bio *async;
-	blk_status_t ret;
-
-	async = container_of(work, struct  async_submit_bio, work);
-	ret = async->submit_bio_start(async->inode, async->bio,
-				      async->dio_file_offset);
-	if (ret)
-		async->status = ret;
-}
-
-/*
- * In order to insert checksums into the metadata in large chunks, we wait
- * until bio submission time.   All the pages in the bio are checksummed and
- * sums are attached onto the ordered extent record.
- *
- * At IO completion time the csums attached on the ordered extent record are
- * inserted into the tree.
- */
-static void run_one_async_done(struct btrfs_work *work)
-{
-	struct async_submit_bio *async =
-		container_of(work, struct  async_submit_bio, work);
-	struct inode *inode = async->inode;
-	struct btrfs_bio *bbio = btrfs_bio(async->bio);
-
-	/* If an error occurred we just want to clean up the bio and move on */
-	if (async->status) {
-		btrfs_bio_end_io(bbio, async->status);
-		return;
-	}
-
-	/*
-	 * All of the bios that pass through here are from async helpers.
-	 * Use REQ_CGROUP_PUNT to issue them from the owning cgroup's context.
-	 * This changes nothing when cgroups aren't in use.
-	 */
-	async->bio->bi_opf |= REQ_CGROUP_PUNT;
-	btrfs_submit_bio(btrfs_sb(inode->i_sb), async->bio, async->mirror_num);
-}
-
-static void run_one_async_free(struct btrfs_work *work)
-{
-	struct async_submit_bio *async;
-
-	async = container_of(work, struct  async_submit_bio, work);
-	kfree(async);
-}
-
-/*
- * Submit bio to an async queue.
- *
- * Retrun:
- * - true if the work has been succesfuly submitted
- * - false in case of error
- */
-bool btrfs_wq_submit_bio(struct inode *inode, struct bio *bio, int mirror_num,
-			 u64 dio_file_offset,
-			 extent_submit_bio_start_t *submit_bio_start)
-{
-	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
-	struct async_submit_bio *async;
-
-	async = kmalloc(sizeof(*async), GFP_NOFS);
-	if (!async)
-		return false;
-
-	async->inode = inode;
-	async->bio = bio;
-	async->mirror_num = mirror_num;
-	async->submit_bio_start = submit_bio_start;
-
-	btrfs_init_work(&async->work, run_one_async_start, run_one_async_done,
-			run_one_async_free);
-
-	async->dio_file_offset = dio_file_offset;
-
-	async->status = 0;
-
-	if (op_is_sync(bio->bi_opf))
-		btrfs_queue_work(fs_info->hipri_workers, &async->work);
-	else
-		btrfs_queue_work(fs_info->workers, &async->work);
-	return true;
-}
-
-static blk_status_t btree_csum_one_bio(struct bio *bio)
+int btree_csum_one_bio(struct btrfs_bio *bbio)
 {
-	struct bio_vec *bvec;
-	struct btrfs_root *root;
-	int ret = 0;
-	struct bvec_iter_all iter_all;
+	struct btrfs_fs_info *fs_info = btrfs_sb(bbio->inode->i_sb);
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	int ret;
 
-	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		root = BTRFS_I(bvec->bv_page->mapping->host)->root;
-		ret = csum_dirty_buffer(root->fs_info, bvec);
+	bio_for_each_segment(bvec, &bbio->bio, iter) {
+		ret = csum_dirty_buffer(fs_info, &bvec);
 		if (ret)
 			break;
 	}
 
-	return errno_to_blk_status(ret);
-}
-
-static blk_status_t btree_submit_bio_start(struct inode *inode, struct bio *bio,
-					   u64 dio_file_offset)
-{
-	/*
-	 * when we're called for a write, we're already in the async
-	 * submission context.  Just jump into btrfs_submit_bio.
-	 */
-	return btree_csum_one_bio(bio);
-}
-
-static bool should_async_write(struct btrfs_fs_info *fs_info,
-			     struct btrfs_inode *bi)
-{
-	if (btrfs_is_zoned(fs_info))
-		return false;
-	if (atomic_read(&bi->sync_writers))
-		return false;
-	if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))
-		return false;
-	return true;
+	return ret;
 }
 
 void btrfs_submit_metadata_bio(struct inode *inode, struct bio *bio, int mirror_num)
 {
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_bio *bbio = btrfs_bio(bio);
-	blk_status_t ret;
-
 	bio->bi_opf |= REQ_META;
-
-	if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
-		btrfs_submit_bio(fs_info, bio, mirror_num);
-		return;
-	}
-
-	/*
-	 * Kthread helpers are used to submit writes so that checksumming can
-	 * happen in parallel across all CPUs.
-	 */
-	if (should_async_write(fs_info, BTRFS_I(inode)) &&
-	    btrfs_wq_submit_bio(inode, bio, mirror_num, 0, btree_submit_bio_start))
-		return;
-
-	ret = btree_csum_one_bio(bio);
-	if (ret) {
-		btrfs_bio_end_io(bbio, ret);
-		return;
-	}
-
-	btrfs_submit_bio(fs_info, bio, mirror_num);
+	btrfs_submit_bio(btrfs_sb(inode->i_sb), bio, mirror_num);
 }
 
 #ifdef CONFIG_MIGRATION
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 47ad8e0a2d33f..9d4e0e36f7bb9 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -114,11 +114,6 @@ int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
 			  int atomic);
 int btrfs_read_extent_buffer(struct extent_buffer *buf, u64 parent_transid,
 			     int level, struct btrfs_key *first_key);
-bool btrfs_wq_submit_bio(struct inode *inode, struct bio *bio, int mirror_num,
-			 u64 dio_file_offset,
-			 extent_submit_bio_start_t *submit_bio_start);
-blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio,
-			  int mirror_num);
 int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root);
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index caf3343d1a36c..ddbeba7c6118a 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -62,9 +62,6 @@ struct btrfs_inode;
 struct btrfs_fs_info;
 struct extent_io_tree;
 
-typedef blk_status_t (extent_submit_bio_start_t)(struct inode *inode,
-		struct bio *bio, u64 dio_file_offset);
-
 #define INLINE_EXTENT_BUFFER_PAGES     (BTRFS_MAX_METADATA_BLOCKSIZE / PAGE_SIZE)
 struct extent_buffer {
 	u64 start;
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index ffbac8f257908..5b3279e38665b 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -613,23 +613,17 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 
 /**
  * Calculate checksums of the data contained inside a bio
- *
- * @inode:	 Owner of the data inside the bio
- * @bio:	 Contains the data to be checksummed
- * @offset:      If (u64)-1, @bio may contain discontiguous bio vecs, so the
- *               file offsets are determined from the page offsets in the bio.
- *               Otherwise, this is the starting file offset of the bio vecs in
- *               @bio, which must be contiguous.
- * @one_ordered: If true, @bio only refers to one ordered extent.
+ * @bbio:	 Contains the data to be checksummed
  */
-blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
-				u64 offset, bool one_ordered)
+int btrfs_csum_one_bio(struct btrfs_bio *bbio)
 {
+	struct btrfs_inode *inode = BTRFS_I(bbio->inode);
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+	struct bio *bio = &bbio->bio;
+	u64 offset = bbio->file_offset;
 	struct btrfs_ordered_sum *sums;
 	struct btrfs_ordered_extent *ordered = NULL;
-	const bool use_page_offsets = (offset == (u64)-1);
 	char *data;
 	struct bvec_iter iter;
 	struct bio_vec bvec;
@@ -646,7 +640,7 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 	memalloc_nofs_restore(nofs_flag);
 
 	if (!sums)
-		return BLK_STS_RESOURCE;
+		return -ENOMEM;
 
 	sums->len = bio->bi_iter.bi_size;
 	INIT_LIST_HEAD(&sums->list);
@@ -657,9 +651,6 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 	shash->tfm = fs_info->csum_shash;
 
 	bio_for_each_segment(bvec, bio, iter) {
-		if (use_page_offsets)
-			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
-
 		if (!ordered) {
 			ordered = btrfs_lookup_ordered_extent(inode, offset);
 			/*
@@ -672,7 +663,7 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 				     inode->root->root_key.objectid,
 				     btrfs_ino(inode), offset);
 				kvfree(sums);
-				return BLK_STS_IOERR;
+				return -EIO;
 			}
 		}
 
@@ -681,7 +672,7 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 						 - 1);
 
 		for (i = 0; i < blockcount; i++) {
-			if (!one_ordered &&
+			if (!(bio->bi_opf & REQ_BTRFS_ONE_ORDERED) &&
 			    !in_range(offset, ordered->file_offset,
 				      ordered->num_bytes)) {
 				unsigned long bytes_left;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b3466015008c7..88dd99997631a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2500,20 +2500,6 @@ void btrfs_clear_delalloc_extent(struct inode *vfs_inode,
 	}
 }
 
-/*
- * in order to insert checksums into the metadata in large chunks,
- * we wait until bio submission time.   All the pages in the bio are
- * checksummed and sums are attached onto the ordered extent record.
- *
- * At IO completion time the cums attached on the ordered extent record
- * are inserted into the btree
- */
-static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
-					   u64 dio_file_offset)
-{
-	return btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false);
-}
-
 /*
  * Split an extent_map at [start, start + len]
  *
@@ -2704,28 +2690,6 @@ void btrfs_submit_data_write_bio(struct inode *inode, struct bio *bio, int mirro
 		}
 	}
 
-	/*
-	 * If we need to checksum, and the I/O is not issued by fsync and
-	 * friends, that is ->sync_writers != 0, defer the submission to a
-	 * workqueue to parallelize it.
-	 *
-	 * Csum items for reloc roots have already been cloned at this point,
-	 * so they are handled as part of the no-checksum case.
-	 */
-	if (!(bi->flags & BTRFS_INODE_NODATASUM) &&
-	    !test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state) &&
-	    !btrfs_is_data_reloc_root(bi->root)) {
-		if (!atomic_read(&bi->sync_writers) &&
-		    btrfs_wq_submit_bio(inode, bio, mirror_num, 0,
-					btrfs_submit_bio_start))
-			return;
-
-		ret = btrfs_csum_one_bio(bi, bio, (u64)-1, false);
-		if (ret) {
-			btrfs_bio_end_io(btrfs_bio(bio), ret);
-			return;
-		}
-	}
 	btrfs_submit_bio(fs_info, bio, mirror_num);
 }
 
@@ -7885,13 +7849,6 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 	bio_endio(&dip->bio);
 }
 
-static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode,
-						     struct bio *bio,
-						     u64 dio_file_offset)
-{
-	return btrfs_csum_one_bio(BTRFS_I(inode), bio, dio_file_offset, false);
-}
-
 static void btrfs_end_dio_bio(struct btrfs_bio *bbio)
 {
 	struct btrfs_dio_private *dip = bbio->private;
@@ -7913,36 +7870,6 @@ static void btrfs_end_dio_bio(struct btrfs_bio *bbio)
 	btrfs_dio_private_put(dip);
 }
 
-static void btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
-				 u64 file_offset, int async_submit)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	blk_status_t ret;
-		
-	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-		goto map;
-
-	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
-		/* Check btrfs_submit_data_write_bio() for async submit rules */
-		if (async_submit && !atomic_read(&BTRFS_I(inode)->sync_writers) &&
-		    btrfs_wq_submit_bio(inode, bio, 0, file_offset,
-					btrfs_submit_bio_start_direct_io))
-			return;
-
-		/*
-		 * If we aren't doing async submit, calculate the csum of the
-		 * bio now.
-		 */
-		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, false);
-		if (ret) {
-			btrfs_bio_end_io(btrfs_bio(bio), ret);
-			return;
-		}
-	}
-map:
-	btrfs_submit_bio(fs_info, bio, 0);
-}
-
 static void btrfs_submit_direct(const struct iomap_iter *iter,
 		struct bio *dio_bio, loff_t file_offset)
 {
@@ -7950,11 +7877,8 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 		container_of(dio_bio, struct btrfs_dio_private, bio);
 	struct inode *inode = iter->inode;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	const bool raid56 = (btrfs_data_alloc_profile(fs_info) &
-			     BTRFS_BLOCK_GROUP_RAID56_MASK);
 	struct bio *bio;
 	u64 start_sector;
-	int async_submit = 0;
 	u64 submit_len;
 	u64 clone_offset = 0;
 	u64 clone_len;
@@ -8020,19 +7944,10 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 		 * We transfer the initial reference to the last bio, so we
 		 * don't need to increment the reference count for the last one.
 		 */
-		if (submit_len > 0) {
+		if (submit_len > 0)
 			refcount_inc(&dip->refs);
-			/*
-			 * If we are submitting more than one bio, submit them
-			 * all asynchronously. The exception is RAID 5 or 6, as
-			 * asynchronous checksums make it difficult to collect
-			 * full stripe writes.
-			 */
-			if (!raid56)
-				async_submit = 1;
-		}
 
-		btrfs_submit_dio_bio(bio, inode, file_offset, async_submit);
+		btrfs_submit_bio(fs_info, bio, 0);
 
 		dio_data->submitted += clone_len;
 		clone_offset += clone_len;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b8472ab466abe..2d13e8b52c94f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7026,7 +7026,170 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 	btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
 }
 
-void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror_num)
+static void __btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
+			       struct btrfs_io_stripe *smap, int mirror_num)
+{
+	/* Do not leak our private flag into the block layer */
+	bio->bi_opf &= ~REQ_BTRFS_ONE_ORDERED;
+
+	if (!bioc) {
+		/* Single mirror read/write fast path */
+		btrfs_bio(bio)->mirror_num = mirror_num;
+		bio->bi_iter.bi_sector = smap->physical >> SECTOR_SHIFT;
+		bio->bi_private = smap->dev;
+		bio->bi_end_io = btrfs_simple_end_io;
+		btrfs_submit_dev_bio(smap->dev, bio);
+	} else if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+		/* Parity RAID write or read recovery */
+		bio->bi_private = bioc;
+		bio->bi_end_io = btrfs_raid56_end_io;
+		if (bio_op(bio) == REQ_OP_READ)
+			raid56_parity_recover(bio, bioc, mirror_num);
+		else
+			raid56_parity_write(bio, bioc);
+	} else {
+		/* Write to multiple mirrors */
+		int total_devs = bioc->num_stripes;
+		int dev_nr;
+
+		bioc->orig_bio = bio;
+		for (dev_nr = 0; dev_nr < total_devs; dev_nr++)
+			btrfs_submit_mirrored_bio(bioc, dev_nr);
+	}
+}
+
+/*
+ * async submit bios are used to offload expensive checksumming
+ * onto the worker threads.
+ */
+struct async_submit_bio {
+	struct btrfs_bio *bbio;
+	struct btrfs_io_context *bioc;
+	struct btrfs_io_stripe smap;
+	int mirror_num;
+	struct btrfs_work work;
+};
+
+/*
+ * The start half of the async work: checksum the data or metadata in
+ * the bio.  This is the expensive part of write submission that gets
+ * offloaded to the worker threads.
+ *
+ * For data the csums are attached to the ordered extent record and
+ * inserted into the csum tree at I/O completion time.
+ */
+static void run_one_async_start(struct btrfs_work *work)
+{
+	struct async_submit_bio *async =
+		container_of(work, struct async_submit_bio, work);
+	struct btrfs_bio *bbio = async->bbio;
+	blk_status_t ret;
+
+	if (bbio->bio.bi_opf & REQ_META)
+		ret = btree_csum_one_bio(bbio);
+	else
+		ret = btrfs_csum_one_bio(bbio);
+	if (ret)
+		bbio->bio.bi_status = errno_to_blk_status(ret);
+}
+
+/*
+ * In order to insert checksums into the metadata in large chunks, we wait
+ * until bio submission time.   All the pages in the bio are checksummed and
+ * sums are attached onto the ordered extent record.
+ *
+ * At IO completion time the csums attached on the ordered extent record are
+ * inserted into the tree.
+ */
+static void run_one_async_done(struct btrfs_work *work)
+{
+	struct async_submit_bio *async =
+		container_of(work, struct async_submit_bio, work);
+	struct bio *bio = &async->bbio->bio;
+
+	/* If an error occurred we just want to clean up the bio and move on */
+	if (bio->bi_status) {
+		btrfs_bio_end_io(async->bbio, bio->bi_status);
+		return;
+	}
+
+	/*
+	 * All of the bios that pass through here are from async helpers.
+	 * Use REQ_CGROUP_PUNT to issue them from the owning cgroup's context.
+	 * This changes nothing when cgroups aren't in use.
+	 */
+	bio->bi_opf |= REQ_CGROUP_PUNT;
+	__btrfs_submit_bio(bio, async->bioc, &async->smap, async->mirror_num);
+}
+
+static void run_one_async_free(struct btrfs_work *work)
+{
+	kfree(container_of(work, struct async_submit_bio, work));
+}
+
+static bool should_async_write(struct btrfs_bio *bbio)
+{
+	struct btrfs_inode *bi = BTRFS_I(bbio->inode);
+
+	/*
+	 * If the I/O is not issued by fsync and friends (->sync_writers == 0),
+	 * then try to defer the submission to a workqueue to parallelize the
+	 * checksum calculation.
+	 */
+	if (atomic_read(&bi->sync_writers))
+		return false;
+
+	/*
+	 * Submit metadata writes synchronously if the checksum implementation
+	 * is fast, or we are on a zoned device that wants I/O to be submitted
+	 * in order.
+	 */
+	if (bbio->bio.bi_opf & REQ_META) {
+		struct btrfs_fs_info *fs_info = bi->root->fs_info;
+
+		if (btrfs_is_zoned(fs_info))
+			return false;
+		if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))
+			return false;
+	}
+
+	return true;
+}
+
+/*
+ * Submit bio to an async queue.
+ *
+ * Return:
+ * - true if the work has been successfully submitted
+ * - false in case of error
+ */
+static bool btrfs_wq_submit_bio(struct btrfs_bio *bbio,
+				struct btrfs_io_context *bioc,
+			        struct btrfs_io_stripe *smap, int mirror_num)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(bbio->inode->i_sb);
+	struct async_submit_bio *async;
+
+	async = kmalloc(sizeof(*async), GFP_NOFS);
+	if (!async)
+		return false;
+
+	async->bbio = bbio;
+	async->bioc = bioc;
+	async->smap = *smap;
+	async->mirror_num = mirror_num;
+
+	btrfs_init_work(&async->work, run_one_async_start, run_one_async_done,
+			run_one_async_free);
+	if (op_is_sync(bbio->bio.bi_opf))
+		btrfs_queue_work(fs_info->hipri_workers, &async->work);
+	else
+		btrfs_queue_work(fs_info->workers, &async->work);
+	return true;
+}
+
+void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
+		      int mirror_num)
 {
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	u64 logical = bio->bi_iter.bi_sector << 9;
@@ -7060,31 +7223,30 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror
 			goto fail;
 	}
 
-	if (!bioc) {
-		/* Single mirror read/write fast path */
-		btrfs_bio(bio)->mirror_num = mirror_num;
-		bio->bi_iter.bi_sector = smap.physical >> SECTOR_SHIFT;
-		bio->bi_private = smap.dev;
-		bio->bi_end_io = btrfs_simple_end_io;
-		btrfs_submit_dev_bio(smap.dev, bio);
-	} else if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-		/* Parity RAID write or read recovery */
-		bio->bi_private = bioc;
-		bio->bi_end_io = btrfs_raid56_end_io;
-		if (bio_op(bio) == REQ_OP_READ)
-			raid56_parity_recover(bio, bioc, mirror_num);
-		else
-			raid56_parity_write(bio, bioc);
-	} else {
-		/* Write to multiple mirrors */
-		int total_devs = bioc->num_stripes;
-		int dev_nr;
+	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
+		struct btrfs_inode *bi = BTRFS_I(bbio->inode);
 
-		bioc->orig_bio = bio;
-		for (dev_nr = 0; dev_nr < total_devs; dev_nr++)
-			btrfs_submit_mirrored_bio(bioc, dev_nr);
+		/*
+		 * Csum items for reloc roots have already been cloned at this
+		 * point, so they are handled as part of the no-checksum case.
+		 */
+		if (!(bi->flags & BTRFS_INODE_NODATASUM) &&
+		    !test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state) &&
+		    !btrfs_is_data_reloc_root(bi->root)) {
+			if (should_async_write(bbio) &&
+			    btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num))
+				return;
+
+			if (bio->bi_opf & REQ_META)
+				ret = btree_csum_one_bio(bbio);
+			else
+				ret = btrfs_csum_one_bio(bbio);
+			if (ret)
+				goto fail;
+		}
 	}
 
+	__btrfs_submit_bio(bio, bioc, &smap, mirror_num);
 	return;
 fail:
 	btrfs_bio_counter_dec(fs_info);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 58c4156caa736..8b248c9bd602b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -576,7 +576,12 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info);
 struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
 					    u64 type);
 void btrfs_mapping_tree_free(struct extent_map_tree *tree);
-void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror_num);
+
+/* bio only refers to one ordered extent */
+#define REQ_BTRFS_ONE_ORDERED	REQ_DRV
+
+void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
+		      int mirror_num);
 int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 			    u64 length, u64 logical, struct page *page,
 			    unsigned int pg_offset, int mirror_num);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 06/17] btrfs: handle recording of zoned writes in the storage layer
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (4 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 05/17] btrfs: handle checksum generation in " Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01  9:44   ` Johannes Thumshirn
                     ` (2 more replies)
  2022-09-01  7:42 ` [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios Christoph Hellwig
                   ` (13 subsequent siblings)
  19 siblings, 3 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

Move the code that splits the ordered extents and records their physical
location into the storage layer, so that the higher-level consumers don't
have to care about physical block numbers at all.  With a little more
block layer work this will also eventually allow removing the accounting
for the zone append write sizes in the upper layer.
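
Concretely, the single-mirror completion handler now records where a
zone append write landed before invoking the owner's end_io handler,
roughly (condensed from the volumes.c hunk below):

  static void btrfs_simple_end_io(struct bio *bio)
  {
  	struct btrfs_bio *bbio = btrfs_bio(bio);

  	/* ... counter handling and error logging elided ... */
  	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
  		btrfs_record_physical_zoned(bbio);
  	bbio->end_io(bbio);
  }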

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/compression.c  |  1 -
 fs/btrfs/extent_io.c    |  6 ------
 fs/btrfs/inode.c        | 40 ++++++++--------------------------------
 fs/btrfs/ordered-data.h |  1 +
 fs/btrfs/volumes.c      |  8 ++++++++
 fs/btrfs/zoned.c        | 13 +++++--------
 fs/btrfs/zoned.h        |  6 ++----
 7 files changed, 24 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 53f9e123712b0..1f10f86e70557 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -270,7 +270,6 @@ static void end_compressed_bio_write(struct btrfs_bio *bbio)
 	if (refcount_dec_and_test(&cb->pending_ios)) {
 		struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
 
-		btrfs_record_physical_zoned(cb->inode, cb->start, &bbio->bio);
 		queue_work(fs_info->compressed_write_workers, &cb->write_end_work);
 	}
 	bio_put(&bbio->bio);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d8c43e2111a99..4c00bdefe5b45 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2285,7 +2285,6 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
 	u64 start;
 	u64 end;
 	struct bvec_iter_all iter_all;
-	bool first_bvec = true;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
@@ -2307,11 +2306,6 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
 		start = page_offset(page) + bvec->bv_offset;
 		end = start + bvec->bv_len - 1;
 
-		if (first_bvec) {
-			btrfs_record_physical_zoned(inode, start, bio);
-			first_bvec = false;
-		}
-
 		end_extent_writepage(page, error, start, end);
 
 		btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 88dd99997631a..03953c1f176dd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2615,21 +2615,21 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
 	return ret;
 }
 
-static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
-					   struct bio *bio, loff_t file_offset)
+int btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 {
+	u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
+	u64 len = bbio->bio.bi_iter.bi_size;
+	struct btrfs_inode *bi = BTRFS_I(bbio->inode);
 	struct btrfs_ordered_extent *ordered;
-	u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 file_len;
-	u64 len = bio->bi_iter.bi_size;
 	u64 end = start + len;
 	u64 ordered_end;
 	u64 pre, post;
 	int ret = 0;
 
-	ordered = btrfs_lookup_ordered_extent(inode, file_offset);
+	ordered = btrfs_lookup_ordered_extent(bi, bbio->file_offset);
 	if (WARN_ON_ONCE(!ordered))
-		return BLK_STS_IOERR;
+		return -EIO;
 
 	/* No need to split */
 	if (ordered->disk_num_bytes == len)
@@ -2667,28 +2667,16 @@ static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
 	ret = btrfs_split_ordered_extent(ordered, pre, post);
 	if (ret)
 		goto out;
-	ret = split_zoned_em(inode, file_offset, file_len, pre, post);
+	ret = split_zoned_em(bi, bbio->file_offset, file_len, pre, post);
 
 out:
 	btrfs_put_ordered_extent(ordered);
-
-	return errno_to_blk_status(ret);
+	return ret;
 }
 
 void btrfs_submit_data_write_bio(struct inode *inode, struct bio *bio, int mirror_num)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_inode *bi = BTRFS_I(inode);
-	blk_status_t ret;
-
-	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-		ret = extract_ordered_extent(bi, bio,
-				page_offset(bio_first_bvec_all(bio)->bv_page));
-		if (ret) {
-			btrfs_bio_end_io(btrfs_bio(bio), ret);
-			return;
-		}
-	}
 
 	btrfs_submit_bio(fs_info, bio, mirror_num);
 }
@@ -7864,8 +7852,6 @@ static void btrfs_end_dio_bio(struct btrfs_bio *bbio)
 		dip->bio.bi_status = err;
 	}
 
-	btrfs_record_physical_zoned(dip->inode, bbio->file_offset, bio);
-
 	bio_put(bio);
 	btrfs_dio_private_put(dip);
 }
@@ -7923,15 +7909,6 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 					      inode, btrfs_end_dio_bio, dip);
 		btrfs_bio(bio)->file_offset = file_offset;
 
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			status = extract_ordered_extent(BTRFS_I(inode), bio,
-							file_offset);
-			if (status) {
-				bio_put(bio);
-				goto out_err;
-			}
-		}
-
 		ASSERT(submit_len >= clone_len);
 		submit_len -= clone_len;
 
@@ -7960,7 +7937,6 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
 
 out_err_em:
 	free_extent_map(em);
-out_err:
 	dio_bio->bi_status = status;
 	btrfs_dio_private_put(dip);
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 87792f85e2c4a..0cef17f4b752f 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -220,6 +220,7 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 					struct extent_state **cached_state);
 int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
 			       u64 post);
+int btrfs_extract_ordered_extent(struct btrfs_bio *bbio);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2d13e8b52c94f..5c6535e10085d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6906,6 +6906,8 @@ static void btrfs_simple_end_io(struct bio *bio)
 		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
 		queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
 	} else {
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+			btrfs_record_physical_zoned(bbio);
 		bbio->end_io(bbio);
 	}
 }
@@ -7226,6 +7228,12 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
 		struct btrfs_inode *bi = BTRFS_I(bbio->inode);
 
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
+			if (ret)
+				goto fail;
+		}
+
 		/*
 		 * Csum items for reloc roots have already been cloned at this
 		 * point, so they are handled as part of the no-checksum case.
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index dc96b3331bfb7..2638f71eec4b6 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1633,21 +1633,18 @@ bool btrfs_use_zone_append(struct btrfs_inode *inode, u64 start)
 	return ret;
 }
 
-void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
-				 struct bio *bio)
+void btrfs_record_physical_zoned(struct btrfs_bio *bbio)
 {
+	const u64 physical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
+	struct btrfs_inode *bi = BTRFS_I(bbio->inode);
 	struct btrfs_ordered_extent *ordered;
-	const u64 physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 
-	if (bio_op(bio) != REQ_OP_ZONE_APPEND)
-		return;
-
-	ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset);
+	ordered = btrfs_lookup_ordered_extent(bi, bbio->file_offset);
 	if (WARN_ON(!ordered))
 		return;
 
 	ordered->physical = physical;
-	ordered->bdev = bio->bi_bdev;
+	ordered->bdev = bbio->bio.bi_bdev;
 
 	btrfs_put_ordered_extent(ordered);
 }
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index e17462db3a842..cafa639927050 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -55,8 +55,7 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 bool btrfs_use_zone_append(struct btrfs_inode *inode, u64 start);
-void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
-				 struct bio *bio);
+void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
 void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered);
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct extent_buffer *eb,
@@ -178,8 +177,7 @@ static inline bool btrfs_use_zone_append(struct btrfs_inode *inode, u64 start)
 	return false;
 }
 
-static inline void btrfs_record_physical_zoned(struct inode *inode,
-					       u64 file_offset, struct bio *bio)
+static inline void btrfs_record_physical_zoned(struct btrfs_bio *bbio)
 {
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (5 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 06/17] btrfs: handle recording of zoned writes " Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01  9:47   ` Johannes Thumshirn
                     ` (2 more replies)
  2022-09-01  7:42 ` [PATCH 08/17] btrfs: pass the iomap bio to btrfs_submit_bio Christoph Hellwig
                   ` (12 subsequent siblings)
  19 siblings, 3 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

Currently the I/O submitters have to split bios according to the
chunk stripe boundaries.  This leads to extra lookups in the extent
trees and a lot of boilerplate code.

To drop this requirement, split the bio when __btrfs_map_block
returns a mapping that is smaller than the requested size and
keep a count of pending bios in the original btrfs_bio so that
the upper level completion is only invoked when all clones have
completed.

Based on a patch from Qu Wenruo.
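
A condensed sketch of the clone and completion scheme (reduced from
the volumes.c hunks below and renamed here to avoid clashing with the
patch's btrfs_split_bio()/btrfs_orig_bbio_end_io(); error propagation
between clones is elided):

    /* Carve off the part of the bio that fits the returned mapping. */
    static struct bio *split_for_mapping(struct bio *orig, u64 map_length)
    {
            struct btrfs_bio *orig_bbio = btrfs_bio(orig);
            struct bio *bio = bio_split(orig, map_length >> SECTOR_SHIFT,
                                        GFP_NOFS, &btrfs_clone_bioset);

            /* The clone points back at the original bio ... */
            btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode, NULL,
                           orig_bbio);
            /* ... and holds one of its pending_ios references. */
            atomic_inc(&orig_bbio->pending_ios);
            return bio;
    }

    /* All clone completions converge on the original btrfs_bio. */
    static void orig_bbio_end_io(struct btrfs_bio *bbio)
    {
            if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
                    struct btrfs_bio *orig_bbio = bbio->private;

                    bio_put(&bbio->bio);
                    bbio = orig_bbio;
            }
            if (atomic_dec_and_test(&bbio->pending_ios))
                    bbio->end_io(bbio);
    }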

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/volumes.c | 106 +++++++++++++++++++++++++++++++++++++--------
 fs/btrfs/volumes.h |   1 +
 2 files changed, 90 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5c6535e10085d..0a2d144c20604 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -35,6 +35,7 @@
 #include "zoned.h"
 
 static struct bio_set btrfs_bioset;
+static struct bio_set btrfs_clone_bioset;
 static struct bio_set btrfs_repair_bioset;
 static mempool_t btrfs_failed_bio_pool;
 
@@ -6661,6 +6662,7 @@ static void btrfs_bio_init(struct btrfs_bio *bbio, struct inode *inode,
 	bbio->inode = inode;
 	bbio->end_io = end_io;
 	bbio->private = private;
+	atomic_set(&bbio->pending_ios, 1);
 }
 
 /*
@@ -6698,6 +6700,57 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
 	return bio;
 }
 
+static struct bio *btrfs_split_bio(struct bio *orig, u64 map_length)
+{
+	struct btrfs_bio *orig_bbio = btrfs_bio(orig);
+	struct bio *bio;
+
+	bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
+			&btrfs_clone_bioset);
+	btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode, NULL, orig_bbio);
+
+	btrfs_bio(bio)->file_offset = orig_bbio->file_offset;
+	orig_bbio->file_offset += map_length;
+
+	atomic_inc(&orig_bbio->pending_ios);
+	return bio;
+}
+
+static void btrfs_orig_write_end_io(struct bio *bio);
+static void btrfs_bbio_propagate_error(struct btrfs_bio *bbio,
+				       struct btrfs_bio *orig_bbio)
+{
+	/*
+	 * For writes btrfs tolerates nr_mirrors - 1 write failures, so we
+	 * can't just blindly propagate a write failure here.
+	 * Instead increment the error count in the original I/O context so
+	 * that it is guaranteed to be larger than the error tolerance.
+	 */
+	if (bbio->bio.bi_end_io == &btrfs_orig_write_end_io) {
+		struct btrfs_io_stripe *orig_stripe = orig_bbio->bio.bi_private;
+		struct btrfs_io_context *orig_bioc = orig_stripe->bioc;
+
+		atomic_add(orig_bioc->max_errors, &orig_bioc->error);
+	} else {
+		orig_bbio->bio.bi_status = bbio->bio.bi_status;
+	}
+}
+
+static void btrfs_orig_bbio_end_io(struct btrfs_bio *bbio)
+{
+	if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
+		struct btrfs_bio *orig_bbio = bbio->private;
+
+		if (bbio->bio.bi_status)
+			btrfs_bbio_propagate_error(bbio, orig_bbio);
+		bio_put(&bbio->bio);
+		bbio = orig_bbio;
+	}
+
+	if (atomic_dec_and_test(&bbio->pending_ios))
+		bbio->end_io(bbio);
+}
+
 static int next_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
 {
 	if (cur_mirror == fbio->num_copies)
@@ -6715,7 +6768,7 @@ static int prev_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
 static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
 {
 	if (atomic_dec_and_test(&fbio->repair_count)) {
-		fbio->bbio->end_io(fbio->bbio);
+		btrfs_orig_bbio_end_io(fbio->bbio);
 		mempool_free(fbio, &btrfs_failed_bio_pool);
 	}
 }
@@ -6857,7 +6910,7 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio,
 	if (unlikely(fbio))
 		btrfs_repair_done(fbio);
 	else
-		bbio->end_io(bbio);
+		btrfs_orig_bbio_end_io(bbio);
 }
 
 static void btrfs_log_dev_io_error(struct bio *bio, struct btrfs_device *dev)
@@ -6908,7 +6961,7 @@ static void btrfs_simple_end_io(struct bio *bio)
 	} else {
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
 			btrfs_record_physical_zoned(bbio);
-		bbio->end_io(bbio);
+		btrfs_orig_bbio_end_io(bbio);
 	}
 }
 
@@ -6922,7 +6975,7 @@ static void btrfs_raid56_end_io(struct bio *bio)
 	if (bio_op(bio) == REQ_OP_READ)
 		btrfs_check_read_bio(bbio, NULL);
 	else
-		bbio->end_io(bbio);
+		btrfs_orig_bbio_end_io(bbio);
 
 	btrfs_put_bioc(bioc);
 }
@@ -6949,7 +7002,7 @@ static void btrfs_orig_write_end_io(struct bio *bio)
 	else
 		bio->bi_status = BLK_STS_OK;
 
-	bbio->end_io(bbio);
+	btrfs_orig_bbio_end_io(bbio);
 	btrfs_put_bioc(bioc);
 }
 
@@ -7190,8 +7243,8 @@ static bool btrfs_wq_submit_bio(struct btrfs_bio *bbio,
 	return true;
 }
 
-void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
-		      int mirror_num)
+static bool btrfs_submit_chunk(struct btrfs_fs_info *fs_info, struct bio *bio,
+			       int mirror_num)
 {
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	u64 logical = bio->bi_iter.bi_sector << 9;
@@ -7207,11 +7260,10 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	if (ret)
 		goto fail;
 
+	map_length = min(map_length, length);
 	if (map_length < length) {
-		btrfs_crit(fs_info,
-			   "mapping failed logical %llu bio len %llu len %llu",
-			   logical, length, map_length);
-		BUG();
+		bio = btrfs_split_bio(bio, map_length);
+		bbio = btrfs_bio(bio);
 	}
 
 	/*
@@ -7222,7 +7274,7 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		bbio->saved_iter = bio->bi_iter;
 		ret = btrfs_lookup_bio_sums(bbio);
 		if (ret)
-			goto fail;
+			goto fail_put_bio;
 	}
 
 	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
@@ -7231,7 +7283,7 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
 			ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
 			if (ret)
-				goto fail;
+				goto fail_put_bio;
 		}
 
 		/*
@@ -7243,22 +7295,36 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		    !btrfs_is_data_reloc_root(bi->root)) {
 			if (should_async_write(bbio) &&
 			    btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num))
-				return;
+				goto done;
 
 			if (bio->bi_opf & REQ_META)
 				ret = btree_csum_one_bio(bbio);
 			else
 				ret = btrfs_csum_one_bio(bbio);
 			if (ret)
-				goto fail;
+				goto fail_put_bio;
 		}
 	}
 
 	__btrfs_submit_bio(bio, bioc, &smap, mirror_num);
-	return;
+done:
+	return map_length == length;
+
+fail_put_bio:
+	if (map_length < length)
+		bio_put(bio);
 fail:
 	btrfs_bio_counter_dec(fs_info);
 	btrfs_bio_end_io(bbio, errno_to_blk_status(ret));
+	/* Do not submit another chunk */
+	return true;
+}
+
+void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
+		      int mirror_num)
+{
+	while (!btrfs_submit_chunk(fs_info, bio, mirror_num))
+		;
 }
 
 /*
@@ -8858,10 +8924,13 @@ int __init btrfs_bioset_init(void)
 			offsetof(struct btrfs_bio, bio),
 			BIOSET_NEED_BVECS))
 		return -ENOMEM;
+	if (bioset_init(&btrfs_clone_bioset, BIO_POOL_SIZE,
+			offsetof(struct btrfs_bio, bio), 0))
+		goto out_free_bioset;
 	if (bioset_init(&btrfs_repair_bioset, BIO_POOL_SIZE,
 			offsetof(struct btrfs_bio, bio),
 			BIOSET_NEED_BVECS))
-		goto out_free_bioset;
+		goto out_free_clone_bioset;
 	if (mempool_init_kmalloc_pool(&btrfs_failed_bio_pool, BIO_POOL_SIZE,
 				      sizeof(struct btrfs_failed_bio)))
 		goto out_free_repair_bioset;
@@ -8869,6 +8938,8 @@ int __init btrfs_bioset_init(void)
 
 out_free_repair_bioset:
 	bioset_exit(&btrfs_repair_bioset);
+out_free_clone_bioset:
+	bioset_exit(&btrfs_clone_bioset);
 out_free_bioset:
 	bioset_exit(&btrfs_bioset);
 	return -ENOMEM;
@@ -8878,5 +8949,6 @@ void __cold btrfs_bioset_exit(void)
 {
 	mempool_exit(&btrfs_failed_bio_pool);
 	bioset_exit(&btrfs_repair_bioset);
+	bioset_exit(&btrfs_clone_bioset);
 	bioset_exit(&btrfs_bioset);
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 8b248c9bd602b..97877184d0db1 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -386,6 +386,7 @@ struct btrfs_bio {
 
 	/* For internal use in read end I/O handling */
 	unsigned int mirror_num;
+	atomic_t pending_ios;
 	struct work_struct end_io_work;
 
 	/*
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 08/17] btrfs: pass the iomap bio to btrfs_submit_bio
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (6 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-07 21:00   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 09/17] btrfs: remove stripe boundary calculation for buffered I/O Christoph Hellwig
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

Now that btrfs_submit_bio splits the bio when crossing stripe boundaries,
there is no need for the higher level code to do that manually.

For direct I/O this is really helpful, as the iomap ->submit_io handler
can now simply take the bio allocated by iomap and send it on to
btrfs_submit_bio instead of allocating clones.

For that to work, the bio embedded into struct btrfs_dio_private needs to
become a full btrfs_bio as expected by btrfs_submit_bio.

With this change there is a single work item to offload the entire
iomap bio, so the heuristic to skip async processing for bios that
were split isn't needed anymore either.
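
A sketch of the layout this relies on (reduced from the inode.c hunk
below):

    struct btrfs_dio_private {
            /* Range of I/O */
            u64 file_offset;
            u32 bytes;

            /* Must be last: the bioset front pad ends at bbio.bio. */
            struct btrfs_bio bbio;
    };

    /* Any completion can get back to the dip without extra refcounting: */
    struct btrfs_dio_private *dip =
            container_of(bbio, struct btrfs_dio_private, bbio);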

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c   | 159 +++++++++------------------------------------
 fs/btrfs/volumes.c |  21 +-----
 fs/btrfs/volumes.h |   7 +-
 3 files changed, 37 insertions(+), 150 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 03953c1f176dd..833ea647f7887 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -69,24 +69,12 @@ struct btrfs_dio_data {
 };
 
 struct btrfs_dio_private {
-	struct inode *inode;
-
-	/*
-	 * Since DIO can use anonymous page, we cannot use page_offset() to
-	 * grab the file offset, thus need a dedicated member for file offset.
-	 */
+	/* Range of I/O */
 	u64 file_offset;
-	/* Used for bio::bi_size */
 	u32 bytes;
 
-	/*
-	 * References to this structure. There is one reference per in-flight
-	 * bio plus one while we're still setting up.
-	 */
-	refcount_t refs;
-
 	/* This must be last */
-	struct bio bio;
+	struct btrfs_bio bbio;
 };
 
 static struct bio_set btrfs_dio_bioset;
@@ -7815,130 +7803,47 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 	return ret;
 }
 
-static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
-{
-	/*
-	 * This implies a barrier so that stores to dio_bio->bi_status before
-	 * this and loads of dio_bio->bi_status after this are fully ordered.
-	 */
-	if (!refcount_dec_and_test(&dip->refs))
-		return;
-
-	if (btrfs_op(&dip->bio) == BTRFS_MAP_WRITE) {
-		btrfs_mark_ordered_io_finished(BTRFS_I(dip->inode), NULL,
-					       dip->file_offset, dip->bytes,
-					       !dip->bio.bi_status);
-	} else {
-		unlock_extent(&BTRFS_I(dip->inode)->io_tree,
-			      dip->file_offset,
-			      dip->file_offset + dip->bytes - 1);
-	}
-
-	bio_endio(&dip->bio);
-}
-
-static void btrfs_end_dio_bio(struct btrfs_bio *bbio)
+static void btrfs_dio_end_io(struct btrfs_bio *bbio)
 {
-	struct btrfs_dio_private *dip = bbio->private;
+	struct btrfs_dio_private *dip =
+		container_of(bbio, struct btrfs_dio_private, bbio);
+	struct btrfs_inode *bi = BTRFS_I(bbio->inode);
 	struct bio *bio = &bbio->bio;
-	blk_status_t err = bio->bi_status;
 
-	if (err) {
-		btrfs_warn(BTRFS_I(dip->inode)->root->fs_info,
-			   "direct IO failed ino %llu rw %d,%u sector %#Lx len %u err no %d",
-			   btrfs_ino(BTRFS_I(dip->inode)), bio_op(bio),
-			   bio->bi_opf, bio->bi_iter.bi_sector,
-			   bio->bi_iter.bi_size, err);
-		dip->bio.bi_status = err;
+	if (bio->bi_status) {
+		btrfs_warn(bi->root->fs_info,
+			   "direct IO failed ino %llu op 0x%0x offset %#llx len %u err no %d",
+			   btrfs_ino(bi), bio->bi_opf,
+			   dip->file_offset, dip->bytes, bio->bi_status);
 	}
 
-	bio_put(bio);
-	btrfs_dio_private_put(dip);
+	if (btrfs_op(bio) == BTRFS_MAP_WRITE)
+		btrfs_mark_ordered_io_finished(bi, NULL, dip->file_offset,
+					       dip->bytes, !bio->bi_status);
+	else
+		unlock_extent(&bi->io_tree, dip->file_offset,
+			      dip->file_offset + dip->bytes - 1);
+
+	bbio->bio.bi_private = bbio->private;
+	iomap_dio_bio_end_io(bio);
 }
 
-static void btrfs_submit_direct(const struct iomap_iter *iter,
-		struct bio *dio_bio, loff_t file_offset)
+static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
+				loff_t file_offset)
 {
+	struct btrfs_bio *bbio = btrfs_bio(bio);
 	struct btrfs_dio_private *dip =
-		container_of(dio_bio, struct btrfs_dio_private, bio);
-	struct inode *inode = iter->inode;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct bio *bio;
-	u64 start_sector;
-	u64 submit_len;
-	u64 clone_offset = 0;
-	u64 clone_len;
-	u64 logical;
-	int ret;
-	blk_status_t status;
-	struct btrfs_io_geometry geom;
+		container_of(bbio, struct btrfs_dio_private, bbio);
 	struct btrfs_dio_data *dio_data = iter->private;
-	struct extent_map *em = NULL;
-
-	dip->inode = inode;
-	dip->file_offset = file_offset;
-	dip->bytes = dio_bio->bi_iter.bi_size;
-	refcount_set(&dip->refs, 1);
 
-	start_sector = dio_bio->bi_iter.bi_sector;
-	submit_len = dio_bio->bi_iter.bi_size;
+	btrfs_bio_init(bbio, iter->inode, btrfs_dio_end_io, bio->bi_private);
+	bbio->file_offset = file_offset;
 
-	do {
-		logical = start_sector << 9;
-		em = btrfs_get_chunk_map(fs_info, logical, submit_len);
-		if (IS_ERR(em)) {
-			status = errno_to_blk_status(PTR_ERR(em));
-			em = NULL;
-			goto out_err_em;
-		}
-		ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio),
-					    logical, &geom);
-		if (ret) {
-			status = errno_to_blk_status(ret);
-			goto out_err_em;
-		}
-
-		clone_len = min(submit_len, geom.len);
-		ASSERT(clone_len <= UINT_MAX);
-
-		/*
-		 * This will never fail as it's passing GPF_NOFS and
-		 * the allocation is backed by btrfs_bioset.
-		 */
-		bio = btrfs_bio_clone_partial(dio_bio, clone_offset, clone_len,
-					      inode, btrfs_end_dio_bio, dip);
-		btrfs_bio(bio)->file_offset = file_offset;
-
-		ASSERT(submit_len >= clone_len);
-		submit_len -= clone_len;
-
-		/*
-		 * Increase the count before we submit the bio so we know
-		 * the end IO handler won't happen before we increase the
-		 * count. Otherwise, the dip might get freed before we're
-		 * done setting it up.
-		 *
-		 * We transfer the initial reference to the last bio, so we
-		 * don't need to increment the reference count for the last one.
-		 */
-		if (submit_len > 0)
-			refcount_inc(&dip->refs);
-
-		btrfs_submit_bio(fs_info, bio, 0);
-
-		dio_data->submitted += clone_len;
-		clone_offset += clone_len;
-		start_sector += clone_len >> 9;
-		file_offset += clone_len;
-
-		free_extent_map(em);
-	} while (submit_len > 0);
-	return;
+	dip->file_offset = file_offset;
+	dip->bytes = bio->bi_iter.bi_size;
 
-out_err_em:
-	free_extent_map(em);
-	dio_bio->bi_status = status;
-	btrfs_dio_private_put(dip);
+	dio_data->submitted += bio->bi_iter.bi_size;
+	btrfs_submit_bio(btrfs_sb(iter->inode->i_sb), bio, 0);
 }
 
 static const struct iomap_ops btrfs_dio_iomap_ops = {
@@ -7947,7 +7852,7 @@ static const struct iomap_ops btrfs_dio_iomap_ops = {
 };
 
 static const struct iomap_dio_ops btrfs_dio_ops = {
-	.submit_io		= btrfs_submit_direct,
+	.submit_io		= btrfs_dio_submit_io,
 	.bio_set		= &btrfs_dio_bioset,
 };
 
@@ -8788,7 +8693,7 @@ int __init btrfs_init_cachep(void)
 		goto fail;
 
 	if (bioset_init(&btrfs_dio_bioset, BIO_POOL_SIZE,
-			offsetof(struct btrfs_dio_private, bio),
+			offsetof(struct btrfs_dio_private, bbio.bio),
 			BIOSET_NEED_BVECS))
 		goto fail;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0a2d144c20604..dba8e53101ed9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6655,8 +6655,8 @@ int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
  * Initialize a btrfs_bio structure.  This skips the embedded bio itself as it
  * is already initialized by the block layer.
  */
-static void btrfs_bio_init(struct btrfs_bio *bbio, struct inode *inode,
-			   btrfs_bio_end_io_t end_io, void *private)
+void btrfs_bio_init(struct btrfs_bio *bbio, struct inode *inode,
+		    btrfs_bio_end_io_t end_io, void *private)
 {
 	memset(bbio, 0, offsetof(struct btrfs_bio, bio));
 	bbio->inode = inode;
@@ -6683,23 +6683,6 @@ struct bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
 	return bio;
 }
 
-struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
-				    struct inode *inode,
-				    btrfs_bio_end_io_t end_io, void *private)
-{
-	struct bio *bio;
-	struct btrfs_bio *bbio;
-
-	ASSERT(offset <= UINT_MAX && size <= UINT_MAX);
-
-	bio = bio_alloc_clone(orig->bi_bdev, orig, GFP_NOFS, &btrfs_bioset);
-	bbio = btrfs_bio(bio);
-	btrfs_bio_init(bbio, inode, end_io, private);
-
-	bio_trim(bio, offset >> 9, size >> 9);
-	return bio;
-}
-
 static struct bio *btrfs_split_bio(struct bio *orig, u64 map_length)
 {
 	struct btrfs_bio *orig_bbio = btrfs_bio(orig);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 97877184d0db1..82bbc0aa7081d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -404,12 +404,11 @@ static inline struct btrfs_bio *btrfs_bio(struct bio *bio)
 int __init btrfs_bioset_init(void);
 void __cold btrfs_bioset_exit(void);
 
-struct bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
+void btrfs_bio_init(struct btrfs_bio *bbio, struct inode *inode,
+		    btrfs_bio_end_io_t end_io, void *private);
+struct bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
 			    struct inode *inode, btrfs_bio_end_io_t end_io,
 			    void *private);
-struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
-				    struct inode *inode,
-				    btrfs_bio_end_io_t end_io, void *private);
 
 static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
 {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 09/17] btrfs: remove stripe boundary calculation for buffered I/O
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (7 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 08/17] btrfs: pass the iomap bio to btrfs_submit_bio Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-07 21:04   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 10/17] btrfs: remove stripe boundary calculation for compressed I/O Christoph Hellwig
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

From: Qu Wenruo <wqu@suse.com>

Remove btrfs_bio_ctrl::len_to_stripe_boundary, so that buffered I/O
will no longer limit its bio size according to the stripe length now
that btrfs_submit_bio can split bios at stripe boundaries.
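
For clarity, the only clamping that remains in the buffered path is
the ordered extent boundary, and only for zone append writes
(condensed from the extent_io.c hunks below):

    if (bio_ctrl->compress_type == BTRFS_COMPRESS_NONE &&
        bio_op(bio_ctrl->bio) == REQ_OP_ZONE_APPEND && ordered)
            bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
                            ordered->disk_bytenr +
                            ordered->disk_num_bytes - logical);
    else
            bio_ctrl->len_to_oe_boundary = U32_MAX;  /* no limit here */

    /* ... which is then the only length cap when adding pages: */
    real_size = min(bio_ctrl->len_to_oe_boundary - bio_size, size);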

Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: simplify calc_bio_boundaries a little more]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 71 ++++++++++++--------------------------------
 1 file changed, 19 insertions(+), 52 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4c00bdefe5b45..46a3f0e33fb69 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -145,7 +145,6 @@ struct btrfs_bio_ctrl {
 	struct bio *bio;
 	int mirror_num;
 	enum btrfs_compression_type compress_type;
-	u32 len_to_stripe_boundary;
 	u32 len_to_oe_boundary;
 };
 
@@ -2601,7 +2600,7 @@ static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 
 	ASSERT(bio);
 	/* The limit should be calculated when bio_ctrl->bio is allocated */
-	ASSERT(bio_ctrl->len_to_oe_boundary && bio_ctrl->len_to_stripe_boundary);
+	ASSERT(bio_ctrl->len_to_oe_boundary);
 	if (bio_ctrl->compress_type != compress_type)
 		return 0;
 
@@ -2637,9 +2636,7 @@ static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	if (!contig)
 		return 0;
 
-	real_size = min(bio_ctrl->len_to_oe_boundary,
-			bio_ctrl->len_to_stripe_boundary) - bio_size;
-	real_size = min(real_size, size);
+	real_size = min(bio_ctrl->len_to_oe_boundary - bio_size, size);
 
 	/*
 	 * If real_size is 0, never call bio_add_*_page(), as even size is 0,
@@ -2656,58 +2653,30 @@ static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	return ret;
 }
 
-static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
-			       struct btrfs_inode *inode, u64 file_offset)
+static void calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
+			        struct btrfs_inode *inode, u64 file_offset)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_io_geometry geom;
 	struct btrfs_ordered_extent *ordered;
-	struct extent_map *em;
 	u64 logical = (bio_ctrl->bio->bi_iter.bi_sector << SECTOR_SHIFT);
-	int ret;
 
 	/*
-	 * Pages for compressed extent are never submitted to disk directly,
-	 * thus it has no real boundary, just set them to U32_MAX.
-	 *
-	 * The split happens for real compressed bio, which happens in
-	 * btrfs_submit_compressed_read/write().
+	 * Limit the bio to the ordered extent boundary for zone append
+	 * writes.  Compressed bios aren't submitted directly, so this
+	 * doesn't apply to them.
 	 */
-	if (bio_ctrl->compress_type != BTRFS_COMPRESS_NONE) {
-		bio_ctrl->len_to_oe_boundary = U32_MAX;
-		bio_ctrl->len_to_stripe_boundary = U32_MAX;
-		return 0;
-	}
-	em = btrfs_get_chunk_map(fs_info, logical, fs_info->sectorsize);
-	if (IS_ERR(em))
-		return PTR_ERR(em);
-	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio_ctrl->bio),
-				    logical, &geom);
-	free_extent_map(em);
-	if (ret < 0) {
-		return ret;
-	}
-	if (geom.len > U32_MAX)
-		bio_ctrl->len_to_stripe_boundary = U32_MAX;
-	else
-		bio_ctrl->len_to_stripe_boundary = (u32)geom.len;
-
-	if (bio_op(bio_ctrl->bio) != REQ_OP_ZONE_APPEND) {
-		bio_ctrl->len_to_oe_boundary = U32_MAX;
-		return 0;
-	}
-
-	/* Ordered extent not yet created, so we're good */
-	ordered = btrfs_lookup_ordered_extent(inode, file_offset);
-	if (!ordered) {
-		bio_ctrl->len_to_oe_boundary = U32_MAX;
-		return 0;
+	if (bio_ctrl->compress_type == BTRFS_COMPRESS_NONE &&
+	    bio_op(bio_ctrl->bio) == REQ_OP_ZONE_APPEND) {
+		ordered = btrfs_lookup_ordered_extent(inode, file_offset);
+		if (ordered) {
+			bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
+					ordered->disk_bytenr +
+					ordered->disk_num_bytes - logical);
+			btrfs_put_ordered_extent(ordered);
+			return;
+		}
 	}
 
-	bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
-		ordered->disk_bytenr + ordered->disk_num_bytes - logical);
-	btrfs_put_ordered_extent(ordered);
-	return 0;
+	bio_ctrl->len_to_oe_boundary = U32_MAX;
 }
 
 static int alloc_new_bio(struct btrfs_inode *inode,
@@ -2734,9 +2703,7 @@ static int alloc_new_bio(struct btrfs_inode *inode,
 		bio->bi_iter.bi_sector = (disk_bytenr + offset) >> SECTOR_SHIFT;
 	bio_ctrl->bio = bio;
 	bio_ctrl->compress_type = compress_type;
-	ret = calc_bio_boundaries(bio_ctrl, inode, file_offset);
-	if (ret < 0)
-		goto error;
+	calc_bio_boundaries(bio_ctrl, inode, file_offset);
 
 	if (wbc) {
 		/*
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 10/17] btrfs: remove stripe boundary calculation for compressed I/O
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (8 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 09/17] btrfs: remove stripe boundary calculation for buffered I/O Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01  9:56   ` Johannes Thumshirn
  2022-09-07 21:07   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 11/17] btrfs: remove stripe boundary calculation for encoded I/O Christoph Hellwig
                   ` (9 subsequent siblings)
  19 siblings, 2 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

From: Qu Wenruo <wqu@suse.com>

Stop looking at the stripe boundary in alloc_compressed_bio() now that
btrfs_submit_bio can split bios.  Open code the now trivial logic from
alloc_compressed_bio() in btrfs_submit_compressed_read and stop
maintaining the pending_ios count for reads, as there is always just a
single bio now.
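
Reduced to its core, the read side now builds and submits exactly one
bio (allocation and error handling elided; fill_one_page() is a
stand-in for the bio_add_page() bookkeeping in the loop):

    comp_bio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, cb->inode,
                               end_compressed_bio_read, cb);
    comp_bio->bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;

    while (cur_disk_byte < disk_bytenr + compressed_len)
            cur_disk_byte += fill_one_page(comp_bio, cb, cur_disk_byte);

    /* One bio, one submission - no pending_ios counting for reads. */
    btrfs_submit_bio(fs_info, comp_bio, mirror_num);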

Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: remove more cruft in btrfs_submit_compressed_read]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/compression.c | 131 +++++++++++------------------------------
 1 file changed, 34 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 1f10f86e70557..5e8b75b030ace 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -136,12 +136,15 @@ static int compression_decompress(int type, struct list_head *ws,
 
 static int btrfs_decompress_bio(struct compressed_bio *cb);
 
-static void finish_compressed_bio_read(struct compressed_bio *cb)
+static void end_compressed_bio_read(struct btrfs_bio *bbio)
 {
+	struct compressed_bio *cb = bbio->private;
 	unsigned int index;
 	struct page *page;
 
-	if (cb->status == BLK_STS_OK)
+	if (bbio->bio.bi_status)
+		cb->status = bbio->bio.bi_status;
+	else
 		cb->status = errno_to_blk_status(btrfs_decompress_bio(cb));
 
 	/* Release the compressed pages */
@@ -157,17 +160,6 @@ static void finish_compressed_bio_read(struct compressed_bio *cb)
 	/* Finally free the cb struct */
 	kfree(cb->compressed_pages);
 	kfree(cb);
-}
-
-static void end_compressed_bio_read(struct btrfs_bio *bbio)
-{
-	struct compressed_bio *cb = bbio->private;
-
-	if (bbio->bio.bi_status)
-		cb->status = bbio->bio.bi_status;
-
-	if (refcount_dec_and_test(&cb->pending_ios))
-		finish_compressed_bio_read(cb);
 	bio_put(&bbio->bio);
 }
 
@@ -286,42 +278,30 @@ static void end_compressed_bio_write(struct btrfs_bio *bbio)
  *                      from or written to.
  * @endio_func:         The endio function to call after the IO for compressed data
  *                      is finished.
- * @next_stripe_start:  Return value of logical bytenr of where next stripe starts.
- *                      Let the caller know to only fill the bio up to the stripe
- *                      boundary.
  */
-
-
 static struct bio *alloc_compressed_bio(struct compressed_bio *cb, u64 disk_bytenr,
 					blk_opf_t opf,
-					btrfs_bio_end_io_t endio_func,
-					u64 *next_stripe_start)
+					btrfs_bio_end_io_t endio_func)
 {
-	struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
-	struct btrfs_io_geometry geom;
-	struct extent_map *em;
 	struct bio *bio;
-	int ret;
 
 	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, cb->inode, endio_func, cb);
 	bio->bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
 
-	em = btrfs_get_chunk_map(fs_info, disk_bytenr, fs_info->sectorsize);
-	if (IS_ERR(em)) {
-		bio_put(bio);
-		return ERR_CAST(em);
-	}
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
+		struct extent_map *em;
 
-	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
-		bio_set_dev(bio, em->map_lookup->stripes[0].dev->bdev);
+		em = btrfs_get_chunk_map(fs_info, disk_bytenr,
+					 fs_info->sectorsize);
+		if (IS_ERR(em)) {
+			bio_put(bio);
+			return ERR_CAST(em);
+		}
 
-	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), disk_bytenr, &geom);
-	free_extent_map(em);
-	if (ret < 0) {
-		bio_put(bio);
-		return ERR_PTR(ret);
+		bio_set_dev(bio, em->map_lookup->stripes[0].dev->bdev);
+		free_extent_map(em);
 	}
-	*next_stripe_start = disk_bytenr + geom.len;
 	refcount_inc(&cb->pending_ios);
 	return bio;
 }
@@ -348,7 +328,6 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	struct bio *bio = NULL;
 	struct compressed_bio *cb;
 	u64 cur_disk_bytenr = disk_start;
-	u64 next_stripe_start;
 	blk_status_t ret = BLK_STS_OK;
 	const bool use_append = btrfs_use_zone_append(inode, disk_start);
 	const enum req_op bio_op = REQ_BTRFS_ONE_ORDERED |
@@ -384,8 +363,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 		/* Allocate new bio if submitted or not yet allocated */
 		if (!bio) {
 			bio = alloc_compressed_bio(cb, cur_disk_bytenr,
-				bio_op | write_flags, end_compressed_bio_write,
-				&next_stripe_start);
+				bio_op | write_flags, end_compressed_bio_write);
 			if (IS_ERR(bio)) {
 				ret = errno_to_blk_status(PTR_ERR(bio));
 				break;
@@ -393,20 +371,12 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			if (blkcg_css)
 				bio->bi_opf |= REQ_CGROUP_PUNT;
 		}
-		/*
-		 * We should never reach next_stripe_start start as we will
-		 * submit comp_bio when reach the boundary immediately.
-		 */
-		ASSERT(cur_disk_bytenr != next_stripe_start);
-
 		/*
 		 * We have various limits on the real read size:
-		 * - stripe boundary
 		 * - page boundary
 		 * - compressed length boundary
 		 */
-		real_size = min_t(u64, U32_MAX, next_stripe_start - cur_disk_bytenr);
-		real_size = min_t(u64, real_size, PAGE_SIZE - offset_in_page(offset));
+		real_size = min_t(u64, U32_MAX, PAGE_SIZE - offset_in_page(offset));
 		real_size = min_t(u64, real_size, compressed_len - offset);
 		ASSERT(IS_ALIGNED(real_size, fs_info->sectorsize));
 
@@ -421,9 +391,6 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			submit = true;
 
 		cur_disk_bytenr += added;
-		/* Reached stripe boundary */
-		if (cur_disk_bytenr == next_stripe_start)
-			submit = true;
 
 		/* Finished the range */
 		if (cur_disk_bytenr == disk_start + compressed_len)
@@ -613,10 +580,9 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	struct extent_map_tree *em_tree;
 	struct compressed_bio *cb;
 	unsigned int compressed_len;
-	struct bio *comp_bio = NULL;
+	struct bio *comp_bio;
 	const u64 disk_bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 cur_disk_byte = disk_bytenr;
-	u64 next_stripe_start;
 	u64 file_offset;
 	u64 em_len;
 	u64 em_start;
@@ -681,37 +647,23 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	/* include any pages we added in add_ra-bio_pages */
 	cb->len = bio->bi_iter.bi_size;
 
+	comp_bio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, cb->inode,
+				   end_compressed_bio_read, cb);
+	comp_bio->bi_iter.bi_sector = cur_disk_byte >> SECTOR_SHIFT;
+
 	while (cur_disk_byte < disk_bytenr + compressed_len) {
 		u64 offset = cur_disk_byte - disk_bytenr;
 		unsigned int index = offset >> PAGE_SHIFT;
 		unsigned int real_size;
 		unsigned int added;
 		struct page *page = cb->compressed_pages[index];
-		bool submit = false;
 
-		/* Allocate new bio if submitted or not yet allocated */
-		if (!comp_bio) {
-			comp_bio = alloc_compressed_bio(cb, cur_disk_byte,
-					REQ_OP_READ, end_compressed_bio_read,
-					&next_stripe_start);
-			if (IS_ERR(comp_bio)) {
-				cb->status = errno_to_blk_status(PTR_ERR(comp_bio));
-				break;
-			}
-		}
-		/*
-		 * We should never reach next_stripe_start start as we will
-		 * submit comp_bio when reach the boundary immediately.
-		 */
-		ASSERT(cur_disk_byte != next_stripe_start);
 		/*
 		 * We have various limit on the real read size:
-		 * - stripe boundary
 		 * - page boundary
 		 * - compressed length boundary
 		 */
-		real_size = min_t(u64, U32_MAX, next_stripe_start - cur_disk_byte);
-		real_size = min_t(u64, real_size, PAGE_SIZE - offset_in_page(offset));
+		real_size = min_t(u64, U32_MAX, PAGE_SIZE - offset_in_page(offset));
 		real_size = min_t(u64, real_size, compressed_len - offset);
 		ASSERT(IS_ALIGNED(real_size, fs_info->sectorsize));
 
@@ -722,32 +674,17 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		 */
 		ASSERT(added == real_size);
 		cur_disk_byte += added;
-
-		/* Reached stripe boundary, need to submit */
-		if (cur_disk_byte == next_stripe_start)
-			submit = true;
-
-		/* Has finished the range, need to submit */
-		if (cur_disk_byte == disk_bytenr + compressed_len)
-			submit = true;
-
-		if (submit) {
-			/*
-			 * Save the initial offset of this chunk, as there
-			 * is no direct correlation between compressed pages and
-			 * the original file offset.  The field is only used for
-			 * priting error messages.
-			 */
-			btrfs_bio(comp_bio)->file_offset = file_offset;
-
-			ASSERT(comp_bio->bi_iter.bi_size);
-			btrfs_submit_bio(fs_info, comp_bio, mirror_num);
-			comp_bio = NULL;
-		}
 	}
 
-	if (refcount_dec_and_test(&cb->pending_ios))
-		finish_compressed_bio_read(cb);
+	/*
+	 * Just stash the initial offset of this chunk, as there is no direct
+	 * correlation between compressed pages and the original file offset.
+	 * The field is only used for printing error messages anyway.
+	 */
+	btrfs_bio(comp_bio)->file_offset = file_offset;
+
+	ASSERT(comp_bio->bi_iter.bi_size);
+	btrfs_submit_bio(fs_info, comp_bio, mirror_num);
 	return;
 
 fail:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 11/17] btrfs: remove stripe boundary calculation for encoded I/O
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (9 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 10/17] btrfs: remove stripe boundary calculation for compressed I/O Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01  9:58   ` Johannes Thumshirn
  2022-09-07 21:08   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 12/17] btrfs: remove struct btrfs_io_geometry Christoph Hellwig
                   ` (8 subsequent siblings)
  19 siblings, 2 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

From: Qu Wenruo <wqu@suse.com>

Stop looking at the stripe boundary in
btrfs_encoded_read_regular_fill_pages() now that btrfs_submit_bio can
split bios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 23 ++---------------------
 1 file changed, 2 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 833ea647f7887..399381a4f8e69 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10025,7 +10025,6 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 					  u64 file_offset, u64 disk_bytenr,
 					  u64 disk_io_size, struct page **pages)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_encoded_read_private priv = {
 		.inode = inode,
 		.file_offset = file_offset,
@@ -10033,33 +10032,15 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 	};
 	unsigned long i = 0;
 	u64 cur = 0;
-	int ret;
 
 	init_waitqueue_head(&priv.wait);
 	/*
-	 * Submit bios for the extent, splitting due to bio or stripe limits as
-	 * necessary.
+	 * Submit bios for the extent, splitting due to bio limits as necessary.
 	 */
 	while (cur < disk_io_size) {
-		struct extent_map *em;
-		struct btrfs_io_geometry geom;
 		struct bio *bio = NULL;
-		u64 remaining;
+		u64 remaining = disk_io_size - cur;
 
-		em = btrfs_get_chunk_map(fs_info, disk_bytenr + cur,
-					 disk_io_size - cur);
-		if (IS_ERR(em)) {
-			ret = PTR_ERR(em);
-		} else {
-			ret = btrfs_get_io_geometry(fs_info, em, BTRFS_MAP_READ,
-						    disk_bytenr + cur, &geom);
-			free_extent_map(em);
-		}
-		if (ret) {
-			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
-			break;
-		}
-		remaining = min(geom.len, disk_io_size - cur);
 		while (bio || remaining) {
 			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 12/17] btrfs: remove struct btrfs_io_geometry
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (10 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 11/17] btrfs: remove stripe boundary calculation for encoded I/O Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-07 21:10   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 13/17] btrfs: remove submit_encoded_read_bio Christoph Hellwig
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

Now that btrfs_get_io_geometry has a single caller, we can massage it
into a form that is more suitable for that caller and remove the
marshalling into and out of struct btrfs_io_geometry.
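
As a worked example of the arithmetic the new btrfs_max_io_len()
helper performs, assume a 64K stripe_len, a non-RAID56 striped profile
and an offset of 200K into the chunk:

    *stripe_nr = div64_u64_rem(offset, stripe_len, stripe_offset);
    /* -> stripe_nr = 200K / 64K = 3, stripe_offset = 200K % 64K = 8K */

    /* Striped profiles must not cross a stripe, so the I/O is capped: */
    return stripe_len - *stripe_offset;     /* 64K - 8K = 56K */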

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/volumes.c | 115 +++++++++++++--------------------------------
 fs/btrfs/volumes.h |  18 -------
 2 files changed, 32 insertions(+), 101 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dba8e53101ed9..e497b63238189 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6269,91 +6269,43 @@ static bool need_full_stripe(enum btrfs_map_op op)
 	return (op == BTRFS_MAP_WRITE || op == BTRFS_MAP_GET_READ_MIRRORS);
 }
 
-/*
- * Calculate the geometry of a particular (address, len) tuple. This
- * information is used to calculate how big a particular bio can get before it
- * straddles a stripe.
- *
- * @fs_info: the filesystem
- * @em:      mapping containing the logical extent
- * @op:      type of operation - write or read
- * @logical: address that we want to figure out the geometry of
- * @io_geom: pointer used to return values
- *
- * Returns < 0 in case a chunk for the given logical address cannot be found,
- * usually shouldn't happen unless @logical is corrupted, 0 otherwise.
- */
-int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *em,
-			  enum btrfs_map_op op, u64 logical,
-			  struct btrfs_io_geometry *io_geom)
+static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
+			    u64 offset, u64 *stripe_nr, u64 *stripe_offset,
+			    u64 *full_stripe_start)
 {
-	struct map_lookup *map;
-	u64 len;
-	u64 offset;
-	u64 stripe_offset;
-	u64 stripe_nr;
-	u32 stripe_len;
-	u64 raid56_full_stripe_start = (u64)-1;
-	int data_stripes;
+	u32 stripe_len = map->stripe_len;
 
 	ASSERT(op != BTRFS_MAP_DISCARD);
 
-	map = em->map_lookup;
-	/* Offset of this logical address in the chunk */
-	offset = logical - em->start;
-	/* Len of a stripe in a chunk */
-	stripe_len = map->stripe_len;
 	/*
-	 * Stripe_nr is where this block falls in
-	 * stripe_offset is the offset of this block in its stripe.
+	 * Stripe_nr is the stripe where this block falls.
+	 * Stripe_offset is the offset of this block in its stripe.
 	 */
-	stripe_nr = div64_u64_rem(offset, stripe_len, &stripe_offset);
-	ASSERT(stripe_offset < U32_MAX);
+	*stripe_nr = div64_u64_rem(offset, stripe_len, stripe_offset);
+	ASSERT(*stripe_offset < U32_MAX);
 
-	data_stripes = nr_data_stripes(map);
+	if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+		unsigned long full_stripe_len =
+			stripe_len * nr_data_stripes(map);
 
-	/* Only stripe based profiles needs to check against stripe length. */
-	if (map->type & BTRFS_BLOCK_GROUP_STRIPE_MASK) {
-		u64 max_len = stripe_len - stripe_offset;
+		*full_stripe_start =
+			div64_u64(offset, full_stripe_len) * full_stripe_len;
 
 		/*
-		 * In case of raid56, we need to know the stripe aligned start
+		 * For writes to RAID[56], allow to write a full stripe set, but
+		 * no straddling of stripe sets.
 		 */
-		if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-			unsigned long full_stripe_len = stripe_len * data_stripes;
-			raid56_full_stripe_start = offset;
-
-			/*
-			 * Allow a write of a full stripe, but make sure we
-			 * don't allow straddling of stripes
-			 */
-			raid56_full_stripe_start = div64_u64(raid56_full_stripe_start,
-					full_stripe_len);
-			raid56_full_stripe_start *= full_stripe_len;
-
-			/*
-			 * For writes to RAID[56], allow a full stripeset across
-			 * all disks. For other RAID types and for RAID[56]
-			 * reads, just allow a single stripe (on a single disk).
-			 */
-			if (op == BTRFS_MAP_WRITE) {
-				max_len = stripe_len * data_stripes -
-					  (offset - raid56_full_stripe_start);
-			}
-		}
-		len = min_t(u64, em->len - offset, max_len);
-	} else {
-		len = em->len - offset;
+		if (op == BTRFS_MAP_WRITE)
+			return full_stripe_len - (offset - *full_stripe_start);
 	}
 
-	io_geom->len = len;
-	io_geom->offset = offset;
-	io_geom->stripe_len = stripe_len;
-	io_geom->stripe_nr = stripe_nr;
-	io_geom->stripe_offset = stripe_offset;
-	io_geom->raid56_stripe_offset = raid56_full_stripe_start;
-
-	return 0;
+	/*
+	 * For other RAID types and for RAID[56] reads, just allow a single
+	 * stripe (on a single disk).
+	 */
+	if (map->type & BTRFS_BLOCK_GROUP_STRIPE_MASK)
+		return stripe_len - *stripe_offset;
+	return U64_MAX;
 }
 
 static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
@@ -6372,6 +6324,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 {
 	struct extent_map *em;
 	struct map_lookup *map;
+	u64 map_offset;
 	u64 stripe_offset;
 	u64 stripe_nr;
 	u64 stripe_len;
@@ -6390,7 +6343,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 	int patch_the_first_stripe_for_dev_replace = 0;
 	u64 physical_to_patch_in_first_stripe = 0;
 	u64 raid56_full_stripe_start = (u64)-1;
-	struct btrfs_io_geometry geom;
+	u64 max_len;
 
 	ASSERT(bioc_ret);
 	ASSERT(op != BTRFS_MAP_DISCARD);
@@ -6398,18 +6351,14 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 	em = btrfs_get_chunk_map(fs_info, logical, *length);
 	ASSERT(!IS_ERR(em));
 
-	ret = btrfs_get_io_geometry(fs_info, em, op, logical, &geom);
-	if (ret < 0)
-		return ret;
-
 	map = em->map_lookup;
-
-	*length = geom.len;
-	stripe_len = geom.stripe_len;
-	stripe_nr = geom.stripe_nr;
-	stripe_offset = geom.stripe_offset;
-	raid56_full_stripe_start = geom.raid56_stripe_offset;
 	data_stripes = nr_data_stripes(map);
+	stripe_len = map->stripe_len;
+
+	map_offset = logical - em->start;
+	max_len = btrfs_max_io_len(map, op, map_offset, &stripe_nr,
+				   &stripe_offset, &raid56_full_stripe_start);
+	*length = min_t(u64, em->len - map_offset, max_len);
 
 	down_read(&dev_replace->rwsem);
 	dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 82bbc0aa7081d..3b1fe04ff078e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -51,21 +51,6 @@ enum btrfs_raid_types {
 	BTRFS_NR_RAID_TYPES
 };
 
-struct btrfs_io_geometry {
-	/* remaining bytes before crossing a stripe */
-	u64 len;
-	/* offset of logical address in chunk */
-	u64 offset;
-	/* length of single IO stripe */
-	u32 stripe_len;
-	/* offset of address in stripe */
-	u32 stripe_offset;
-	/* number of stripe where address falls */
-	u64 stripe_nr;
-	/* offset of raid56 stripe into the chunk */
-	u64 raid56_stripe_offset;
-};
-
 /*
  * Use sequence counter to get consistent device stat data on
  * 32-bit processors.
@@ -568,9 +553,6 @@ int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
 					       u64 logical, u64 *length_ret,
 					       u32 *num_stripes);
-int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *map,
-			  enum btrfs_map_op op, u64 logical,
-			  struct btrfs_io_geometry *io_geom);
 int btrfs_read_sys_array(struct btrfs_fs_info *fs_info);
 int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info);
 struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 13/17] btrfs: remove submit_encoded_read_bio
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (11 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 12/17] btrfs: remove struct btrfs_io_geometry Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01 10:02   ` Johannes Thumshirn
  2022-09-07 21:11   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 14/17] btrfs: remove now spurious bio submission helpers Christoph Hellwig
                   ` (6 subsequent siblings)
  19 siblings, 2 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

Just open code the functionality in the only caller and remove the now
superfluous error handling there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 23 +++--------------------
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 399381a4f8e69..25194e75c0812 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9990,17 +9990,6 @@ struct btrfs_encoded_read_private {
 	blk_status_t status;
 };
 
-static blk_status_t submit_encoded_read_bio(struct btrfs_inode *inode,
-					    struct bio *bio, int mirror_num)
-{
-	struct btrfs_encoded_read_private *priv = btrfs_bio(bio)->private;
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-
-	atomic_inc(&priv->pending);
-	btrfs_submit_bio(fs_info, bio, mirror_num);
-	return BLK_STS_OK;
-}
-
 static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
 {
 	struct btrfs_encoded_read_private *priv = bbio->private;
@@ -10025,6 +10014,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 					  u64 file_offset, u64 disk_bytenr,
 					  u64 disk_io_size, struct page **pages)
 {
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_encoded_read_private priv = {
 		.inode = inode,
 		.file_offset = file_offset,
@@ -10055,14 +10045,8 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 
 			if (!bytes ||
 			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
-				blk_status_t status;
-
-				status = submit_encoded_read_bio(inode, bio, 0);
-				if (status) {
-					WRITE_ONCE(priv.status, status);
-					bio_put(bio);
-					goto out;
-				}
+				atomic_inc(&priv.pending);
+				btrfs_submit_bio(fs_info, bio, 0);
 				bio = NULL;
 				continue;
 			}
@@ -10073,7 +10057,6 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 		}
 	}
 
-out:
 	if (atomic_dec_return(&priv.pending))
 		io_wait_event(priv.wait, !atomic_read(&priv.pending));
 	/* See btrfs_encoded_read_endio() for ordering. */
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 14/17] btrfs: remove now spurious bio submission helpers
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (12 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 13/17] btrfs: remove submit_encoded_read_bio Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01 10:14   ` Johannes Thumshirn
  2022-09-07 21:12   ` Josef Bacik
  2022-09-01  7:42 ` [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode Christoph Hellwig
                   ` (5 subsequent siblings)
  19 siblings, 2 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

Just call btrfs_submit_bio and btrfs_submit_compressed_read directly from
submit_one_bio now that all additional functionality has moved into
btrfs_submit_bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ctree.h     |  3 ---
 fs/btrfs/disk-io.c   |  6 ------
 fs/btrfs/disk-io.h   |  1 -
 fs/btrfs/extent_io.c | 11 ++++++-----
 fs/btrfs/inode.c     | 22 ----------------------
 5 files changed, 6 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 33c3c394e43e3..5e57e3c6a1fd6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3372,9 +3372,6 @@ void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_siz
 u64 btrfs_file_extent_end(const struct btrfs_path *path);
 
 /* inode.c */
-void btrfs_submit_data_write_bio(struct inode *inode, struct bio *bio, int mirror_num);
-void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
-			int mirror_num, enum btrfs_compression_type compress_type);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
 struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ceee039b65ea0..014c06c74155f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -648,12 +648,6 @@ int btree_csum_one_bio(struct btrfs_bio *bbio)
 	return ret;
 }
 
-void btrfs_submit_metadata_bio(struct inode *inode, struct bio *bio, int mirror_num)
-{
-	bio->bi_opf |= REQ_META;
-	btrfs_submit_bio(btrfs_sb(inode->i_sb), bio, mirror_num);
-}
-
 #ifdef CONFIG_MIGRATION
 static int btree_migrate_folio(struct address_space *mapping,
 		struct folio *dst, struct folio *src, enum migrate_mode mode)
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 9d4e0e36f7bb9..3a7ef2352c968 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -80,7 +80,6 @@ void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
 int btrfs_validate_metadata_buffer(struct btrfs_bio *bbio,
 				   struct page *page, u64 start, u64 end,
 				   int mirror);
-void btrfs_submit_metadata_bio(struct inode *inode, struct bio *bio, int mirror_num);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct btrfs_root *btrfs_alloc_dummy_root(struct btrfs_fs_info *fs_info);
 #endif
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 46a3f0e33fb69..33e80f8dd0b1b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -198,12 +198,13 @@ static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
 	btrfs_bio(bio)->file_offset = page_offset(bv->bv_page) + bv->bv_offset;
 
 	if (!is_data_inode(inode))
-		btrfs_submit_metadata_bio(inode, bio, mirror_num);
-	else if (btrfs_op(bio) == BTRFS_MAP_WRITE)
-		btrfs_submit_data_write_bio(inode, bio, mirror_num);
+		bio->bi_opf |= REQ_META;
+
+	if (btrfs_op(bio) == BTRFS_MAP_READ &&
+	    bio_ctrl->compress_type != BTRFS_COMPRESS_NONE)
+		btrfs_submit_compressed_read(inode, bio, mirror_num);
 	else
-		btrfs_submit_data_read_bio(inode, bio, mirror_num,
-					   bio_ctrl->compress_type);
+		btrfs_submit_bio(btrfs_sb(inode->i_sb), bio, mirror_num);
 
 	/* The bio is owned by the end_io handler now */
 	bio_ctrl->bio = NULL;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 25194e75c0812..9c562d36e4570 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2662,28 +2662,6 @@ int btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 	return ret;
 }
 
-void btrfs_submit_data_write_bio(struct inode *inode, struct bio *bio, int mirror_num)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-
-	btrfs_submit_bio(fs_info, bio, mirror_num);
-}
-
-void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
-			int mirror_num, enum btrfs_compression_type compress_type)
-{
-	if (compress_type != BTRFS_COMPRESS_NONE) {
-		/*
-		 * btrfs_submit_compressed_read will handle completing the bio
-		 * if there were any errors, so just return here.
-		 */
-		btrfs_submit_compressed_read(inode, bio, mirror_num);
-		return;
-	}
-
-	btrfs_submit_bio(btrfs_sb(inode->i_sb), bio, mirror_num);
-}
-
 /*
  * given a list of ordered sums record them in the inode.  This happens
  * at IO completion time based on sums calculated at bio submission time.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (13 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 14/17] btrfs: remove now spurious bio submission helpers Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01 11:28   ` Johannes Thumshirn
  2022-09-02  1:56   ` Damien Le Moal
  2022-09-01  7:42 ` [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio Christoph Hellwig
                   ` (4 subsequent siblings)
  19 siblings, 2 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

To be able to split a write into properly sized zone append commands,
we need a queue_limits structure that contains the least common
denominator suitable for all devices.
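
A condensed sketch of the stacking this boils down to (the zoned-only
checks and error handling are elided here, see the diff below):

	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
	struct queue_limits *lim = &fs_info->limits;
	struct btrfs_device *device;

	/* Start from the permissive stacking defaults ... */
	blk_set_stacking_limits(lim);

	/* ... and tighten them to the least common denominator. */
	list_for_each_entry(device, &fs_devices->devices, dev_list) {
		if (device->bdev)
			blk_stack_limits(lim,
					 &bdev_get_queue(device->bdev)->limits,
					 0);
	}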

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ctree.h |  4 +++-
 fs/btrfs/zoned.c | 36 ++++++++++++++++++------------------
 fs/btrfs/zoned.h |  1 -
 3 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5e57e3c6a1fd6..a37129363e184 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1071,8 +1071,10 @@ struct btrfs_fs_info {
 	 */
 	u64 zone_size;
 
-	/* Max size to emit ZONE_APPEND write command */
+	/* Constraints for ZONE_APPEND commands: */
+	struct queue_limits limits;
 	u64 max_zone_append_size;
+
 	struct mutex zoned_meta_io_lock;
 	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 2638f71eec4b6..6e04fbbd76b92 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -415,16 +415,6 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 	nr_sectors = bdev_nr_sectors(bdev);
 	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
 	zone_info->nr_zones = nr_sectors >> ilog2(zone_sectors);
-	/*
-	 * We limit max_zone_append_size also by max_segments *
-	 * PAGE_SIZE. Technically, we can have multiple pages per segment. But,
-	 * since btrfs adds the pages one by one to a bio, and btrfs cannot
-	 * increase the metadata reservation even if it increases the number of
-	 * extents, it is safe to stick with the limit.
-	 */
-	zone_info->max_zone_append_size =
-		min_t(u64, (u64)bdev_max_zone_append_sectors(bdev) << SECTOR_SHIFT,
-		      (u64)bdev_max_segments(bdev) << PAGE_SHIFT);
 	if (!IS_ALIGNED(nr_sectors, zone_sectors))
 		zone_info->nr_zones++;
 
@@ -646,14 +636,16 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct queue_limits *lim = &fs_info->limits;
 	struct btrfs_device *device;
 	u64 zoned_devices = 0;
 	u64 nr_devices = 0;
 	u64 zone_size = 0;
-	u64 max_zone_append_size = 0;
 	const bool incompat_zoned = btrfs_fs_incompat(fs_info, ZONED);
 	int ret = 0;
 
+	blk_set_stacking_limits(lim);
+
 	/* Count zoned devices */
 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
 		enum blk_zoned_model model;
@@ -685,11 +677,9 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 				ret = -EINVAL;
 				goto out;
 			}
-			if (!max_zone_append_size ||
-			    (zone_info->max_zone_append_size &&
-			     zone_info->max_zone_append_size < max_zone_append_size))
-				max_zone_append_size =
-					zone_info->max_zone_append_size;
+			blk_stack_limits(lim,
+					 &bdev_get_queue(device->bdev)->limits,
+					 0);
 		}
 		nr_devices++;
 	}
@@ -739,8 +729,18 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	}
 
 	fs_info->zone_size = zone_size;
-	fs_info->max_zone_append_size = ALIGN_DOWN(max_zone_append_size,
-						   fs_info->sectorsize);
+	/*
+	 * Also limit max_zone_append_size by max_segments * PAGE_SIZE.
+	 * Technically, we can have multiple pages per segment. But,
+	 * since btrfs adds the pages one by one to a bio, and btrfs cannot
+	 * increase the metadata reservation even if it increases the number of
+	 * extents, it is safe to stick with the limit.
+	 */
+	fs_info->max_zone_append_size = ALIGN_DOWN(
+		min3((u64)lim->max_zone_append_sectors << SECTOR_SHIFT,
+		     (u64)lim->max_sectors << SECTOR_SHIFT,
+		     (u64)lim->max_segments << PAGE_SHIFT),
+		fs_info->sectorsize);
 	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
 	if (fs_info->max_zone_append_size < fs_info->max_extent_size)
 		fs_info->max_extent_size = fs_info->max_zone_append_size;
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index cafa639927050..0f22b22fe359f 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -19,7 +19,6 @@ struct btrfs_zoned_device_info {
 	 */
 	u64 zone_size;
 	u8  zone_size_shift;
-	u64 max_zone_append_size;
 	u32 nr_zones;
 	unsigned int max_active_zones;
 	atomic_t active_zones_left;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (14 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-02  1:46   ` Damien Le Moal
                     ` (2 more replies)
  2022-09-01  7:42 ` [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND Christoph Hellwig
                   ` (3 subsequent siblings)
  19 siblings, 3 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

The current btrfs zoned device support is a little cumbersome in the data
I/O path as it requires the callers to never issue more I/O than the
ZONE_APPEND size supported by the underlying device.  This leads to a lot
of extra accounting.  Instead change btrfs_submit_bio so that it can take
write bios of arbitrary size and form from the upper layers, and just
split them internally to the ZONE_APPEND queue limits.  Then remove all
the upper layer warts catering to limited write sizes on zoned devices,
including the extra refcount in the compressed_bio.
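
Condensed, the write path in btrfs_submit_chunk then looks roughly like
this (a sketch with error handling omitted, see the volumes.c hunks
below):

	bool use_append = btrfs_use_zone_append(bi, logical);

	map_length = min(map_length, length);
	if (use_append)
		map_length = min(map_length, fs_info->max_zone_append_size);

	/* Split to the smallest boundary first if needed ... */
	if (map_length < length)
		bio = btrfs_split_bio(fs_info, bio, map_length, use_append);

	/* ... and only then switch the op over to ZONE_APPEND. */
	if (btrfs_op(bio) == BTRFS_MAP_WRITE && use_append) {
		bio->bi_opf &= ~REQ_OP_WRITE;
		bio->bi_opf |= REQ_OP_ZONE_APPEND;
	}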

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/compression.c | 112 ++++++++---------------------------------
 fs/btrfs/compression.h |   3 --
 fs/btrfs/extent_io.c   |  74 ++++++---------------------
 fs/btrfs/inode.c       |   4 --
 fs/btrfs/volumes.c     |  40 +++++++++------
 fs/btrfs/zoned.c       |  20 --------
 fs/btrfs/zoned.h       |   9 ----
 7 files changed, 62 insertions(+), 200 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 5e8b75b030ace..f89cac08dc4a4 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -255,57 +255,14 @@ static void btrfs_finish_compressed_write_work(struct work_struct *work)
 static void end_compressed_bio_write(struct btrfs_bio *bbio)
 {
 	struct compressed_bio *cb = bbio->private;
+	struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
 
-	if (bbio->bio.bi_status)
-		cb->status = bbio->bio.bi_status;
-
-	if (refcount_dec_and_test(&cb->pending_ios)) {
-		struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
+	cb->status = bbio->bio.bi_status;
+	queue_work(fs_info->compressed_write_workers, &cb->write_end_work);
 
-		queue_work(fs_info->compressed_write_workers, &cb->write_end_work);
-	}
 	bio_put(&bbio->bio);
 }
 
-/*
- * Allocate a compressed_bio, which will be used to read/write on-disk
- * (aka, compressed) * data.
- *
- * @cb:                 The compressed_bio structure, which records all the needed
- *                      information to bind the compressed data to the uncompressed
- *                      page cache.
- * @disk_byten:         The logical bytenr where the compressed data will be read
- *                      from or written to.
- * @endio_func:         The endio function to call after the IO for compressed data
- *                      is finished.
- */
-static struct bio *alloc_compressed_bio(struct compressed_bio *cb, u64 disk_bytenr,
-					blk_opf_t opf,
-					btrfs_bio_end_io_t endio_func)
-{
-	struct bio *bio;
-
-	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, cb->inode, endio_func, cb);
-	bio->bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
-
-	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-		struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
-		struct extent_map *em;
-
-		em = btrfs_get_chunk_map(fs_info, disk_bytenr,
-					 fs_info->sectorsize);
-		if (IS_ERR(em)) {
-			bio_put(bio);
-			return ERR_CAST(em);
-		}
-
-		bio_set_dev(bio, em->map_lookup->stripes[0].dev->bdev);
-		free_extent_map(em);
-	}
-	refcount_inc(&cb->pending_ios);
-	return bio;
-}
-
 /*
  * worker function to build and submit bios for previously compressed pages.
  * The corresponding pages in the inode should be marked for writeback
@@ -329,16 +286,12 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	struct compressed_bio *cb;
 	u64 cur_disk_bytenr = disk_start;
 	blk_status_t ret = BLK_STS_OK;
-	const bool use_append = btrfs_use_zone_append(inode, disk_start);
-	const enum req_op bio_op = REQ_BTRFS_ONE_ORDERED |
-		(use_append ? REQ_OP_ZONE_APPEND : REQ_OP_WRITE);
 
 	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
 	       IS_ALIGNED(len, fs_info->sectorsize));
 	cb = kmalloc(sizeof(struct compressed_bio), GFP_NOFS);
 	if (!cb)
 		return BLK_STS_RESOURCE;
-	refcount_set(&cb->pending_ios, 1);
 	cb->status = BLK_STS_OK;
 	cb->inode = &inode->vfs_inode;
 	cb->start = start;
@@ -349,8 +302,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	INIT_WORK(&cb->write_end_work, btrfs_finish_compressed_write_work);
 	cb->nr_pages = nr_pages;
 
-	if (blkcg_css)
+	if (blkcg_css) {
 		kthread_associate_blkcg(blkcg_css);
+		write_flags |= REQ_CGROUP_PUNT;
+	}
+
+	write_flags |= REQ_BTRFS_ONE_ORDERED;
+	bio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_WRITE | write_flags,
+			      cb->inode, end_compressed_bio_write, cb);
+	bio->bi_iter.bi_sector = cur_disk_bytenr >> SECTOR_SHIFT;
 
 	while (cur_disk_bytenr < disk_start + compressed_len) {
 		u64 offset = cur_disk_bytenr - disk_start;
@@ -358,19 +318,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 		unsigned int real_size;
 		unsigned int added;
 		struct page *page = compressed_pages[index];
-		bool submit = false;
-
-		/* Allocate new bio if submitted or not yet allocated */
-		if (!bio) {
-			bio = alloc_compressed_bio(cb, cur_disk_bytenr,
-				bio_op | write_flags, end_compressed_bio_write);
-			if (IS_ERR(bio)) {
-				ret = errno_to_blk_status(PTR_ERR(bio));
-				break;
-			}
-			if (blkcg_css)
-				bio->bi_opf |= REQ_CGROUP_PUNT;
-		}
+
 		/*
 		 * We have various limits on the real read size:
 		 * - page boundary
@@ -380,36 +328,21 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 		real_size = min_t(u64, real_size, compressed_len - offset);
 		ASSERT(IS_ALIGNED(real_size, fs_info->sectorsize));
 
-		if (use_append)
-			added = bio_add_zone_append_page(bio, page, real_size,
-					offset_in_page(offset));
-		else
-			added = bio_add_page(bio, page, real_size,
-					offset_in_page(offset));
-		/* Reached zoned boundary */
-		if (added == 0)
-			submit = true;
-
+		added = bio_add_page(bio, page, real_size, offset_in_page(offset));
+		/*
+		 * Maximum compressed extent is smaller than bio size limit,
+		 * thus bio_add_page() should always succeed.
+		 */
+		ASSERT(added == real_size);
 		cur_disk_bytenr += added;
-
-		/* Finished the range */
-		if (cur_disk_bytenr == disk_start + compressed_len)
-			submit = true;
-
-		if (submit) {
-			ASSERT(bio->bi_iter.bi_size);
-			btrfs_bio(bio)->file_offset = start;
-			btrfs_submit_bio(fs_info, bio, 0);
-			bio = NULL;
-		}
-		cond_resched();
 	}
 
+	/* Finished the range */
+	ASSERT(bio->bi_iter.bi_size);
+	btrfs_bio(bio)->file_offset = start;
+	btrfs_submit_bio(fs_info, bio, 0);
 	if (blkcg_css)
 		kthread_associate_blkcg(NULL);
-
-	if (refcount_dec_and_test(&cb->pending_ios))
-		finish_compressed_bio_write(cb);
 	return ret;
 }
 
@@ -613,7 +546,6 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		goto out;
 	}
 
-	refcount_set(&cb->pending_ios, 1);
 	cb->status = BLK_STS_OK;
 	cb->inode = inode;
 
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 1aa02903de697..25876f7a26949 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -30,9 +30,6 @@ static_assert((BTRFS_MAX_COMPRESSED % PAGE_SIZE) == 0);
 #define	BTRFS_ZLIB_DEFAULT_LEVEL		3
 
 struct compressed_bio {
-	/* Number of outstanding bios */
-	refcount_t pending_ios;
-
 	/* Number of compressed pages in the array */
 	unsigned int nr_pages;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 33e80f8dd0b1b..40dadc46e00d8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2597,7 +2597,6 @@ static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	u32 real_size;
 	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
 	bool contig = false;
-	int ret;
 
 	ASSERT(bio);
 	/* The limit should be calculated when bio_ctrl->bio is allocated */
@@ -2646,12 +2645,7 @@ static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	if (real_size == 0)
 		return 0;
 
-	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
-		ret = bio_add_zone_append_page(bio, page, real_size, pg_offset);
-	else
-		ret = bio_add_page(bio, page, real_size, pg_offset);
-
-	return ret;
+	return bio_add_page(bio, page, real_size, pg_offset);
 }
 
 static void calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
@@ -2666,7 +2660,7 @@ static void calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
 	 * to them.
 	 */
 	if (bio_ctrl->compress_type == BTRFS_COMPRESS_NONE &&
-	    bio_op(bio_ctrl->bio) == REQ_OP_ZONE_APPEND) {
+	    btrfs_use_zone_append(inode, logical)) {
 		ordered = btrfs_lookup_ordered_extent(inode, file_offset);
 		if (ordered) {
 			bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
@@ -2680,17 +2674,15 @@ static void calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
 	bio_ctrl->len_to_oe_boundary = U32_MAX;
 }
 
-static int alloc_new_bio(struct btrfs_inode *inode,
-			 struct btrfs_bio_ctrl *bio_ctrl,
-			 struct writeback_control *wbc,
-			 blk_opf_t opf,
-			 btrfs_bio_end_io_t end_io_func,
-			 u64 disk_bytenr, u32 offset, u64 file_offset,
-			 enum btrfs_compression_type compress_type)
+static void alloc_new_bio(struct btrfs_inode *inode,
+			  struct btrfs_bio_ctrl *bio_ctrl,
+			  struct writeback_control *wbc, blk_opf_t opf,
+			  btrfs_bio_end_io_t end_io_func,
+			  u64 disk_bytenr, u32 offset, u64 file_offset,
+			  enum btrfs_compression_type compress_type)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct bio *bio;
-	int ret;
 
 	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, &inode->vfs_inode, end_io_func,
 			      NULL);
@@ -2708,40 +2700,14 @@ static int alloc_new_bio(struct btrfs_inode *inode,
 
 	if (wbc) {
 		/*
-		 * For Zone append we need the correct block_device that we are
-		 * going to write to set in the bio to be able to respect the
-		 * hardware limitation.  Look it up here:
+		 * Pick the last added device to support cgroup writeback.  For
+		 * multi-device file systems this means blk-cgroup policies have
+		 * to always be set on the last added/replaced device.
+		 * This is a bit odd but has been like that for a long time.
 		 */
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			struct btrfs_device *dev;
-
-			dev = btrfs_zoned_get_device(fs_info, disk_bytenr,
-						     fs_info->sectorsize);
-			if (IS_ERR(dev)) {
-				ret = PTR_ERR(dev);
-				goto error;
-			}
-
-			bio_set_dev(bio, dev->bdev);
-		} else {
-			/*
-			 * Otherwise pick the last added device to support
-			 * cgroup writeback.  For multi-device file systems this
-			 * means blk-cgroup policies have to always be set on the
-			 * last added/replaced device.  This is a bit odd but has
-			 * been like that for a long time.
-			 */
-			bio_set_dev(bio, fs_info->fs_devices->latest_dev->bdev);
-		}
+		bio_set_dev(bio, fs_info->fs_devices->latest_dev->bdev);
 		wbc_init_bio(wbc, bio);
-	} else {
-		ASSERT(bio_op(bio) != REQ_OP_ZONE_APPEND);
 	}
-	return 0;
-error:
-	bio_ctrl->bio = NULL;
-	btrfs_bio_end_io(btrfs_bio(bio), errno_to_blk_status(ret));
-	return ret;
 }
 
 /*
@@ -2767,7 +2733,6 @@ static int submit_extent_page(blk_opf_t opf,
 			      enum btrfs_compression_type compress_type,
 			      bool force_bio_submit)
 {
-	int ret = 0;
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
 	unsigned int cur = pg_offset;
 
@@ -2784,12 +2749,9 @@ static int submit_extent_page(blk_opf_t opf,
 
 		/* Allocate new bio if needed */
 		if (!bio_ctrl->bio) {
-			ret = alloc_new_bio(inode, bio_ctrl, wbc, opf,
-					    end_io_func, disk_bytenr, offset,
-					    page_offset(page) + cur,
-					    compress_type);
-			if (ret < 0)
-				return ret;
+			alloc_new_bio(inode, bio_ctrl, wbc, opf, end_io_func,
+				      disk_bytenr, offset,
+				      page_offset(page) + cur, compress_type);
 		}
 		/*
 		 * We must go through btrfs_bio_add_page() to ensure each
@@ -3354,10 +3316,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		 * find_next_dirty_byte() are all exclusive
 		 */
 		iosize = min(min(em_end, end + 1), dirty_range_end) - cur;
-
-		if (btrfs_use_zone_append(inode, em->block_start))
-			op = REQ_OP_ZONE_APPEND;
-
 		free_extent_map(em);
 		em = NULL;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9c562d36e4570..1a0bf381f2437 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7727,10 +7727,6 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	iomap->offset = start;
 	iomap->bdev = fs_info->fs_devices->latest_dev->bdev;
 	iomap->length = len;
-
-	if (write && btrfs_use_zone_append(BTRFS_I(inode), em->block_start))
-		iomap->flags |= IOMAP_F_ZONE_APPEND;
-
 	free_extent_map(em);
 
 	return 0;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e497b63238189..0d828b58cc9c3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6632,13 +6632,22 @@ struct bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
 	return bio;
 }
 
-static struct bio *btrfs_split_bio(struct bio *orig, u64 map_length)
+static struct bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
+				   struct bio *orig, u64 map_length,
+				   bool use_append)
 {
 	struct btrfs_bio *orig_bbio = btrfs_bio(orig);
 	struct bio *bio;
 
-	bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
-			&btrfs_clone_bioset);
+	if (use_append) {
+		unsigned int nr_segs;
+
+		bio = bio_split_rw(orig, &fs_info->limits, &nr_segs,
+				   &btrfs_clone_bioset, map_length);
+	} else {
+		bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
+				&btrfs_clone_bioset);
+	}
 	btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode, NULL, orig_bbio);
 
 	btrfs_bio(bio)->file_offset = orig_bbio->file_offset;
@@ -6970,16 +6979,10 @@ static void btrfs_submit_dev_bio(struct btrfs_device *dev, struct bio *bio)
 	 */
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
 		u64 physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+		u64 zone_start = round_down(physical, dev->fs_info->zone_size);
 
-		if (btrfs_dev_is_sequential(dev, physical)) {
-			u64 zone_start = round_down(physical,
-						    dev->fs_info->zone_size);
-
-			bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
-		} else {
-			bio->bi_opf &= ~REQ_OP_ZONE_APPEND;
-			bio->bi_opf |= REQ_OP_WRITE;
-		}
+		ASSERT(btrfs_dev_is_sequential(dev, physical));
+		bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
 	}
 	btrfs_debug_in_rcu(dev->fs_info,
 	"%s: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
@@ -7179,9 +7182,11 @@ static bool btrfs_submit_chunk(struct btrfs_fs_info *fs_info, struct bio *bio,
 			       int mirror_num)
 {
 	struct btrfs_bio *bbio = btrfs_bio(bio);
+	struct btrfs_inode *bi = BTRFS_I(bbio->inode);
 	u64 logical = bio->bi_iter.bi_sector << 9;
 	u64 length = bio->bi_iter.bi_size;
 	u64 map_length = length;
+	bool use_append = btrfs_use_zone_append(bi, logical);
 	struct btrfs_io_context *bioc = NULL;
 	struct btrfs_io_stripe smap;
 	int ret;
@@ -7193,8 +7198,11 @@ static bool btrfs_submit_chunk(struct btrfs_fs_info *fs_info, struct bio *bio,
 		goto fail;
 
 	map_length = min(map_length, length);
+	if (use_append)
+		map_length = min(map_length, fs_info->max_zone_append_size);
+
 	if (map_length < length) {
-		bio = btrfs_split_bio(bio, map_length);
+		bio = btrfs_split_bio(fs_info, bio, map_length, use_append);
 		bbio = btrfs_bio(bio);
 	}
 
@@ -7210,9 +7218,9 @@ static bool btrfs_submit_chunk(struct btrfs_fs_info *fs_info, struct bio *bio,
 	}
 
 	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
-		struct btrfs_inode *bi = BTRFS_I(bbio->inode);
-
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		if (use_append) {
+			bio->bi_opf &= ~REQ_OP_WRITE;
+			bio->bi_opf |= REQ_OP_ZONE_APPEND;
 			ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
 			if (ret)
 				goto fail_put_bio;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 6e04fbbd76b92..988e9fc5a6b7b 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1818,26 +1818,6 @@ int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
 	return btrfs_zoned_issue_zeroout(tgt_dev, physical_pos, length);
 }
 
-struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
-					    u64 logical, u64 length)
-{
-	struct btrfs_device *device;
-	struct extent_map *em;
-	struct map_lookup *map;
-
-	em = btrfs_get_chunk_map(fs_info, logical, length);
-	if (IS_ERR(em))
-		return ERR_CAST(em);
-
-	map = em->map_lookup;
-	/* We only support single profile for now */
-	device = map->stripes[0].dev;
-
-	free_extent_map(em);
-
-	return device;
-}
-
 /**
  * Activate block group and underlying device zones
  *
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 0f22b22fe359f..74153ab52169f 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -64,8 +64,6 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
 int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
 				  u64 physical_start, u64 physical_pos);
-struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
-					    u64 logical, u64 length);
 bool btrfs_zone_activate(struct btrfs_block_group *block_group);
 int btrfs_zone_finish(struct btrfs_block_group *block_group);
 bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices, u64 flags);
@@ -209,13 +207,6 @@ static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev,
 	return -EOPNOTSUPP;
 }
 
-static inline struct btrfs_device *btrfs_zoned_get_device(
-						  struct btrfs_fs_info *fs_info,
-						  u64 logical, u64 length)
-{
-	return ERR_PTR(-EOPNOTSUPP);
-}
-
 static inline bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 {
 	return true;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (15 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio Christoph Hellwig
@ 2022-09-01  7:42 ` Christoph Hellwig
  2022-09-01 10:46   ` Johannes Thumshirn
                     ` (2 more replies)
  2022-09-02 15:18 ` consolidate btrfs checksumming, repair and bio splitting Johannes Thumshirn
                   ` (2 subsequent siblings)
  19 siblings, 3 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-01  7:42 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

No users left now that btrfs takes REQ_OP_WRITE bios from iomap and
splits and converts them to REQ_OP_ZONE_APPEND internally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/direct-io.c  | 10 ++--------
 include/linux/iomap.h |  1 -
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 4eb559a16c9ed..9e883a9f80388 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -217,16 +217,10 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 {
 	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
 
-	if (!(dio->flags & IOMAP_DIO_WRITE)) {
-		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
+	if (!(dio->flags & IOMAP_DIO_WRITE))
 		return REQ_OP_READ;
-	}
-
-	if (iomap->flags & IOMAP_F_ZONE_APPEND)
-		opflags |= REQ_OP_ZONE_APPEND;
-	else
-		opflags |= REQ_OP_WRITE;
 
+	opflags |= REQ_OP_WRITE;
 	if (use_fua)
 		opflags |= REQ_FUA;
 	else
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 238a03087e17e..ee6d511ef29dd 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -55,7 +55,6 @@ struct vm_fault;
 #define IOMAP_F_SHARED		0x04
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
-#define IOMAP_F_ZONE_APPEND	0x20
 
 /*
  * Flags set by the core iomap code during operations:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH 01/17] block: export bio_split_rw
  2022-09-01  7:42 ` [PATCH 01/17] block: export bio_split_rw Christoph Hellwig
@ 2022-09-01  8:02   ` Johannes Thumshirn
  2022-09-01  8:54   ` Qu Wenruo
  2022-09-07 17:51   ` Josef Bacik
  2 siblings, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-01  8:02 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 01/17] block: export bio_split_rw
  2022-09-01  7:42 ` [PATCH 01/17] block: export bio_split_rw Christoph Hellwig
  2022-09-01  8:02   ` Johannes Thumshirn
@ 2022-09-01  8:54   ` Qu Wenruo
  2022-09-05  6:44     ` Christoph Hellwig
  2022-09-07 17:51   ` Josef Bacik
  2 siblings, 1 reply; 108+ messages in thread
From: Qu Wenruo @ 2022-09-01  8:54 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel



On 2022/9/1 15:42, Christoph Hellwig wrote:
> bio_split_rw can be used by file systems to split an incoming write
> bio into multiple bios fitting the hardware limit for use as ZONE_APPEND
> bios.  Export it for initial use in btrfs.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-merge.c   | 3 ++-
>   include/linux/bio.h | 4 ++++
>   2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index ff04e9290715a..e68295462977b 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -267,7 +267,7 @@ static bool bvec_split_segs(struct queue_limits *lim, const struct bio_vec *bv,
>    * responsible for ensuring that @bs is only destroyed after processing of the
>    * split bio has finished.
>    */
> -static struct bio *bio_split_rw(struct bio *bio, struct queue_limits *lim,
> +struct bio *bio_split_rw(struct bio *bio, struct queue_limits *lim,

I find the queue_limits structure pretty scary, as we only use a very
limited set of its members in this case:

- lim->virt_boundary_mask
   Used in bvec_gap_to_prev()

- lim->max_segments

- lim->seg_boundary_mask
- lim->max_segment_size
   Used in bvec_split_segs()

- lim->logical_block_size

I'm not familiar with the block layer, thus I'm wondering whether btrfs
really needs a full queue_limits structure to call bio_split_rw().

Or can we have a simplified wrapper?

IIRC inside btrfs we only need two cases for bio split:

- Split for stripe boundary

- Split for OE/zoned boundary
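
E.g. a thin helper could hide all the queue_limits details (a completely
untested sketch, reusing the fs_info->limits that a later patch in this
series adds; the helper name is made up):

	static struct bio *btrfs_split_zoned_bio(struct btrfs_fs_info *fs_info,
						 struct bio *bio, u64 max_bytes)
	{
		unsigned int nr_segs;

		/* hypothetical wrapper, just to show the shape */
		return bio_split_rw(bio, &fs_info->limits, &nr_segs,
				    &btrfs_clone_bioset, max_bytes);
	}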

Thanks,
Qu

>   		unsigned *segs, struct bio_set *bs, unsigned max_bytes)
>   {
>   	struct bio_vec bv, bvprv, *bvprvp = NULL;
> @@ -317,6 +317,7 @@ static struct bio *bio_split_rw(struct bio *bio, struct queue_limits *lim,
>   	bio_clear_polled(bio);
>   	return bio_split(bio, bytes >> SECTOR_SHIFT, GFP_NOIO, bs);
>   }
> +EXPORT_SYMBOL_GPL(bio_split_rw);
>
>   /**
>    * __bio_split_to_limits - split a bio to fit the queue limits
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index ca22b06700a94..46890f8235401 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -12,6 +12,8 @@
>
>   #define BIO_MAX_VECS		256U
>
> +struct queue_limits;
> +
>   static inline unsigned int bio_max_segs(unsigned int nr_segs)
>   {
>   	return min(nr_segs, BIO_MAX_VECS);
> @@ -375,6 +377,8 @@ static inline void bip_set_seed(struct bio_integrity_payload *bip,
>   void bio_trim(struct bio *bio, sector_t offset, sector_t size);
>   extern struct bio *bio_split(struct bio *bio, int sectors,
>   			     gfp_t gfp, struct bio_set *bs);
> +struct bio *bio_split_rw(struct bio *bio, struct queue_limits *lim,
> +		unsigned *segs, struct bio_set *bs, unsigned max_bytes);
>
>   /**
>    * bio_next_split - get next @sectors from a bio, splitting if necessary

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 02/17] btrfs: stop tracking failed reads in the I/O tree
  2022-09-01  7:42 ` [PATCH 02/17] btrfs: stop tracking failed reads in the I/O tree Christoph Hellwig
@ 2022-09-01  8:55   ` Qu Wenruo
  2022-09-07 17:52   ` Josef Bacik
  1 sibling, 0 replies; 108+ messages in thread
From: Qu Wenruo @ 2022-09-01  8:55 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel



On 2022/9/1 15:42, Christoph Hellwig wrote:
> There is a separate I/O failure tree to track the fail reads, so remove
> the extra EXTENT_DAMAGED bit in the I/O tree.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Reducing extent flags is already a good thing.

Thanks,
Qu
> ---
>   fs/btrfs/extent-io-tree.h        |  1 -
>   fs/btrfs/extent_io.c             | 16 +---------------
>   fs/btrfs/tests/extent-io-tests.c |  1 -
>   include/trace/events/btrfs.h     |  1 -
>   4 files changed, 1 insertion(+), 18 deletions(-)
>
> diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
> index ec2f8b8e6faa7..e218bb56d86ac 100644
> --- a/fs/btrfs/extent-io-tree.h
> +++ b/fs/btrfs/extent-io-tree.h
> @@ -17,7 +17,6 @@ struct io_failure_record;
>   #define EXTENT_NODATASUM	(1U << 7)
>   #define EXTENT_CLEAR_META_RESV	(1U << 8)
>   #define EXTENT_NEED_WAIT	(1U << 9)
> -#define EXTENT_DAMAGED		(1U << 10)
>   #define EXTENT_NORESERVE	(1U << 11)
>   #define EXTENT_QGROUP_RESERVED	(1U << 12)
>   #define EXTENT_CLEAR_DATA_RESV	(1U << 13)
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 591c191a58bc9..6ac76534d2c9e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2280,23 +2280,13 @@ int free_io_failure(struct extent_io_tree *failure_tree,
>   		    struct io_failure_record *rec)
>   {
>   	int ret;
> -	int err = 0;
>
>   	set_state_failrec(failure_tree, rec->start, NULL);
>   	ret = clear_extent_bits(failure_tree, rec->start,
>   				rec->start + rec->len - 1,
>   				EXTENT_LOCKED | EXTENT_DIRTY);
> -	if (ret)
> -		err = ret;
> -
> -	ret = clear_extent_bits(io_tree, rec->start,
> -				rec->start + rec->len - 1,
> -				EXTENT_DAMAGED);
> -	if (ret && !err)
> -		err = ret;
> -
>   	kfree(rec);
> -	return err;
> +	return ret;
>   }
>
>   /*
> @@ -2521,7 +2511,6 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
>   	u64 start = bbio->file_offset + bio_offset;
>   	struct io_failure_record *failrec;
>   	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
> -	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>   	const u32 sectorsize = fs_info->sectorsize;
>   	int ret;
>
> @@ -2573,9 +2562,6 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
>   			      EXTENT_LOCKED | EXTENT_DIRTY);
>   	if (ret >= 0) {
>   		ret = set_state_failrec(failure_tree, start, failrec);
> -		/* Set the bits in the inode's tree */
> -		ret = set_extent_bits(tree, start, start + sectorsize - 1,
> -				      EXTENT_DAMAGED);
>   	} else if (ret < 0) {
>   		kfree(failrec);
>   		return ERR_PTR(ret);
> diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
> index a232b15b8021f..ba4b7601e8c0a 100644
> --- a/fs/btrfs/tests/extent-io-tests.c
> +++ b/fs/btrfs/tests/extent-io-tests.c
> @@ -80,7 +80,6 @@ static void extent_flag_to_str(const struct extent_state *state, char *dest)
>   	PRINT_ONE_FLAG(state, dest, cur, NODATASUM);
>   	PRINT_ONE_FLAG(state, dest, cur, CLEAR_META_RESV);
>   	PRINT_ONE_FLAG(state, dest, cur, NEED_WAIT);
> -	PRINT_ONE_FLAG(state, dest, cur, DAMAGED);
>   	PRINT_ONE_FLAG(state, dest, cur, NORESERVE);
>   	PRINT_ONE_FLAG(state, dest, cur, QGROUP_RESERVED);
>   	PRINT_ONE_FLAG(state, dest, cur, CLEAR_DATA_RESV);
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 73df80d462dc8..f8a4118b16574 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -154,7 +154,6 @@ FLUSH_STATES
>   	{ EXTENT_NODATASUM,		"NODATASUM"},		\
>   	{ EXTENT_CLEAR_META_RESV,	"CLEAR_META_RESV"},	\
>   	{ EXTENT_NEED_WAIT,		"NEED_WAIT"},		\
> -	{ EXTENT_DAMAGED,		"DAMAGED"},		\
>   	{ EXTENT_NORESERVE,		"NORESERVE"},		\
>   	{ EXTENT_QGROUP_RESERVED,	"QGROUP_RESERVED"},	\
>   	{ EXTENT_CLEAR_DATA_RESV,	"CLEAR_DATA_RESV"},	\

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer
  2022-09-01  7:42 ` [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer Christoph Hellwig
@ 2022-09-01  9:04   ` Qu Wenruo
  2022-09-05  6:48     ` Christoph Hellwig
  2022-09-07 18:15   ` Josef Bacik
  1 sibling, 1 reply; 108+ messages in thread
From: Qu Wenruo @ 2022-09-01  9:04 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel



On 2022/9/1 15:42, Christoph Hellwig wrote:
> Currently btrfs handles checksum validation and repair in the end I/O
> handler for the btrfs_bio.  This leads to a lot of duplicate code
> plus issues with varying semantics or bugs, e.g.
>
>   - the until recently completely broken repair for compressed extents
>   - the fact that encoded reads validate the checksums but do not kick
>     off read repair
>   - the inconsistent checking of the BTRFS_FS_STATE_NO_CSUMS flag
>
> This commit revamps the checksum validation and repair code to instead
> work below the btrfs_submit_bio interfaces.

I'm 100% into the idea of pre-loading the csum at btrfs_submit_bio() time.

That's definitely the way we should go.
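
For reference, my condensed understanding of the submit side after this
patch (NODATASUM and the NO_CSUMS fs state checks elided):

	/* Data reads get their csums attached before hitting the device: */
	if (btrfs_op(bio) == BTRFS_MAP_READ && is_data_inode(bbio->inode)) {
		ret = btrfs_lookup_bio_sums(bbio);	/* fills bbio->csum */
		if (ret)
			goto fail;
	}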

> For this to work we need
> to make sure an inode is available, so that is added as a parameter
> to btrfs_bio_alloc.  With that btrfs_submit_bio can preload
> btrfs_bio.csum from the csum tree without help from the upper layers,
> and the low-level I/O completion can iterate over the bio and verify
> the checksums.

But for the verification part, I still don't like the idea of putting
the verification code in the endio context at all.

This is especially true while data and metadata still do their checksum
verification at different times.

Can we just let the endio function complete the I/O, and let the reader
do the verification after all the needed data has been read out?

Thanks,
Qu

>
> In case of a checksum failure (or a plain old I/O error), the repair
> is now kicked off before the upper level ->end_io handler is invoked.
> Tracking of the repair status is massively simplified by just keeping
> a small failed_bio structure per bio with failed sectors and otherwise
> using the information in the repair bio.  The per-inode I/O failure
> tree can be entirely removed.
>
> The saved bvec_iter in the btrfs_bio is now competely managed by
> btrfs_submit_bio and must not be accessed by the callers.
>
> There is one significant behavior change here:  If repair fails or
> is impossible to start with, the whole bio will be failed to the
> upper layer.  This is the behavior that all I/O submitters except
> for buffered I/O already emulated in their end_io handler.  For
> buffered I/O this now means that a large readahead request can
> fail due to a single bad sector, but as readahead errors are ignored,
> a subsequent readpage of the affected sector will still be able to
> read it.  This also matches the I/O failure handling
> in other file systems.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/btrfs/btrfs_inode.h       |   5 -
>   fs/btrfs/compression.c       |  54 +----
>   fs/btrfs/ctree.h             |  13 +-
>   fs/btrfs/extent-io-tree.h    |  18 --
>   fs/btrfs/extent_io.c         | 451 +----------------------------------
>   fs/btrfs/extent_io.h         |  28 ---
>   fs/btrfs/file-item.c         |  42 ++--
>   fs/btrfs/inode.c             | 320 ++++---------------------
>   fs/btrfs/volumes.c           | 238 ++++++++++++++++--
>   fs/btrfs/volumes.h           |  49 ++--
>   include/trace/events/btrfs.h |   1 -
>   11 files changed, 320 insertions(+), 899 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index b160b8e124e01..4cb9898869019 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -91,11 +91,6 @@ struct btrfs_inode {
>   	/* the io_tree does range state (DIRTY, LOCKED etc) */
>   	struct extent_io_tree io_tree;
>
> -	/* special utility tree used to record which mirrors have already been
> -	 * tried when checksums fail for a given block
> -	 */
> -	struct extent_io_tree io_failure_tree;
> -
>   	/*
>   	 * Keep track of where the inode has extent items mapped in order to
>   	 * make sure the i_size adjustments are accurate
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 1c77de3239bc4..f932415a4f1df 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -159,53 +159,15 @@ static void finish_compressed_bio_read(struct compressed_bio *cb)
>   	kfree(cb);
>   }
>
> -/*
> - * Verify the checksums and kick off repair if needed on the uncompressed data
> - * before decompressing it into the original bio and freeing the uncompressed
> - * pages.
> - */
>   static void end_compressed_bio_read(struct btrfs_bio *bbio)
>   {
>   	struct compressed_bio *cb = bbio->private;
> -	struct inode *inode = cb->inode;
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	struct btrfs_inode *bi = BTRFS_I(inode);
> -	bool csum = !(bi->flags & BTRFS_INODE_NODATASUM) &&
> -		    !test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state);
> -	blk_status_t status = bbio->bio.bi_status;
> -	struct bvec_iter iter;
> -	struct bio_vec bv;
> -	u32 offset;
> -
> -	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
> -		u64 start = bbio->file_offset + offset;
> -
> -		if (!status &&
> -		    (!csum || !btrfs_check_data_csum(inode, bbio, offset,
> -						     bv.bv_page, bv.bv_offset))) {
> -			clean_io_failure(fs_info, &bi->io_failure_tree,
> -					 &bi->io_tree, start, bv.bv_page,
> -					 btrfs_ino(bi), bv.bv_offset);
> -		} else {
> -			int ret;
> -
> -			refcount_inc(&cb->pending_ios);
> -			ret = btrfs_repair_one_sector(inode, bbio, offset,
> -						      bv.bv_page, bv.bv_offset,
> -						      btrfs_submit_data_read_bio);
> -			if (ret) {
> -				refcount_dec(&cb->pending_ios);
> -				status = errno_to_blk_status(ret);
> -			}
> -		}
> -	}
>
> -	if (status)
> -		cb->status = status;
> +	if (bbio->bio.bi_status)
> +		cb->status = bbio->bio.bi_status;
>
>   	if (refcount_dec_and_test(&cb->pending_ios))
>   		finish_compressed_bio_read(cb);
> -	btrfs_bio_free_csum(bbio);
>   	bio_put(&bbio->bio);
>   }
>
> @@ -342,7 +304,7 @@ static struct bio *alloc_compressed_bio(struct compressed_bio *cb, u64 disk_byte
>   	struct bio *bio;
>   	int ret;
>
> -	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, endio_func, cb);
> +	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, cb->inode, endio_func, cb);
>   	bio->bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
>
>   	em = btrfs_get_chunk_map(fs_info, disk_bytenr, fs_info->sectorsize);
> @@ -778,10 +740,6 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>   			submit = true;
>
>   		if (submit) {
> -			/* Save the original iter for read repair */
> -			if (bio_op(comp_bio) == REQ_OP_READ)
> -				btrfs_bio(comp_bio)->iter = comp_bio->bi_iter;
> -
>   			/*
>   			 * Save the initial offset of this chunk, as there
>   			 * is no direct correlation between compressed pages and
> @@ -790,12 +748,6 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>   			 */
>   			btrfs_bio(comp_bio)->file_offset = file_offset;
>
> -			ret = btrfs_lookup_bio_sums(inode, comp_bio, NULL);
> -			if (ret) {
> -				btrfs_bio_end_io(btrfs_bio(comp_bio), ret);
> -				break;
> -			}
> -
>   			ASSERT(comp_bio->bi_iter.bi_size);
>   			btrfs_submit_bio(fs_info, comp_bio, mirror_num);
>   			comp_bio = NULL;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0069bc86c04f1..3dcb0d5f8faa0 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3344,7 +3344,7 @@ int btrfs_find_orphan_item(struct btrfs_root *root, u64 offset);
>   /* file-item.c */
>   int btrfs_del_csums(struct btrfs_trans_handle *trans,
>   		    struct btrfs_root *root, u64 bytenr, u64 len);
> -blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst);
> +int btrfs_lookup_bio_sums(struct btrfs_bio *bbio);
>   int btrfs_insert_hole_extent(struct btrfs_trans_handle *trans,
>   			     struct btrfs_root *root, u64 objectid, u64 pos,
>   			     u64 num_bytes);
> @@ -3375,15 +3375,8 @@ u64 btrfs_file_extent_end(const struct btrfs_path *path);
>   void btrfs_submit_data_write_bio(struct inode *inode, struct bio *bio, int mirror_num);
>   void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
>   			int mirror_num, enum btrfs_compression_type compress_type);
> -int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
> -			    u32 pgoff, u8 *csum, const u8 * const csum_expected);
> -int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
> -			  u32 bio_offset, struct page *page, u32 pgoff);
> -unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
> -				    u32 bio_offset, struct page *page,
> -				    u64 start, u64 end);
> -int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
> -			  u32 bio_offset, struct page *page, u32 pgoff);
> +bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
> +			u32 bio_offset, struct bio_vec *bv);
>   struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
>   					   u64 start, u64 len);
>   noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
> diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
> index e218bb56d86ac..a1afe6e15943e 100644
> --- a/fs/btrfs/extent-io-tree.h
> +++ b/fs/btrfs/extent-io-tree.h
> @@ -4,7 +4,6 @@
>   #define BTRFS_EXTENT_IO_TREE_H
>
>   struct extent_changeset;
> -struct io_failure_record;
>
>   /* Bits for the extent state */
>   #define EXTENT_DIRTY		(1U << 0)
> @@ -55,7 +54,6 @@ enum {
>   	IO_TREE_FS_EXCLUDED_EXTENTS,
>   	IO_TREE_BTREE_INODE_IO,
>   	IO_TREE_INODE_IO,
> -	IO_TREE_INODE_IO_FAILURE,
>   	IO_TREE_RELOC_BLOCKS,
>   	IO_TREE_TRANS_DIRTY_PAGES,
>   	IO_TREE_ROOT_DIRTY_LOG_PAGES,
> @@ -88,8 +86,6 @@ struct extent_state {
>   	refcount_t refs;
>   	u32 state;
>
> -	struct io_failure_record *failrec;
> -
>   #ifdef CONFIG_BTRFS_DEBUG
>   	struct list_head leak_list;
>   #endif
> @@ -246,18 +242,4 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
>   			       u64 *end, u64 max_bytes,
>   			       struct extent_state **cached_state);
>
> -/* This should be reworked in the future and put elsewhere. */
> -struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start);
> -int set_state_failrec(struct extent_io_tree *tree, u64 start,
> -		      struct io_failure_record *failrec);
> -void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start,
> -		u64 end);
> -int free_io_failure(struct extent_io_tree *failure_tree,
> -		    struct extent_io_tree *io_tree,
> -		    struct io_failure_record *rec);
> -int clean_io_failure(struct btrfs_fs_info *fs_info,
> -		     struct extent_io_tree *failure_tree,
> -		     struct extent_io_tree *io_tree, u64 start,
> -		     struct page *page, u64 ino, unsigned int pg_offset);
> -
>   #endif /* BTRFS_EXTENT_IO_TREE_H */
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c83cc5677a08a..d8c43e2111a99 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -326,7 +326,6 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
>   	if (!state)
>   		return state;
>   	state->state = 0;
> -	state->failrec = NULL;
>   	RB_CLEAR_NODE(&state->rb_node);
>   	btrfs_leak_debug_add(&leak_lock, &state->leak_list, &states);
>   	refcount_set(&state->refs, 1);
> @@ -2159,66 +2158,6 @@ u64 count_range_bits(struct extent_io_tree *tree,
>   	return total_bytes;
>   }
>
> -/*
> - * set the private field for a given byte offset in the tree.  If there isn't
> - * an extent_state there already, this does nothing.
> - */
> -int set_state_failrec(struct extent_io_tree *tree, u64 start,
> -		      struct io_failure_record *failrec)
> -{
> -	struct rb_node *node;
> -	struct extent_state *state;
> -	int ret = 0;
> -
> -	spin_lock(&tree->lock);
> -	/*
> -	 * this search will find all the extents that end after
> -	 * our range starts.
> -	 */
> -	node = tree_search(tree, start);
> -	if (!node) {
> -		ret = -ENOENT;
> -		goto out;
> -	}
> -	state = rb_entry(node, struct extent_state, rb_node);
> -	if (state->start != start) {
> -		ret = -ENOENT;
> -		goto out;
> -	}
> -	state->failrec = failrec;
> -out:
> -	spin_unlock(&tree->lock);
> -	return ret;
> -}
> -
> -struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start)
> -{
> -	struct rb_node *node;
> -	struct extent_state *state;
> -	struct io_failure_record *failrec;
> -
> -	spin_lock(&tree->lock);
> -	/*
> -	 * this search will find all the extents that end after
> -	 * our range starts.
> -	 */
> -	node = tree_search(tree, start);
> -	if (!node) {
> -		failrec = ERR_PTR(-ENOENT);
> -		goto out;
> -	}
> -	state = rb_entry(node, struct extent_state, rb_node);
> -	if (state->start != start) {
> -		failrec = ERR_PTR(-ENOENT);
> -		goto out;
> -	}
> -
> -	failrec = state->failrec;
> -out:
> -	spin_unlock(&tree->lock);
> -	return failrec;
> -}
> -
>   /*
>    * searches a range in the state tree for a given mask.
>    * If 'filled' == 1, this returns 1 only if every extent in the tree
> @@ -2275,258 +2214,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
>   	return bitset;
>   }
>
> -int free_io_failure(struct extent_io_tree *failure_tree,
> -		    struct extent_io_tree *io_tree,
> -		    struct io_failure_record *rec)
> -{
> -	int ret;
> -
> -	set_state_failrec(failure_tree, rec->start, NULL);
> -	ret = clear_extent_bits(failure_tree, rec->start,
> -				rec->start + rec->len - 1,
> -				EXTENT_LOCKED | EXTENT_DIRTY);
> -	kfree(rec);
> -	return ret;
> -}
> -
> -static int next_mirror(const struct io_failure_record *failrec, int cur_mirror)
> -{
> -	if (cur_mirror == failrec->num_copies)
> -		return cur_mirror + 1 - failrec->num_copies;
> -	return cur_mirror + 1;
> -}
> -
> -static int prev_mirror(const struct io_failure_record *failrec, int cur_mirror)
> -{
> -	if (cur_mirror == 1)
> -		return failrec->num_copies;
> -	return cur_mirror - 1;
> -}
> -
> -/*
> - * each time an IO finishes, we do a fast check in the IO failure tree
> - * to see if we need to process or clean up an io_failure_record
> - */
> -int clean_io_failure(struct btrfs_fs_info *fs_info,
> -		     struct extent_io_tree *failure_tree,
> -		     struct extent_io_tree *io_tree, u64 start,
> -		     struct page *page, u64 ino, unsigned int pg_offset)
> -{
> -	u64 private;
> -	struct io_failure_record *failrec;
> -	struct extent_state *state;
> -	int mirror;
> -	int ret;
> -
> -	private = 0;
> -	ret = count_range_bits(failure_tree, &private, (u64)-1, 1,
> -			       EXTENT_DIRTY, 0);
> -	if (!ret)
> -		return 0;
> -
> -	failrec = get_state_failrec(failure_tree, start);
> -	if (IS_ERR(failrec))
> -		return 0;
> -
> -	BUG_ON(!failrec->this_mirror);
> -
> -	if (sb_rdonly(fs_info->sb))
> -		goto out;
> -
> -	spin_lock(&io_tree->lock);
> -	state = find_first_extent_bit_state(io_tree,
> -					    failrec->start,
> -					    EXTENT_LOCKED);
> -	spin_unlock(&io_tree->lock);
> -
> -	if (!state || state->start > failrec->start ||
> -	    state->end < failrec->start + failrec->len - 1)
> -		goto out;
> -
> -	mirror = failrec->this_mirror;
> -	do {
> -		mirror = prev_mirror(failrec, mirror);
> -		btrfs_repair_io_failure(fs_info, ino, start, failrec->len,
> -				  failrec->logical, page, pg_offset, mirror);
> -	} while (mirror != failrec->failed_mirror);
> -
> -out:
> -	free_io_failure(failure_tree, io_tree, failrec);
> -	return 0;
> -}
> -
> -/*
> - * Can be called when
> - * - hold extent lock
> - * - under ordered extent
> - * - the inode is freeing
> - */
> -void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
> -{
> -	struct extent_io_tree *failure_tree = &inode->io_failure_tree;
> -	struct io_failure_record *failrec;
> -	struct extent_state *state, *next;
> -
> -	if (RB_EMPTY_ROOT(&failure_tree->state))
> -		return;
> -
> -	spin_lock(&failure_tree->lock);
> -	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
> -	while (state) {
> -		if (state->start > end)
> -			break;
> -
> -		ASSERT(state->end <= end);
> -
> -		next = next_state(state);
> -
> -		failrec = state->failrec;
> -		free_extent_state(state);
> -		kfree(failrec);
> -
> -		state = next;
> -	}
> -	spin_unlock(&failure_tree->lock);
> -}
> -
> -static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
> -							     struct btrfs_bio *bbio,
> -							     unsigned int bio_offset)
> -{
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	u64 start = bbio->file_offset + bio_offset;
> -	struct io_failure_record *failrec;
> -	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
> -	const u32 sectorsize = fs_info->sectorsize;
> -	int ret;
> -
> -	failrec = get_state_failrec(failure_tree, start);
> -	if (!IS_ERR(failrec)) {
> -		btrfs_debug(fs_info,
> -	"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu",
> -			failrec->logical, failrec->start, failrec->len);
> -		/*
> -		 * when data can be on disk more than twice, add to failrec here
> -		 * (e.g. with a list for failed_mirror) to make
> -		 * clean_io_failure() clean all those errors at once.
> -		 */
> -		ASSERT(failrec->this_mirror == bbio->mirror_num);
> -		ASSERT(failrec->len == fs_info->sectorsize);
> -		return failrec;
> -	}
> -
> -	failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
> -	if (!failrec)
> -		return ERR_PTR(-ENOMEM);
> -
> -	failrec->start = start;
> -	failrec->len = sectorsize;
> -	failrec->failed_mirror = bbio->mirror_num;
> -	failrec->this_mirror = bbio->mirror_num;
> -	failrec->logical = (bbio->iter.bi_sector << SECTOR_SHIFT) + bio_offset;
> -
> -	btrfs_debug(fs_info,
> -		    "new io failure record logical %llu start %llu",
> -		    failrec->logical, start);
> -
> -	failrec->num_copies = btrfs_num_copies(fs_info, failrec->logical, sectorsize);
> -	if (failrec->num_copies == 1) {
> -		/*
> -		 * We only have a single copy of the data, so don't bother with
> -		 * all the retry and error correction code that follows. No
> -		 * matter what the error is, it is very likely to persist.
> -		 */
> -		btrfs_debug(fs_info,
> -			"cannot repair logical %llu num_copies %d",
> -			failrec->logical, failrec->num_copies);
> -		kfree(failrec);
> -		return ERR_PTR(-EIO);
> -	}
> -
> -	/* Set the bits in the private failure tree */
> -	ret = set_extent_bits(failure_tree, start, start + sectorsize - 1,
> -			      EXTENT_LOCKED | EXTENT_DIRTY);
> -	if (ret >= 0) {
> -		ret = set_state_failrec(failure_tree, start, failrec);
> -	} else if (ret < 0) {
> -		kfree(failrec);
> -		return ERR_PTR(ret);
> -	}
> -
> -	return failrec;
> -}
> -
> -int btrfs_repair_one_sector(struct inode *inode, struct btrfs_bio *failed_bbio,
> -			    u32 bio_offset, struct page *page, unsigned int pgoff,
> -			    submit_bio_hook_t *submit_bio_hook)
> -{
> -	u64 start = failed_bbio->file_offset + bio_offset;
> -	struct io_failure_record *failrec;
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> -	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
> -	struct bio *failed_bio = &failed_bbio->bio;
> -	const int icsum = bio_offset >> fs_info->sectorsize_bits;
> -	struct bio *repair_bio;
> -	struct btrfs_bio *repair_bbio;
> -
> -	btrfs_debug(fs_info,
> -		   "repair read error: read error at %llu", start);
> -
> -	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
> -
> -	failrec = btrfs_get_io_failure_record(inode, failed_bbio, bio_offset);
> -	if (IS_ERR(failrec))
> -		return PTR_ERR(failrec);
> -
> -	/*
> -	 * There are two premises:
> -	 * a) deliver good data to the caller
> -	 * b) correct the bad sectors on disk
> -	 *
> -	 * Since we're only doing repair for one sector, we only need to get
> -	 * a good copy of the failed sector and if we succeed, we have setup
> -	 * everything for btrfs_repair_io_failure to do the rest for us.
> -	 */
> -	failrec->this_mirror = next_mirror(failrec, failrec->this_mirror);
> -	if (failrec->this_mirror == failrec->failed_mirror) {
> -		btrfs_debug(fs_info,
> -			"failed to repair num_copies %d this_mirror %d failed_mirror %d",
> -			failrec->num_copies, failrec->this_mirror, failrec->failed_mirror);
> -		free_io_failure(failure_tree, tree, failrec);
> -		return -EIO;
> -	}
> -
> -	repair_bio = btrfs_bio_alloc(1, REQ_OP_READ, failed_bbio->end_io,
> -				     failed_bbio->private);
> -	repair_bbio = btrfs_bio(repair_bio);
> -	repair_bbio->file_offset = start;
> -	repair_bio->bi_iter.bi_sector = failrec->logical >> 9;
> -
> -	if (failed_bbio->csum) {
> -		const u32 csum_size = fs_info->csum_size;
> -
> -		repair_bbio->csum = repair_bbio->csum_inline;
> -		memcpy(repair_bbio->csum,
> -		       failed_bbio->csum + csum_size * icsum, csum_size);
> -	}
> -
> -	bio_add_page(repair_bio, page, failrec->len, pgoff);
> -	repair_bbio->iter = repair_bio->bi_iter;
> -
> -	btrfs_debug(btrfs_sb(inode->i_sb),
> -		    "repair read error: submitting new read to mirror %d",
> -		    failrec->this_mirror);
> -
> -	/*
> -	 * At this point we have a bio, so any errors from submit_bio_hook()
> -	 * will be handled by the endio on the repair_bio, so we can't return an
> -	 * error here.
> -	 */
> -	submit_bio_hook(inode, repair_bio, failrec->this_mirror, 0);
> -	return BLK_STS_OK;
> -}
> -
>   static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
>   {
>   	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> @@ -2555,84 +2242,6 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
>   		btrfs_subpage_end_reader(fs_info, page, start, len);
>   }
>
> -static void end_sector_io(struct page *page, u64 offset, bool uptodate)
> -{
> -	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
> -	const u32 sectorsize = inode->root->fs_info->sectorsize;
> -	struct extent_state *cached = NULL;
> -
> -	end_page_read(page, uptodate, offset, sectorsize);
> -	if (uptodate)
> -		set_extent_uptodate(&inode->io_tree, offset,
> -				    offset + sectorsize - 1, &cached, GFP_ATOMIC);
> -	unlock_extent_cached_atomic(&inode->io_tree, offset,
> -				    offset + sectorsize - 1, &cached);
> -}
> -
> -static void submit_data_read_repair(struct inode *inode,
> -				    struct btrfs_bio *failed_bbio,
> -				    u32 bio_offset, const struct bio_vec *bvec,
> -				    unsigned int error_bitmap)
> -{
> -	const unsigned int pgoff = bvec->bv_offset;
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	struct page *page = bvec->bv_page;
> -	const u64 start = page_offset(bvec->bv_page) + bvec->bv_offset;
> -	const u64 end = start + bvec->bv_len - 1;
> -	const u32 sectorsize = fs_info->sectorsize;
> -	const int nr_bits = (end + 1 - start) >> fs_info->sectorsize_bits;
> -	int i;
> -
> -	BUG_ON(bio_op(&failed_bbio->bio) == REQ_OP_WRITE);
> -
> -	/* This repair is only for data */
> -	ASSERT(is_data_inode(inode));
> -
> -	/* We're here because we had some read errors or csum mismatch */
> -	ASSERT(error_bitmap);
> -
> -	/*
> -	 * We only get called on buffered IO, thus page must be mapped and bio
> -	 * must not be cloned.
> -	 */
> -	ASSERT(page->mapping && !bio_flagged(&failed_bbio->bio, BIO_CLONED));
> -
> -	/* Iterate through all the sectors in the range */
> -	for (i = 0; i < nr_bits; i++) {
> -		const unsigned int offset = i * sectorsize;
> -		bool uptodate = false;
> -		int ret;
> -
> -		if (!(error_bitmap & (1U << i))) {
> -			/*
> -			 * This sector has no error, just end the page read
> -			 * and unlock the range.
> -			 */
> -			uptodate = true;
> -			goto next;
> -		}
> -
> -		ret = btrfs_repair_one_sector(inode, failed_bbio,
> -				bio_offset + offset, page, pgoff + offset,
> -				btrfs_submit_data_read_bio);
> -		if (!ret) {
> -			/*
> -			 * We have submitted the read repair, the page release
> -			 * will be handled by the endio function of the
> -			 * submitted repair bio.
> -			 * Thus we don't need to do any thing here.
> -			 */
> -			continue;
> -		}
> -		/*
> -		 * Continue on failed repair, otherwise the remaining sectors
> -		 * will not be properly unlocked.
> -		 */
> -next:
> -		end_sector_io(page, start + offset, uptodate);
> -	}
> -}
> -
>   /* lots and lots of room for performance fixes in the end_bio funcs */
>
>   void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
> @@ -2835,7 +2444,6 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
>   {
>   	struct bio *bio = &bbio->bio;
>   	struct bio_vec *bvec;
> -	struct extent_io_tree *tree, *failure_tree;
>   	struct processed_extent processed = { 0 };
>   	/*
>   	 * The offset to the beginning of a bio, since one bio can never be
> @@ -2852,8 +2460,6 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
>   		struct inode *inode = page->mapping->host;
>   		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   		const u32 sectorsize = fs_info->sectorsize;
> -		unsigned int error_bitmap = (unsigned int)-1;
> -		bool repair = false;
>   		u64 start;
>   		u64 end;
>   		u32 len;
> @@ -2862,8 +2468,6 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
>   			"end_bio_extent_readpage: bi_sector=%llu, err=%d, mirror=%u",
>   			bio->bi_iter.bi_sector, bio->bi_status,
>   			bbio->mirror_num);
> -		tree = &BTRFS_I(inode)->io_tree;
> -		failure_tree = &BTRFS_I(inode)->io_failure_tree;
>
>   		/*
>   		 * We always issue full-sector reads, but if some block in a
> @@ -2887,27 +2491,15 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
>   		len = bvec->bv_len;
>
>   		mirror = bbio->mirror_num;
> -		if (likely(uptodate)) {
> -			if (is_data_inode(inode)) {
> -				error_bitmap = btrfs_verify_data_csum(bbio,
> -						bio_offset, page, start, end);
> -				if (error_bitmap)
> -					uptodate = false;
> -			} else {
> -				if (btrfs_validate_metadata_buffer(bbio,
> -						page, start, end, mirror))
> -					uptodate = false;
> -			}
> -		}
> +		if (uptodate && !is_data_inode(inode) &&
> +		    btrfs_validate_metadata_buffer(bbio, page, start, end,
> +						   mirror))
> +			uptodate = false;
>
>   		if (likely(uptodate)) {
>   			loff_t i_size = i_size_read(inode);
>   			pgoff_t end_index = i_size >> PAGE_SHIFT;
>
> -			clean_io_failure(BTRFS_I(inode)->root->fs_info,
> -					 failure_tree, tree, start, page,
> -					 btrfs_ino(BTRFS_I(inode)), 0);
> -
>   			/*
>   			 * Zero out the remaining part if this range straddles
>   			 * i_size.
> @@ -2924,19 +2516,7 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
>   				zero_user_segment(page, zero_start,
>   						  offset_in_page(end) + 1);
>   			}
> -		} else if (is_data_inode(inode)) {
> -			/*
> -			 * Only try to repair bios that actually made it to a
> -			 * device.  If the bio failed to be submitted mirror
> -			 * is 0 and we need to fail it without retrying.
> -			 *
> -			 * This also includes the high level bios for compressed
> -			 * extents - these never make it to a device and repair
> -			 * is already handled on the lower compressed bio.
> -			 */
> -			if (mirror > 0)
> -				repair = true;
> -		} else {
> +		} else if (!is_data_inode(inode)) {
>   			struct extent_buffer *eb;
>
>   			eb = find_extent_buffer_readpage(fs_info, page, start);
> @@ -2945,19 +2525,10 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
>   			atomic_dec(&eb->io_pages);
>   		}
>
> -		if (repair) {
> -			/*
> -			 * submit_data_read_repair() will handle all the good
> -			 * and bad sectors, we just continue to the next bvec.
> -			 */
> -			submit_data_read_repair(inode, bbio, bio_offset, bvec,
> -						error_bitmap);
> -		} else {
> -			/* Update page status and unlock */
> -			end_page_read(page, uptodate, start, len);
> -			endio_readpage_release_extent(&processed, BTRFS_I(inode),
> -					start, end, PageUptodate(page));
> -		}
> +		/* Update page status and unlock */
> +		end_page_read(page, uptodate, start, len);
> +		endio_readpage_release_extent(&processed, BTRFS_I(inode),
> +				start, end, PageUptodate(page));
>
>   		ASSERT(bio_offset + len > bio_offset);
>   		bio_offset += len;
> @@ -2965,7 +2536,6 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
>   	}
>   	/* Release the last extent */
>   	endio_readpage_release_extent(&processed, NULL, 0, 0, false);
> -	btrfs_bio_free_csum(bbio);
>   	bio_put(bio);
>   }
>
> @@ -3158,7 +2728,8 @@ static int alloc_new_bio(struct btrfs_inode *inode,
>   	struct bio *bio;
>   	int ret;
>
> -	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, end_io_func, NULL);
> +	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, &inode->vfs_inode, end_io_func,
> +			      NULL);
>   	/*
>   	 * For compressed page range, its disk_bytenr is always @disk_bytenr
>   	 * passed in, no matter if we have added any range into previous bio.
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index e653e64598bf7..caf3343d1a36c 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -57,17 +57,11 @@ enum {
>   #define BITMAP_LAST_BYTE_MASK(nbits) \
>   	(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
>
> -struct btrfs_bio;
>   struct btrfs_root;
>   struct btrfs_inode;
>   struct btrfs_fs_info;
> -struct io_failure_record;
>   struct extent_io_tree;
>
> -typedef void (submit_bio_hook_t)(struct inode *inode, struct bio *bio,
> -					 int mirror_num,
> -					 enum btrfs_compression_type compress_type);
> -
>   typedef blk_status_t (extent_submit_bio_start_t)(struct inode *inode,
>   		struct bio *bio, u64 dio_file_offset);
>
> @@ -244,28 +238,6 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array);
>
>   void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
>
> -/*
> - * When IO fails, either with EIO or csum verification fails, we
> - * try other mirrors that might have a good copy of the data.  This
> - * io_failure_record is used to record state as we go through all the
> - * mirrors.  If another mirror has good data, the sector is set up to date
> - * and things continue.  If a good mirror can't be found, the original
> - * bio end_io callback is called to indicate things have failed.
> - */
> -struct io_failure_record {
> -	struct page *page;
> -	u64 start;
> -	u64 len;
> -	u64 logical;
> -	int this_mirror;
> -	int failed_mirror;
> -	int num_copies;
> -};
> -
> -int btrfs_repair_one_sector(struct inode *inode, struct btrfs_bio *failed_bbio,
> -			    u32 bio_offset, struct page *page, unsigned int pgoff,
> -			    submit_bio_hook_t *submit_bio_hook);
> -
>   #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>   bool find_lock_delalloc_range(struct inode *inode,
>   			     struct page *locked_page, u64 *start,
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 29999686d234c..ffbac8f257908 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -359,27 +359,27 @@ static int search_file_offset_in_bio(struct bio *bio, struct inode *inode,
>    *       NULL, the checksum buffer is allocated and returned in
>    *       btrfs_bio(bio)->csum instead.
>    *
> - * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
> + * Return: -ENOMEM if allocating memory fails, 0 otherwise.
>    */
> -blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst)
> +int btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
>   {
> +	struct inode *inode = bbio->inode;
>   	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> -	struct btrfs_bio *bbio = NULL;
> +	struct bio *bio = &bbio->bio;
>   	struct btrfs_path *path;
>   	const u32 sectorsize = fs_info->sectorsize;
>   	const u32 csum_size = fs_info->csum_size;
>   	u32 orig_len = bio->bi_iter.bi_size;
>   	u64 orig_disk_bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>   	u64 cur_disk_bytenr;
> -	u8 *csum;
>   	const unsigned int nblocks = orig_len >> fs_info->sectorsize_bits;
>   	int count = 0;
> -	blk_status_t ret = BLK_STS_OK;
> +	int ret = 0;
>
>   	if ((BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) ||
>   	    test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state))
> -		return BLK_STS_OK;
> +		return 0;
>
>   	/*
>   	 * This function is only called for read bio.
> @@ -396,23 +396,16 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>   	ASSERT(bio_op(bio) == REQ_OP_READ);
>   	path = btrfs_alloc_path();
>   	if (!path)
> -		return BLK_STS_RESOURCE;
> -
> -	if (!dst) {
> -		bbio = btrfs_bio(bio);
> +		return -ENOMEM;
>
> -		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
> -			bbio->csum = kmalloc_array(nblocks, csum_size, GFP_NOFS);
> -			if (!bbio->csum) {
> -				btrfs_free_path(path);
> -				return BLK_STS_RESOURCE;
> -			}
> -		} else {
> -			bbio->csum = bbio->csum_inline;
> +	if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
> +		bbio->csum = kmalloc_array(nblocks, csum_size, GFP_NOFS);
> +		if (!bbio->csum) {
> +			btrfs_free_path(path);
> +			return -ENOMEM;
>   		}
> -		csum = bbio->csum;
>   	} else {
> -		csum = dst;
> +		bbio->csum = bbio->csum_inline;
>   	}
>
>   	/*
> @@ -451,14 +444,15 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
>   		ASSERT(cur_disk_bytenr - orig_disk_bytenr < UINT_MAX);
>   		sector_offset = (cur_disk_bytenr - orig_disk_bytenr) >>
>   				fs_info->sectorsize_bits;
> -		csum_dst = csum + sector_offset * csum_size;
> +		csum_dst = bbio->csum + sector_offset * csum_size;
>
>   		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
>   					 search_len, csum_dst);
>   		if (count < 0) {
> -			ret = errno_to_blk_status(count);
> -			if (bbio)
> -				btrfs_bio_free_csum(bbio);
> +			ret = count;
> +			if (bbio->csum != bbio->csum_inline)
> +				kfree(bbio->csum);
> +			bbio->csum = NULL;
>   			break;
>   		}
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b9d40e25d978c..b3466015008c7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -85,9 +85,6 @@ struct btrfs_dio_private {
>   	 */
>   	refcount_t refs;
>
> -	/* Array of checksums */
> -	u8 *csums;
> -
>   	/* This must be last */
>   	struct bio bio;
>   };
> @@ -2735,9 +2732,6 @@ void btrfs_submit_data_write_bio(struct inode *inode, struct bio *bio, int mirro
>   void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
>   			int mirror_num, enum btrfs_compression_type compress_type)
>   {
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	blk_status_t ret;
> -
>   	if (compress_type != BTRFS_COMPRESS_NONE) {
>   		/*
>   		 * btrfs_submit_compressed_read will handle completing the bio
> @@ -2747,20 +2741,7 @@ void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
>   		return;
>   	}
>
> -	/* Save the original iter for read repair */
> -	btrfs_bio(bio)->iter = bio->bi_iter;
> -
> -	/*
> -	 * Lookup bio sums does extra checks around whether we need to csum or
> -	 * not, which is why we ignore skip_sum here.
> -	 */
> -	ret = btrfs_lookup_bio_sums(inode, bio, NULL);
> -	if (ret) {
> -		btrfs_bio_end_io(btrfs_bio(bio), ret);
> -		return;
> -	}
> -
> -	btrfs_submit_bio(fs_info, bio, mirror_num);
> +	btrfs_submit_bio(btrfs_sb(inode->i_sb), bio, mirror_num);
>   }
>
>   /*
> @@ -3238,8 +3219,6 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
>   					ordered_extent->disk_num_bytes);
>   	}
>
> -	btrfs_free_io_failure_record(inode, start, end);
> -
>   	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
>   		truncated = true;
>   		logical_len = ordered_extent->truncated_len;
> @@ -3417,133 +3396,64 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
>   }
>
>   /*
> - * Verify the checksum for a single sector without any extra action that depend
> - * on the type of I/O.
> + * btrfs_data_csum_ok - verify the checksum of a single data sector
> + * @bbio:	btrfs_bio which contains the csum
> + * @dev:	device the sector is on
> + * @bio_offset:	offset to the beginning of the bio (in bytes)
> + * @bv:		bio_vec to check
> + *
> + * Check if the checksum on a data block is valid.  When a checksum mismatch is
> + * detected, report the error and fill the corrupted range with zero.
> + *
> + * Return %true if the sector is ok or had no checksum to start with, else
> + * %false.
>    */
> -int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
> -			    u32 pgoff, u8 *csum, const u8 * const csum_expected)
> +bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
> +			u32 bio_offset, struct bio_vec *bv)
>   {
> +	struct btrfs_fs_info *fs_info = btrfs_sb(bbio->inode->i_sb);
> +	struct btrfs_inode *bi = BTRFS_I(bbio->inode);
> +	u64 file_offset = bbio->file_offset + bio_offset;
> +	u64 end = file_offset + bv->bv_len - 1;
>   	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
> +	u8 *csum_expected;
> +	u8 csum[BTRFS_CSUM_SIZE];
>   	char *kaddr;
>
> -	ASSERT(pgoff + fs_info->sectorsize <= PAGE_SIZE);
> +	ASSERT(bv->bv_len == fs_info->sectorsize);
> +
> +	if (!bbio->csum)
> +		return true;
> +
> +	if (btrfs_is_data_reloc_root(bi->root) &&
> +	    test_range_bit(&bi->io_tree, file_offset, end, EXTENT_NODATASUM,
> +			1, NULL)) {
> +		/* Skip the range without csum for data reloc inode */
> +		clear_extent_bits(&bi->io_tree, file_offset, end,
> +				  EXTENT_NODATASUM);
> +		return true;
> +	}
> +
> +	csum_expected = btrfs_csum_ptr(fs_info, bbio->csum, bio_offset);
>
>   	shash->tfm = fs_info->csum_shash;
>
> -	kaddr = kmap_local_page(page) + pgoff;
> +	kaddr = bvec_kmap_local(bv);
>   	crypto_shash_digest(shash, kaddr, fs_info->sectorsize, csum);
>   	kunmap_local(kaddr);
>
>   	if (memcmp(csum, csum_expected, fs_info->csum_size))
> -		return -EIO;
> -	return 0;
> -}
> -
> -/*
> - * check_data_csum - verify checksum of one sector of uncompressed data
> - * @inode:	inode
> - * @bbio:	btrfs_bio which contains the csum
> - * @bio_offset:	offset to the beginning of the bio (in bytes)
> - * @page:	page where is the data to be verified
> - * @pgoff:	offset inside the page
> - *
> - * The length of such check is always one sector size.
> - *
> - * When csum mismatch is detected, we will also report the error and fill the
> - * corrupted range with zero. (Thus it needs the extra parameters)
> - */
> -int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
> -			  u32 bio_offset, struct page *page, u32 pgoff)
> -{
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	u32 len = fs_info->sectorsize;
> -	u8 *csum_expected;
> -	u8 csum[BTRFS_CSUM_SIZE];
> -
> -	ASSERT(pgoff + len <= PAGE_SIZE);
> -
> -	csum_expected = btrfs_csum_ptr(fs_info, bbio->csum, bio_offset);
> -
> -	if (btrfs_check_sector_csum(fs_info, page, pgoff, csum, csum_expected))
>   		goto zeroit;
> -	return 0;
> +	return true;
>
>   zeroit:
> -	btrfs_print_data_csum_error(BTRFS_I(inode),
> -				    bbio->file_offset + bio_offset,
> -				    csum, csum_expected, bbio->mirror_num);
> -	if (bbio->device)
> -		btrfs_dev_stat_inc_and_print(bbio->device,
> +	btrfs_print_data_csum_error(BTRFS_I(bbio->inode), file_offset, csum,
> +				    csum_expected, bbio->mirror_num);
> +	if (dev)
> +		btrfs_dev_stat_inc_and_print(dev,
>   					     BTRFS_DEV_STAT_CORRUPTION_ERRS);
> -	memzero_page(page, pgoff, len);
> -	return -EIO;
> -}
> -
> -/*
> - * When reads are done, we need to check csums to verify the data is correct.
> - * if there's a match, we allow the bio to finish.  If not, the code in
> - * extent_io.c will try to find good copies for us.
> - *
> - * @bio_offset:	offset to the beginning of the bio (in bytes)
> - * @start:	file offset of the range start
> - * @end:	file offset of the range end (inclusive)
> - *
> - * Return a bitmap where bit set means a csum mismatch, and bit not set means
> - * csum match.
> - */
> -unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
> -				    u32 bio_offset, struct page *page,
> -				    u64 start, u64 end)
> -{
> -	struct inode *inode = page->mapping->host;
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> -	struct btrfs_root *root = BTRFS_I(inode)->root;
> -	const u32 sectorsize = root->fs_info->sectorsize;
> -	u32 pg_off;
> -	unsigned int result = 0;
> -
> -	/*
> -	 * This only happens for NODATASUM or compressed read.
> -	 * Normally this should be covered by above check for compressed read
> -	 * or the next check for NODATASUM.  Just do a quicker exit here.
> -	 */
> -	if (bbio->csum == NULL)
> -		return 0;
> -
> -	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
> -		return 0;
> -
> -	if (unlikely(test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state)))
> -		return 0;
> -
> -	ASSERT(page_offset(page) <= start &&
> -	       end <= page_offset(page) + PAGE_SIZE - 1);
> -	for (pg_off = offset_in_page(start);
> -	     pg_off < offset_in_page(end);
> -	     pg_off += sectorsize, bio_offset += sectorsize) {
> -		u64 file_offset = pg_off + page_offset(page);
> -		int ret;
> -
> -		if (btrfs_is_data_reloc_root(root) &&
> -		    test_range_bit(io_tree, file_offset,
> -				   file_offset + sectorsize - 1,
> -				   EXTENT_NODATASUM, 1, NULL)) {
> -			/* Skip the range without csum for data reloc inode */
> -			clear_extent_bits(io_tree, file_offset,
> -					  file_offset + sectorsize - 1,
> -					  EXTENT_NODATASUM);
> -			continue;
> -		}
> -		ret = btrfs_check_data_csum(inode, bbio, bio_offset, page, pg_off);
> -		if (ret < 0) {
> -			const int nr_bit = (pg_off - offset_in_page(start)) >>
> -				     root->fs_info->sectorsize_bits;
> -
> -			result |= (1U << nr_bit);
> -		}
> -	}
> -	return result;
> +	memzero_bvec(bv);
> +	return false;
>   }
>
>   /*
> @@ -5437,8 +5347,6 @@ void btrfs_evict_inode(struct inode *inode)
>   	if (is_bad_inode(inode))
>   		goto no_delete;
>
> -	btrfs_free_io_failure_record(BTRFS_I(inode), 0, (u64)-1);
> -
>   	if (test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags))
>   		goto no_delete;
>
> @@ -7974,60 +7882,9 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
>   			      dip->file_offset + dip->bytes - 1);
>   	}
>
> -	kfree(dip->csums);
>   	bio_endio(&dip->bio);
>   }
>
> -static void submit_dio_repair_bio(struct inode *inode, struct bio *bio,
> -				  int mirror_num,
> -				  enum btrfs_compression_type compress_type)
> -{
> -	struct btrfs_dio_private *dip = btrfs_bio(bio)->private;
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -
> -	BUG_ON(bio_op(bio) == REQ_OP_WRITE);
> -
> -	refcount_inc(&dip->refs);
> -	btrfs_submit_bio(fs_info, bio, mirror_num);
> -}
> -
> -static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
> -					     struct btrfs_bio *bbio,
> -					     const bool uptodate)
> -{
> -	struct inode *inode = dip->inode;
> -	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> -	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
> -	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> -	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
> -	blk_status_t err = BLK_STS_OK;
> -	struct bvec_iter iter;
> -	struct bio_vec bv;
> -	u32 offset;
> -
> -	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
> -		u64 start = bbio->file_offset + offset;
> -
> -		if (uptodate &&
> -		    (!csum || !btrfs_check_data_csum(inode, bbio, offset, bv.bv_page,
> -					       bv.bv_offset))) {
> -			clean_io_failure(fs_info, failure_tree, io_tree, start,
> -					 bv.bv_page, btrfs_ino(BTRFS_I(inode)),
> -					 bv.bv_offset);
> -		} else {
> -			int ret;
> -
> -			ret = btrfs_repair_one_sector(inode, bbio, offset,
> -					bv.bv_page, bv.bv_offset,
> -					submit_dio_repair_bio);
> -			if (ret)
> -				err = errno_to_blk_status(ret);
> -		}
> -	}
> -
> -	return err;
> -}
> -
>   static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode,
>   						     struct bio *bio,
>   						     u64 dio_file_offset)
> @@ -8041,18 +7898,14 @@ static void btrfs_end_dio_bio(struct btrfs_bio *bbio)
>   	struct bio *bio = &bbio->bio;
>   	blk_status_t err = bio->bi_status;
>
> -	if (err)
> +	if (err) {
>   		btrfs_warn(BTRFS_I(dip->inode)->root->fs_info,
>   			   "direct IO failed ino %llu rw %d,%u sector %#Lx len %u err no %d",
>   			   btrfs_ino(BTRFS_I(dip->inode)), bio_op(bio),
>   			   bio->bi_opf, bio->bi_iter.bi_sector,
>   			   bio->bi_iter.bi_size, err);
> -
> -	if (bio_op(bio) == REQ_OP_READ)
> -		err = btrfs_check_read_dio_bio(dip, bbio, !err);
> -
> -	if (err)
>   		dip->bio.bi_status = err;
> +	}
>
>   	btrfs_record_physical_zoned(dip->inode, bbio->file_offset, bio);
>
> @@ -8064,13 +7917,8 @@ static void btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
>   				 u64 file_offset, int async_submit)
>   {
>   	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	struct btrfs_dio_private *dip = btrfs_bio(bio)->private;
>   	blk_status_t ret;
> -
> -	/* Save the original iter for read repair */
> -	if (btrfs_op(bio) == BTRFS_MAP_READ)
> -		btrfs_bio(bio)->iter = bio->bi_iter;
> -
> +
>   	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
>   		goto map;
>
> @@ -8090,9 +7938,6 @@ static void btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
>   			btrfs_bio_end_io(btrfs_bio(bio), ret);
>   			return;
>   		}
> -	} else {
> -		btrfs_bio(bio)->csum = btrfs_csum_ptr(fs_info, dip->csums,
> -						      file_offset - dip->file_offset);
>   	}
>   map:
>   	btrfs_submit_bio(fs_info, bio, 0);
> @@ -8104,7 +7949,6 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
>   	struct btrfs_dio_private *dip =
>   		container_of(dio_bio, struct btrfs_dio_private, bio);
>   	struct inode *inode = iter->inode;
> -	const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE);
>   	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   	const bool raid56 = (btrfs_data_alloc_profile(fs_info) &
>   			     BTRFS_BLOCK_GROUP_RAID56_MASK);
> @@ -8125,25 +7969,6 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
>   	dip->file_offset = file_offset;
>   	dip->bytes = dio_bio->bi_iter.bi_size;
>   	refcount_set(&dip->refs, 1);
> -	dip->csums = NULL;
> -
> -	if (!write && !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
> -		unsigned int nr_sectors =
> -			(dio_bio->bi_iter.bi_size >> fs_info->sectorsize_bits);
> -
> -		/*
> -		 * Load the csums up front to reduce csum tree searches and
> -		 * contention when submitting bios.
> -		 */
> -		status = BLK_STS_RESOURCE;
> -		dip->csums = kcalloc(nr_sectors, fs_info->csum_size, GFP_NOFS);
> -		if (!dip)
> -			goto out_err;
> -
> -		status = btrfs_lookup_bio_sums(inode, dio_bio, dip->csums);
> -		if (status != BLK_STS_OK)
> -			goto out_err;
> -	}
>
>   	start_sector = dio_bio->bi_iter.bi_sector;
>   	submit_len = dio_bio->bi_iter.bi_size;
> @@ -8171,7 +7996,7 @@ static void btrfs_submit_direct(const struct iomap_iter *iter,
>   		 * the allocation is backed by btrfs_bioset.
>   		 */
>   		bio = btrfs_bio_clone_partial(dio_bio, clone_offset, clone_len,
> -					      btrfs_end_dio_bio, dip);
> +					      inode, btrfs_end_dio_bio, dip);
>   		btrfs_bio(bio)->file_offset = file_offset;
>
>   		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> @@ -8918,12 +8743,9 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
>   	inode = &ei->vfs_inode;
>   	extent_map_tree_init(&ei->extent_tree);
>   	extent_io_tree_init(fs_info, &ei->io_tree, IO_TREE_INODE_IO, inode);
> -	extent_io_tree_init(fs_info, &ei->io_failure_tree,
> -			    IO_TREE_INODE_IO_FAILURE, inode);
>   	extent_io_tree_init(fs_info, &ei->file_extent_tree,
>   			    IO_TREE_INODE_FILE_EXTENT, inode);
>   	ei->io_tree.track_uptodate = true;
> -	ei->io_failure_tree.track_uptodate = true;
>   	atomic_set(&ei->sync_writers, 0);
>   	mutex_init(&ei->log_mutex);
>   	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
> @@ -10370,7 +10192,6 @@ struct btrfs_encoded_read_private {
>   	wait_queue_head_t wait;
>   	atomic_t pending;
>   	blk_status_t status;
> -	bool skip_csum;
>   };
>
>   static blk_status_t submit_encoded_read_bio(struct btrfs_inode *inode,
> @@ -10378,57 +10199,17 @@ static blk_status_t submit_encoded_read_bio(struct btrfs_inode *inode,
>   {
>   	struct btrfs_encoded_read_private *priv = btrfs_bio(bio)->private;
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	blk_status_t ret;
> -
> -	if (!priv->skip_csum) {
> -		ret = btrfs_lookup_bio_sums(&inode->vfs_inode, bio, NULL);
> -		if (ret)
> -			return ret;
> -	}
>
>   	atomic_inc(&priv->pending);
>   	btrfs_submit_bio(fs_info, bio, mirror_num);
>   	return BLK_STS_OK;
>   }
>
> -static blk_status_t btrfs_encoded_read_verify_csum(struct btrfs_bio *bbio)
> -{
> -	const bool uptodate = (bbio->bio.bi_status == BLK_STS_OK);
> -	struct btrfs_encoded_read_private *priv = bbio->private;
> -	struct btrfs_inode *inode = priv->inode;
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	u32 sectorsize = fs_info->sectorsize;
> -	struct bio_vec *bvec;
> -	struct bvec_iter_all iter_all;
> -	u32 bio_offset = 0;
> -
> -	if (priv->skip_csum || !uptodate)
> -		return bbio->bio.bi_status;
> -
> -	bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
> -		unsigned int i, nr_sectors, pgoff;
> -
> -		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> -		pgoff = bvec->bv_offset;
> -		for (i = 0; i < nr_sectors; i++) {
> -			ASSERT(pgoff < PAGE_SIZE);
> -			if (btrfs_check_data_csum(&inode->vfs_inode, bbio, bio_offset,
> -					    bvec->bv_page, pgoff))
> -				return BLK_STS_IOERR;
> -			bio_offset += sectorsize;
> -			pgoff += sectorsize;
> -		}
> -	}
> -	return BLK_STS_OK;
> -}
> -
>   static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
>   {
>   	struct btrfs_encoded_read_private *priv = bbio->private;
> -	blk_status_t status;
>
> -	status = btrfs_encoded_read_verify_csum(bbio);
> -	if (status) {
> +	if (bbio->bio.bi_status) {
>   		/*
>   		 * The memory barrier implied by the atomic_dec_return() here
>   		 * pairs with the memory barrier implied by the
> @@ -10437,11 +10218,10 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
>   		 * write is observed before the load of status in
>   		 * btrfs_encoded_read_regular_fill_pages().
>   		 */
> -		WRITE_ONCE(priv->status, status);
> +		WRITE_ONCE(priv->status, bbio->bio.bi_status);
>   	}
>   	if (!atomic_dec_return(&priv->pending))
>   		wake_up(&priv->wait);
> -	btrfs_bio_free_csum(bbio);
>   	bio_put(&bbio->bio);
>   }
>
> @@ -10454,7 +10234,6 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
>   		.inode = inode,
>   		.file_offset = file_offset,
>   		.pending = ATOMIC_INIT(1),
> -		.skip_csum = (inode->flags & BTRFS_INODE_NODATASUM),
>   	};
>   	unsigned long i = 0;
>   	u64 cur = 0;
> @@ -10490,6 +10269,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
>
>   			if (!bio) {
>   				bio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ,
> +						      &inode->vfs_inode,
>   						      btrfs_encoded_read_endio,
>   						      &priv);
>   				bio->bi_iter.bi_sector =
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dff735e36da96..b8472ab466abe 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -35,6 +35,14 @@
>   #include "zoned.h"
>
>   static struct bio_set btrfs_bioset;
> +static struct bio_set btrfs_repair_bioset;
> +static mempool_t btrfs_failed_bio_pool;
> +
> +struct btrfs_failed_bio {
> +	struct btrfs_bio *bbio;
> +	int num_copies;
> +	atomic_t repair_count;
> +};
>
>   #define BTRFS_BLOCK_GROUP_STRIPE_MASK	(BTRFS_BLOCK_GROUP_RAID0 | \
>   					 BTRFS_BLOCK_GROUP_RAID10 | \
> @@ -6646,10 +6654,11 @@ int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>    * Initialize a btrfs_bio structure.  This skips the embedded bio itself as it
>    * is already initialized by the block layer.
>    */
> -static inline void btrfs_bio_init(struct btrfs_bio *bbio,
> -				  btrfs_bio_end_io_t end_io, void *private)
> +static void btrfs_bio_init(struct btrfs_bio *bbio, struct inode *inode,
> +			   btrfs_bio_end_io_t end_io, void *private)
>   {
>   	memset(bbio, 0, offsetof(struct btrfs_bio, bio));
> +	bbio->inode = inode;
>   	bbio->end_io = end_io;
>   	bbio->private = private;
>   }
> @@ -6662,16 +6671,18 @@ static inline void btrfs_bio_init(struct btrfs_bio *bbio,
>    * a mempool.
>    */
>   struct bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
> -			    btrfs_bio_end_io_t end_io, void *private)
> +			    struct inode *inode, btrfs_bio_end_io_t end_io,
> +			    void *private)
>   {
>   	struct bio *bio;
>
>   	bio = bio_alloc_bioset(NULL, nr_vecs, opf, GFP_NOFS, &btrfs_bioset);
> -	btrfs_bio_init(btrfs_bio(bio), end_io, private);
> +	btrfs_bio_init(btrfs_bio(bio), inode, end_io, private);
>   	return bio;
>   }
>
>   struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
> +				    struct inode *inode,
>   				    btrfs_bio_end_io_t end_io, void *private)
>   {
>   	struct bio *bio;
> @@ -6681,13 +6692,174 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
>
>   	bio = bio_alloc_clone(orig->bi_bdev, orig, GFP_NOFS, &btrfs_bioset);
>   	bbio = btrfs_bio(bio);
> -	btrfs_bio_init(bbio, end_io, private);
> +	btrfs_bio_init(bbio, inode, end_io, private);
>
>   	bio_trim(bio, offset >> 9, size >> 9);
> -	bbio->iter = bio->bi_iter;
>   	return bio;
>   }
>
> +static int next_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
> +{
> +	if (cur_mirror == fbio->num_copies)
> +		return cur_mirror + 1 - fbio->num_copies;
> +	return cur_mirror + 1;
> +}
> +
> +static int prev_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
> +{
> +	if (cur_mirror == 1)
> +		return fbio->num_copies;
> +	return cur_mirror - 1;
> +}
> +
> +static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
> +{
> +	if (atomic_dec_and_test(&fbio->repair_count)) {
> +		fbio->bbio->end_io(fbio->bbio);
> +		mempool_free(fbio, &btrfs_failed_bio_pool);
> +	}
> +}
> +
> +static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
> +				 struct btrfs_device *dev)
> +{
> +	struct btrfs_failed_bio *fbio = repair_bbio->private;
> +	struct inode *inode = repair_bbio->inode;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct bio_vec *bv = bio_first_bvec_all(&repair_bbio->bio);
> +	int mirror = repair_bbio->mirror_num;
> +
> +	if (repair_bbio->bio.bi_status ||
> +	    !btrfs_data_csum_ok(repair_bbio, dev, 0, bv)) {
> +		bio_reset(&repair_bbio->bio, NULL, REQ_OP_READ);
> +		repair_bbio->bio.bi_iter = repair_bbio->saved_iter;
> +
> +		mirror = next_repair_mirror(fbio, mirror);
> +		if (mirror == fbio->bbio->mirror_num) {
> +			btrfs_debug(fs_info, "no mirror left");
> +			fbio->bbio->bio.bi_status = BLK_STS_IOERR;
> +			goto done;
> +		}
> +
> +		btrfs_submit_bio(fs_info, &repair_bbio->bio, mirror);
> +		return;
> +	}
> +
> +	do {
> +		mirror = prev_repair_mirror(fbio, mirror);
> +		btrfs_repair_io_failure(fs_info, btrfs_ino(BTRFS_I(inode)),
> +				  repair_bbio->file_offset, fs_info->sectorsize,
> +				  repair_bbio->saved_iter.bi_sector <<
> +					SECTOR_SHIFT,
> +				  bv->bv_page, bv->bv_offset, mirror);
> +	} while (mirror != fbio->bbio->mirror_num);
> +
> +done:
> +	btrfs_repair_done(fbio);
> +	bio_put(&repair_bbio->bio);
> +}
> +
> +/*
> + * Try to kick off a repair read to the next available mirror for a bad
> + * sector.
> + *
> + * This primarily tries to recover good data to serve the actual read request,
> + * but also tries to write the good data back to the bad mirror(s) when a
> + * read succeeded to restore the redundancy.
> + */
> +static void repair_one_sector(struct btrfs_bio *failed_bbio, u32 bio_offset,
> +			      struct bio_vec *bv,
> +			      struct btrfs_failed_bio **fbio)
> +{
> +	struct inode *inode = failed_bbio->inode;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	const u32 sectorsize = fs_info->sectorsize;
> +	const u64 logical = failed_bbio->saved_iter.bi_sector << SECTOR_SHIFT;
> +	struct btrfs_bio *repair_bbio;
> +	struct bio *repair_bio;
> +	int num_copies;
> +	int mirror;
> +
> +	btrfs_debug(fs_info, "repair read error: read error at %llu",
> +		    failed_bbio->file_offset + bio_offset);
> +
> +	num_copies = btrfs_num_copies(fs_info, logical, sectorsize);
> +	if (num_copies == 1) {
> +		btrfs_debug(fs_info, "no copy to repair from");
> +		failed_bbio->bio.bi_status = BLK_STS_IOERR;
> +		return;
> +	}
> +
> +	if (!*fbio) {
> +		*fbio = mempool_alloc(&btrfs_failed_bio_pool, GFP_NOFS);
> +		(*fbio)->bbio = failed_bbio;
> +		(*fbio)->num_copies = num_copies;
> +		atomic_set(&(*fbio)->repair_count, 1);
> +	}
> +
> +	atomic_inc(&(*fbio)->repair_count);
> +
> +	repair_bio = bio_alloc_bioset(NULL, 1, REQ_OP_READ, GFP_NOFS,
> +				      &btrfs_repair_bioset);
> +	repair_bio->bi_iter.bi_sector = failed_bbio->saved_iter.bi_sector;
> +	bio_add_page(repair_bio, bv->bv_page, bv->bv_len, bv->bv_offset);
> +
> +	repair_bbio = btrfs_bio(repair_bio);
> +	btrfs_bio_init(repair_bbio, failed_bbio->inode, NULL, *fbio);
> +	repair_bbio->file_offset = failed_bbio->file_offset + bio_offset;
> +
> +	mirror = next_repair_mirror(*fbio, failed_bbio->mirror_num);
> +	btrfs_debug(fs_info, "submitting repair read to mirror %d", mirror);
> +	btrfs_submit_bio(fs_info, repair_bio, mirror);
> +}
> +
> +static void btrfs_check_read_bio(struct btrfs_bio *bbio,
> +				 struct btrfs_device *dev)
> +{
> +	struct inode *inode = bbio->inode;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	unsigned int sectorsize = fs_info->sectorsize;
> +	struct bvec_iter *iter = &bbio->saved_iter;
> +	blk_status_t status = bbio->bio.bi_status;
> +	struct btrfs_failed_bio *fbio = NULL;
> +	u32 offset = 0;
> +
> +	/*
> +	 * Hand off repair bios to the repair code as there is no upper level
> +	 * submitter for them.
> +	 */
> +	if (unlikely(bbio->bio.bi_pool == &btrfs_repair_bioset)) {
> +		btrfs_end_repair_bio(bbio, dev);
> +		return;
> +	}
> +
> +	/* Metadata reads are checked and repaired by the submitter */
> +	if (bbio->bio.bi_opf & REQ_META)
> +		goto done;
> +
> +	/* Clear the I/O error.  A failed repair will reset it */
> +	bbio->bio.bi_status = BLK_STS_OK;
> +
> +	while (iter->bi_size) {
> +		struct bio_vec bv = bio_iter_iovec(&bbio->bio, *iter);
> +
> +		bv.bv_len = min(bv.bv_len, sectorsize);
> +		if (status || !btrfs_data_csum_ok(bbio, dev, offset, &bv))
> +			repair_one_sector(bbio, offset, &bv, &fbio);
> +
> +		bio_advance_iter_single(&bbio->bio, iter, sectorsize);
> +		offset += sectorsize;
> +	}
> +
> +	if (bbio->csum != bbio->csum_inline)
> +		kfree(bbio->csum);
> +done:
> +	if (unlikely(fbio))
> +		btrfs_repair_done(fbio);
> +	else
> +		bbio->end_io(bbio);
> +}
> +
>   static void btrfs_log_dev_io_error(struct bio *bio, struct btrfs_device *dev)
>   {
>   	if (!dev || !dev->bdev)
> @@ -6716,18 +6888,19 @@ static void btrfs_end_bio_work(struct work_struct *work)
>   	struct btrfs_bio *bbio =
>   		container_of(work, struct btrfs_bio, end_io_work);
>
> -	bbio->end_io(bbio);
> +	btrfs_check_read_bio(bbio, bbio->bio.bi_private);
>   }
>
>   static void btrfs_simple_end_io(struct bio *bio)
>   {
> -	struct btrfs_fs_info *fs_info = bio->bi_private;
>   	struct btrfs_bio *bbio = btrfs_bio(bio);
> +	struct btrfs_device *dev = bio->bi_private;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(bbio->inode->i_sb);
>
>   	btrfs_bio_counter_dec(fs_info);
>
>   	if (bio->bi_status)
> -		btrfs_log_dev_io_error(bio, bbio->device);
> +		btrfs_log_dev_io_error(bio, dev);
>
>   	if (bio_op(bio) == REQ_OP_READ) {
>   		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
> @@ -6744,7 +6917,10 @@ static void btrfs_raid56_end_io(struct bio *bio)
>
>   	btrfs_bio_counter_dec(bioc->fs_info);
>   	bbio->mirror_num = bioc->mirror_num;
> -	bbio->end_io(bbio);
> +	if (bio_op(bio) == REQ_OP_READ)
> +		btrfs_check_read_bio(bbio, NULL);
> +	else
> +		bbio->end_io(bbio);
>
>   	btrfs_put_bioc(bioc);
>   }
> @@ -6852,6 +7028,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
>
>   void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror_num)
>   {
> +	struct btrfs_bio *bbio = btrfs_bio(bio);
>   	u64 logical = bio->bi_iter.bi_sector << 9;
>   	u64 length = bio->bi_iter.bi_size;
>   	u64 map_length = length;
> @@ -6862,11 +7039,8 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror
>   	btrfs_bio_counter_inc_blocked(fs_info);
>   	ret = __btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
>   				&bioc, &smap, &mirror_num, 1);
> -	if (ret) {
> -		btrfs_bio_counter_dec(fs_info);
> -		btrfs_bio_end_io(btrfs_bio(bio), errno_to_blk_status(ret));
> -		return;
> -	}
> +	if (ret)
> +		goto fail;
>
>   	if (map_length < length) {
>   		btrfs_crit(fs_info,
> @@ -6875,12 +7049,22 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror
>   		BUG();
>   	}
>
> +	/*
> +	 * Save the iter for the end_io handler and preload the checksums for
> +	 * data reads.
> +	 */
> +	if (bio_op(bio) == REQ_OP_READ && !(bio->bi_opf & REQ_META)) {
> +		bbio->saved_iter = bio->bi_iter;
> +		ret = btrfs_lookup_bio_sums(bbio);
> +		if (ret)
> +			goto fail;
> +	}
> +
>   	if (!bioc) {
>   		/* Single mirror read/write fast path */
>   		btrfs_bio(bio)->mirror_num = mirror_num;
> -		btrfs_bio(bio)->device = smap.dev;
>   		bio->bi_iter.bi_sector = smap.physical >> SECTOR_SHIFT;
> -		bio->bi_private = fs_info;
> +		bio->bi_private = smap.dev;
>   		bio->bi_end_io = btrfs_simple_end_io;
>   		btrfs_submit_dev_bio(smap.dev, bio);
>   	} else if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
> @@ -6900,6 +7084,11 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, int mirror
>   		for (dev_nr = 0; dev_nr < total_devs; dev_nr++)
>   			btrfs_submit_mirrored_bio(bioc, dev_nr);
>   	}
> +
> +	return;
> +fail:
> +	btrfs_bio_counter_dec(fs_info);
> +	btrfs_bio_end_io(bbio, errno_to_blk_status(ret));
>   }
>
>   /*
> @@ -8499,10 +8688,25 @@ int __init btrfs_bioset_init(void)
>   			offsetof(struct btrfs_bio, bio),
>   			BIOSET_NEED_BVECS))
>   		return -ENOMEM;
> +	if (bioset_init(&btrfs_repair_bioset, BIO_POOL_SIZE,
> +			offsetof(struct btrfs_bio, bio),
> +			BIOSET_NEED_BVECS))
> +		goto out_free_bioset;
> +	if (mempool_init_kmalloc_pool(&btrfs_failed_bio_pool, BIO_POOL_SIZE,
> +				      sizeof(struct btrfs_failed_bio)))
> +		goto out_free_repair_bioset;
>   	return 0;
> +
> +out_free_repair_bioset:
> +	bioset_exit(&btrfs_repair_bioset);
> +out_free_bioset:
> +	bioset_exit(&btrfs_bioset);
> +	return -ENOMEM;
>   }
>
>   void __cold btrfs_bioset_exit(void)
>   {
> +	mempool_exit(&btrfs_failed_bio_pool);
> +	bioset_exit(&btrfs_repair_bioset);
>   	bioset_exit(&btrfs_bioset);
>   }
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index b368356fa78a1..58c4156caa736 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -364,27 +364,28 @@ struct btrfs_fs_devices {
>   typedef void (*btrfs_bio_end_io_t)(struct btrfs_bio *bbio);
>
>   /*
> - * Additional info to pass along bio.
> - *
> - * Mostly for btrfs specific features like csum and mirror_num.
> + * Highlevel btrfs I/O structure.  It is allocated by btrfs_bio_alloc and
> + * passed to btrfs_submit_bio for mapping to the physical devices.
>    */
>   struct btrfs_bio {
> -	unsigned int mirror_num;
> -
> -	/* for direct I/O */
> +	/* Inode and offset into it that this I/O operates on. */
> +	struct inode *inode;
>   	u64 file_offset;
>
> -	/* @device is for stripe IO submission. */
> -	struct btrfs_device *device;
> +	/*
> +	 * Checksumming and original I/O information for internal use in the
> +	 * btrfs_submit_bio machinery.
> +	 */
>   	u8 *csum;
>   	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> -	struct bvec_iter iter;
> +	struct bvec_iter saved_iter;
>
>   	/* End I/O information supplied to btrfs_bio_alloc */
>   	btrfs_bio_end_io_t end_io;
>   	void *private;
>
> -	/* For read end I/O handling */
> +	/* For internal use in read end I/O handling */
> +	unsigned int mirror_num;
>   	struct work_struct end_io_work;
>
>   	/*
> @@ -403,8 +404,10 @@ int __init btrfs_bioset_init(void);
>   void __cold btrfs_bioset_exit(void);
>
>   struct bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
> -			    btrfs_bio_end_io_t end_io, void *private);
> +			    struct inode *inode, btrfs_bio_end_io_t end_io,
> +			    void *private);
>   struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
> +				    struct inode *inode,
>   				    btrfs_bio_end_io_t end_io, void *private);
>
>   static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
> @@ -413,30 +416,6 @@ static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
>   	bbio->end_io(bbio);
>   }
>
> -static inline void btrfs_bio_free_csum(struct btrfs_bio *bbio)
> -{
> -	if (bbio->csum != bbio->csum_inline) {
> -		kfree(bbio->csum);
> -		bbio->csum = NULL;
> -	}
> -}
> -
> -/*
> - * Iterate through a btrfs_bio (@bbio) on a per-sector basis.
> - *
> - * bvl        - struct bio_vec
> - * bbio       - struct btrfs_bio
> - * iters      - struct bvec_iter
> - * bio_offset - unsigned int
> - */
> -#define btrfs_bio_for_each_sector(fs_info, bvl, bbio, iter, bio_offset)	\
> -	for ((iter) = (bbio)->iter, (bio_offset) = 0;			\
> -	     (iter).bi_size &&					\
> -	     (((bvl) = bio_iter_iovec((&(bbio)->bio), (iter))), 1);	\
> -	     (bio_offset) += fs_info->sectorsize,			\
> -	     bio_advance_iter_single(&(bbio)->bio, &(iter),		\
> -	     (fs_info)->sectorsize))
> -
>   struct btrfs_io_stripe {
>   	struct btrfs_device *dev;
>   	union {
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index f8a4118b16574..ed50e81174bf4 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -84,7 +84,6 @@ struct raid56_bio_trace_info;
>   	EM( IO_TREE_FS_EXCLUDED_EXTENTS,  "EXCLUDED_EXTENTS")	    \
>   	EM( IO_TREE_BTREE_INODE_IO,	  "BTREE_INODE_IO")	    \
>   	EM( IO_TREE_INODE_IO,		  "INODE_IO")		    \
> -	EM( IO_TREE_INODE_IO_FAILURE,	  "INODE_IO_FAILURE")	    \
>   	EM( IO_TREE_RELOC_BLOCKS,	  "RELOC_BLOCKS")	    \
>   	EM( IO_TREE_TRANS_DIRTY_PAGES,	  "TRANS_DIRTY_PAGES")      \
>   	EM( IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES")   \

^ permalink raw reply	[flat|nested] 108+ messages in thread
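
As an aside for readers of the archive: the mirror rotation implemented by
next_repair_mirror() and prev_repair_mirror() in the hunk above is compact
enough to model in userspace.  The sketch below copies just that arithmetic
into a small test program (struct failed_bio_model, next_mirror(),
prev_mirror() and main() are scaffolding invented for illustration, not
kernel code) and shows that the forward walk visits every other mirror
exactly once before arriving back at the failed one, while the write-back
walk runs in the opposite direction and ends on the failed mirror:

    #include <stdio.h>

    /* reduced model: only the field the rotation arithmetic needs */
    struct failed_bio_model {
            int num_copies;
    };

    /* same arithmetic as next_repair_mirror() in the patch above */
    static int next_mirror(const struct failed_bio_model *f, int cur)
    {
            if (cur == f->num_copies)
                    return cur + 1 - f->num_copies; /* wrap N -> 1 */
            return cur + 1;
    }

    /* same arithmetic as prev_repair_mirror() in the patch above */
    static int prev_mirror(const struct failed_bio_model *f, int cur)
    {
            if (cur == 1)
                    return f->num_copies; /* wrap 1 -> N */
            return cur - 1;
    }

    int main(void)
    {
            struct failed_bio_model f = { .num_copies = 3 };
            int failed = 2;

            /* btrfs_end_repair_bio() stops retrying once the walk
             * returns to the mirror that originally failed */
            for (int m = next_mirror(&f, failed); m != failed;
                 m = next_mirror(&f, m))
                    printf("read repair tries mirror %d\n", m);

            /* the write-back do/while walks backwards from the mirror
             * that produced good data, rewriting mirrors up to and
             * including the one that failed */
            int m = 3; /* pretend mirror 3 returned good data */
            do {
                    m = prev_mirror(&f, m);
                    printf("write good copy back to mirror %d\n", m);
            } while (m != failed);

            return 0;
    }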

* Re: [PATCH 06/17] btrfs: handle recording of zoned writes in the storage layer
  2022-09-01  7:42 ` [PATCH 06/17] btrfs: handle recording of zoned writes " Christoph Hellwig
@ 2022-09-01  9:44   ` Johannes Thumshirn
  2022-09-07 20:36   ` Josef Bacik
  2022-09-12  6:11   ` Naohiro Aota
  2 siblings, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-01  9:44 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios
  2022-09-01  7:42 ` [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios Christoph Hellwig
@ 2022-09-01  9:47   ` Johannes Thumshirn
  2022-09-07 20:55   ` Josef Bacik
  2022-09-12  0:20   ` Qu Wenruo
  2 siblings, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-01  9:47 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 10/17] btrfs: remove stripe boundary calculation for compressed I/O
  2022-09-01  7:42 ` [PATCH 10/17] btrfs: remove stripe boundary calculation for compressed I/O Christoph Hellwig
@ 2022-09-01  9:56   ` Johannes Thumshirn
  2022-09-05  6:49     ` Christoph Hellwig
  2022-09-07 21:07   ` Josef Bacik
  1 sibling, 1 reply; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-01  9:56 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On 01.09.22 09:43, Christoph Hellwig wrote:
> +	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +		struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
> +		struct extent_map *em;
>  
> -	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> -		bio_set_dev(bio, em->map_lookup->stripes[0].dev->bdev);
> +		em = btrfs_get_chunk_map(fs_info, disk_bytenr,
> +					 fs_info->sectorsize);
> +		if (IS_ERR(em)) {
> +			bio_put(bio);
> +			return ERR_CAST(em);
> +		}

Please use btrfs_get_zoned_device() instead of open coding it.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 11/17] btrfs: remove stripe boundary calculation for encoded I/O
  2022-09-01  7:42 ` [PATCH 11/17] btrfs: remove stripe boundary calculation for encoded I/O Christoph Hellwig
@ 2022-09-01  9:58   ` Johannes Thumshirn
  2022-09-07 21:08   ` Josef Bacik
  1 sibling, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-01  9:58 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 13/17] btrfs: remove submit_encoded_read_bio
  2022-09-01  7:42 ` [PATCH 13/17] btrfs: remove submit_encoded_read_bio Christoph Hellwig
@ 2022-09-01 10:02   ` Johannes Thumshirn
  2022-09-07 21:11   ` Josef Bacik
  1 sibling, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-01 10:02 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 14/17] btrfs: remove now spurious bio submission helpers
  2022-09-01  7:42 ` [PATCH 14/17] btrfs: remove now spurious bio submission helpers Christoph Hellwig
@ 2022-09-01 10:14   ` Johannes Thumshirn
  2022-09-07 21:12   ` Josef Bacik
  1 sibling, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-01 10:14 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND
  2022-09-01  7:42 ` [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND Christoph Hellwig
@ 2022-09-01 10:46   ` Johannes Thumshirn
  2022-09-02  1:38   ` Damien Le Moal
  2022-09-07 21:18   ` Josef Bacik
  2 siblings, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-01 10:46 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode
  2022-09-01  7:42 ` [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode Christoph Hellwig
@ 2022-09-01 11:28   ` Johannes Thumshirn
  2022-09-05  6:50     ` Christoph Hellwig
  2022-09-02  1:56   ` Damien Le Moal
  1 sibling, 1 reply; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-01 11:28 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On 01.09.22 09:43, Christoph Hellwig wrote:
> To be able to split a write into properly sized zone append commands,
> we need a queue_limits structure that contains the least common
> denominator suitable for all devices.
> 

This patch conflicts with Shinichiro's patch restoring functionality 
of the zone emulation mode.

^ permalink raw reply	[flat|nested] 108+ messages in thread
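
The "least common denominator" wording in the quoted commit message boils
down to a min-reduction over the per-device limits.  A userspace sketch of
that idea (struct dev_limits_model, its two fields and the example numbers
are invented stand-ins for illustration; the actual patch operates on the
kernel's struct queue_limits):

    #include <stdio.h>

    /* stand-in for the limits that matter when sizing zone append
     * bios; not the kernel's struct queue_limits */
    struct dev_limits_model {
            unsigned int max_zone_append_sectors;
            unsigned int max_segments;
    };

    static unsigned int min_u32(unsigned int a, unsigned int b)
    {
            return a < b ? a : b;
    }

    /* combine per-device limits into one file-system-wide limit by
     * taking the minimum of each field, so a bio split to the
     * combined limit is acceptable to every device */
    static struct dev_limits_model
    combine_limits(const struct dev_limits_model *devs, int nr)
    {
            struct dev_limits_model lim = devs[0];

            for (int i = 1; i < nr; i++) {
                    lim.max_zone_append_sectors =
                            min_u32(lim.max_zone_append_sectors,
                                    devs[i].max_zone_append_sectors);
                    lim.max_segments =
                            min_u32(lim.max_segments, devs[i].max_segments);
            }
            return lim;
    }

    int main(void)
    {
            struct dev_limits_model devs[] = {
                    { .max_zone_append_sectors = 2048, .max_segments = 128 },
                    { .max_zone_append_sectors = 1024, .max_segments = 64 },
            };
            struct dev_limits_model lim = combine_limits(devs, 2);

            printf("fs-wide limit: %u sectors, %u segments\n",
                   lim.max_zone_append_sectors, lim.max_segments);
            return 0;
    }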

* Re: [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND
  2022-09-01  7:42 ` [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND Christoph Hellwig
  2022-09-01 10:46   ` Johannes Thumshirn
@ 2022-09-02  1:38   ` Damien Le Moal
  2022-09-05  6:50     ` Christoph Hellwig
  2022-09-07 21:18   ` Josef Bacik
  2 siblings, 1 reply; 108+ messages in thread
From: Damien Le Moal @ 2022-09-02  1:38 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, Johannes Thumshirn, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On 9/1/22 16:42, Christoph Hellwig wrote:
> No users left now that btrfs takes REQ_OP_WRITE bios from iomap and
> splits and converts them to REQ_OP_ZONE_APPEND internally.

Hu... I wanted to use that for zonefs for doing ZONE APPEND with AIOs...
Need to revisit that code anyway, so fine for now.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/direct-io.c  | 10 ++--------
>  include/linux/iomap.h |  1 -
>  2 files changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 4eb559a16c9ed..9e883a9f80388 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -217,16 +217,10 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>  {
>  	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
>  
> -	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> -		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> +	if (!(dio->flags & IOMAP_DIO_WRITE))
>  		return REQ_OP_READ;
> -	}
> -
> -	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> -		opflags |= REQ_OP_ZONE_APPEND;
> -	else
> -		opflags |= REQ_OP_WRITE;
>  
> +	opflags |= REQ_OP_WRITE;
>  	if (use_fua)
>  		opflags |= REQ_FUA;
>  	else
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 238a03087e17e..ee6d511ef29dd 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -55,7 +55,6 @@ struct vm_fault;
>  #define IOMAP_F_SHARED		0x04
>  #define IOMAP_F_MERGED		0x08
>  #define IOMAP_F_BUFFER_HEAD	0x10
> -#define IOMAP_F_ZONE_APPEND	0x20
>  
>  /*
>   * Flags set by the core iomap code during operations:

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 108+ messages in thread
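
With the IOMAP_F_ZONE_APPEND branch gone, the op selection shown in the
hunk above reduces to a plain read/write decision plus the optional FUA
bit.  A userspace model of that control flow (the REQ_* values here are
made-up stand-in constants, not the kernel's definitions, and the FUA
bookkeeping on dio->flags is omitted):

    #include <stdio.h>

    /* stand-in flag values; only the combinations matter here */
    enum {
            REQ_OP_READ  = 0x0,
            REQ_OP_WRITE = 0x1,
            REQ_SYNC     = 0x10,
            REQ_IDLE     = 0x20,
            REQ_FUA      = 0x40,
    };

    /* mirrors the simplified iomap_dio_bio_opflags(): reads return
     * the bare read op, writes get the hints and maybe FUA */
    static unsigned int dio_bio_opflags(int is_write, int use_fua)
    {
            unsigned int opflags = REQ_SYNC | REQ_IDLE;

            if (!is_write)
                    return REQ_OP_READ;

            opflags |= REQ_OP_WRITE;
            if (use_fua)
                    opflags |= REQ_FUA;
            return opflags;
    }

    int main(void)
    {
            printf("read:      0x%x\n", dio_bio_opflags(0, 0));
            printf("write:     0x%x\n", dio_bio_opflags(1, 0));
            printf("write+fua: 0x%x\n", dio_bio_opflags(1, 1));
            return 0;
    }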

* Re: [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio
  2022-09-01  7:42 ` [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio Christoph Hellwig
@ 2022-09-02  1:46   ` Damien Le Moal
  2022-09-05  6:55     ` Christoph Hellwig
  2022-09-05 13:15   ` Johannes Thumshirn
  2022-09-07 21:17   ` Josef Bacik
  2 siblings, 1 reply; 108+ messages in thread
From: Damien Le Moal @ 2022-09-02  1:46 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On 9/1/22 16:42, Christoph Hellwig wrote:
> The current btrfs zoned device support is a little cumbersome in the data
> I/O path as it requires the callers to not support more I/O than the
> supported ZONE_APPEND size by the underlying device.  This leads to a lot

Did you mean: "...as it requires the callers to not issue I/O larger than
the supported ZONE_APPEND size for the underlying device." ?
I think you do mean that :)

> of extra accounting.  Instead change btrfs_submit_bio so that it can take
> write bios of arbitrary size and form from the upper layers, and just
> split them internally to the ZONE_APPEND queue limits.  Then remove all
> the upper layer warts catering to limited write sized on zoned devices,
> including the extra refcount in the compressed_bio.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/compression.c | 112 ++++++++---------------------------------
>  fs/btrfs/compression.h |   3 --
>  fs/btrfs/extent_io.c   |  74 ++++++---------------------
>  fs/btrfs/inode.c       |   4 --
>  fs/btrfs/volumes.c     |  40 +++++++++------
>  fs/btrfs/zoned.c       |  20 --------
>  fs/btrfs/zoned.h       |   9 ----
>  7 files changed, 62 insertions(+), 200 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 5e8b75b030ace..f89cac08dc4a4 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -255,57 +255,14 @@ static void btrfs_finish_compressed_write_work(struct work_struct *work)
>  static void end_compressed_bio_write(struct btrfs_bio *bbio)
>  {
>  	struct compressed_bio *cb = bbio->private;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
>  
> -	if (bbio->bio.bi_status)
> -		cb->status = bbio->bio.bi_status;
> -
> -	if (refcount_dec_and_test(&cb->pending_ios)) {
> -		struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
> +	cb->status = bbio->bio.bi_status;
> +	queue_work(fs_info->compressed_write_workers, &cb->write_end_work);
>  
> -		queue_work(fs_info->compressed_write_workers, &cb->write_end_work);
> -	}
>  	bio_put(&bbio->bio);
>  }
>  
> -/*
> - * Allocate a compressed_bio, which will be used to read/write on-disk
> - * (aka, compressed) * data.
> - *
> - * @cb:                 The compressed_bio structure, which records all the needed
> - *                      information to bind the compressed data to the uncompressed
> - *                      page cache.
> - * @disk_byten:         The logical bytenr where the compressed data will be read
> - *                      from or written to.
> - * @endio_func:         The endio function to call after the IO for compressed data
> - *                      is finished.
> - */
> -static struct bio *alloc_compressed_bio(struct compressed_bio *cb, u64 disk_bytenr,
> -					blk_opf_t opf,
> -					btrfs_bio_end_io_t endio_func)
> -{
> -	struct bio *bio;
> -
> -	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, cb->inode, endio_func, cb);
> -	bio->bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
> -
> -	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> -		struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
> -		struct extent_map *em;
> -
> -		em = btrfs_get_chunk_map(fs_info, disk_bytenr,
> -					 fs_info->sectorsize);
> -		if (IS_ERR(em)) {
> -			bio_put(bio);
> -			return ERR_CAST(em);
> -		}
> -
> -		bio_set_dev(bio, em->map_lookup->stripes[0].dev->bdev);
> -		free_extent_map(em);
> -	}
> -	refcount_inc(&cb->pending_ios);
> -	return bio;
> -}
> -
>  /*
>   * worker function to build and submit bios for previously compressed pages.
>   * The corresponding pages in the inode should be marked for writeback
> @@ -329,16 +286,12 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>  	struct compressed_bio *cb;
>  	u64 cur_disk_bytenr = disk_start;
>  	blk_status_t ret = BLK_STS_OK;
> -	const bool use_append = btrfs_use_zone_append(inode, disk_start);
> -	const enum req_op bio_op = REQ_BTRFS_ONE_ORDERED |
> -		(use_append ? REQ_OP_ZONE_APPEND : REQ_OP_WRITE);
>  
>  	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
>  	       IS_ALIGNED(len, fs_info->sectorsize));
>  	cb = kmalloc(sizeof(struct compressed_bio), GFP_NOFS);
>  	if (!cb)
>  		return BLK_STS_RESOURCE;
> -	refcount_set(&cb->pending_ios, 1);
>  	cb->status = BLK_STS_OK;
>  	cb->inode = &inode->vfs_inode;
>  	cb->start = start;
> @@ -349,8 +302,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>  	INIT_WORK(&cb->write_end_work, btrfs_finish_compressed_write_work);
>  	cb->nr_pages = nr_pages;
>  
> -	if (blkcg_css)
> +	if (blkcg_css) {
>  		kthread_associate_blkcg(blkcg_css);
> +		write_flags |= REQ_CGROUP_PUNT;
> +	}
> +
> +	write_flags |= REQ_BTRFS_ONE_ORDERED;
> +	bio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_WRITE | write_flags,
> +			      cb->inode, end_compressed_bio_write, cb);
> +	bio->bi_iter.bi_sector = cur_disk_bytenr >> SECTOR_SHIFT;
>  
>  	while (cur_disk_bytenr < disk_start + compressed_len) {
>  		u64 offset = cur_disk_bytenr - disk_start;
> @@ -358,19 +318,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>  		unsigned int real_size;
>  		unsigned int added;
>  		struct page *page = compressed_pages[index];
> -		bool submit = false;
> -
> -		/* Allocate new bio if submitted or not yet allocated */
> -		if (!bio) {
> -			bio = alloc_compressed_bio(cb, cur_disk_bytenr,
> -				bio_op | write_flags, end_compressed_bio_write);
> -			if (IS_ERR(bio)) {
> -				ret = errno_to_blk_status(PTR_ERR(bio));
> -				break;
> -			}
> -			if (blkcg_css)
> -				bio->bi_opf |= REQ_CGROUP_PUNT;
> -		}
> +
>  		/*
>  		 * We have various limits on the real read size:
>  		 * - page boundary
> @@ -380,36 +328,21 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>  		real_size = min_t(u64, real_size, compressed_len - offset);
>  		ASSERT(IS_ALIGNED(real_size, fs_info->sectorsize));
>  
> -		if (use_append)
> -			added = bio_add_zone_append_page(bio, page, real_size,
> -					offset_in_page(offset));
> -		else
> -			added = bio_add_page(bio, page, real_size,
> -					offset_in_page(offset));
> -		/* Reached zoned boundary */
> -		if (added == 0)
> -			submit = true;
> -
> +		added = bio_add_page(bio, page, real_size, offset_in_page(offset));
> +		/*
> +		 * Maximum compressed extent is smaller than bio size limit,
> +		 * thus bio_add_page() should always success.
> +		 * thus bio_add_page() should always succeed.
> +		ASSERT(added == real_size);
>  		cur_disk_bytenr += added;
> -
> -		/* Finished the range */
> -		if (cur_disk_bytenr == disk_start + compressed_len)
> -			submit = true;
> -
> -		if (submit) {
> -			ASSERT(bio->bi_iter.bi_size);
> -			btrfs_bio(bio)->file_offset = start;
> -			btrfs_submit_bio(fs_info, bio, 0);
> -			bio = NULL;
> -		}
> -		cond_resched();
>  	}
>  
> +	/* Finished the range */
> +	ASSERT(bio->bi_iter.bi_size);
> +	btrfs_bio(bio)->file_offset = start;
> +	btrfs_submit_bio(fs_info, bio, 0);
>  	if (blkcg_css)
>  		kthread_associate_blkcg(NULL);
> -
> -	if (refcount_dec_and_test(&cb->pending_ios))
> -		finish_compressed_bio_write(cb);
>  	return ret;
>  }
>  
> @@ -613,7 +546,6 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  		goto out;
>  	}
>  
> -	refcount_set(&cb->pending_ios, 1);
>  	cb->status = BLK_STS_OK;
>  	cb->inode = inode;
>  
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index 1aa02903de697..25876f7a26949 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -30,9 +30,6 @@ static_assert((BTRFS_MAX_COMPRESSED % PAGE_SIZE) == 0);
>  #define	BTRFS_ZLIB_DEFAULT_LEVEL		3
>  
>  struct compressed_bio {
> -	/* Number of outstanding bios */
> -	refcount_t pending_ios;
> -
>  	/* Number of compressed pages in the array */
>  	unsigned int nr_pages;
>  
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 33e80f8dd0b1b..40dadc46e00d8 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2597,7 +2597,6 @@ static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
>  	u32 real_size;
>  	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
>  	bool contig = false;
> -	int ret;
>  
>  	ASSERT(bio);
>  	/* The limit should be calculated when bio_ctrl->bio is allocated */
> @@ -2646,12 +2645,7 @@ static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
>  	if (real_size == 0)
>  		return 0;
>  
> -	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> -		ret = bio_add_zone_append_page(bio, page, real_size, pg_offset);
> -	else
> -		ret = bio_add_page(bio, page, real_size, pg_offset);
> -
> -	return ret;
> +	return bio_add_page(bio, page, real_size, pg_offset);
>  }
>  
>  static void calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
> @@ -2666,7 +2660,7 @@ static void calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
>  	 * to them.
>  	 */
>  	if (bio_ctrl->compress_type == BTRFS_COMPRESS_NONE &&
> -	    bio_op(bio_ctrl->bio) == REQ_OP_ZONE_APPEND) {
> +	    btrfs_use_zone_append(inode, logical)) {
>  		ordered = btrfs_lookup_ordered_extent(inode, file_offset);
>  		if (ordered) {
>  			bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
> @@ -2680,17 +2674,15 @@ static void calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
>  	bio_ctrl->len_to_oe_boundary = U32_MAX;
>  }
>  
> -static int alloc_new_bio(struct btrfs_inode *inode,
> -			 struct btrfs_bio_ctrl *bio_ctrl,
> -			 struct writeback_control *wbc,
> -			 blk_opf_t opf,
> -			 btrfs_bio_end_io_t end_io_func,
> -			 u64 disk_bytenr, u32 offset, u64 file_offset,
> -			 enum btrfs_compression_type compress_type)
> +static void alloc_new_bio(struct btrfs_inode *inode,
> +			  struct btrfs_bio_ctrl *bio_ctrl,
> +			  struct writeback_control *wbc, blk_opf_t opf,
> +			  btrfs_bio_end_io_t end_io_func,
> +			  u64 disk_bytenr, u32 offset, u64 file_offset,
> +			  enum btrfs_compression_type compress_type)
>  {
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>  	struct bio *bio;
> -	int ret;
>  
>  	bio = btrfs_bio_alloc(BIO_MAX_VECS, opf, &inode->vfs_inode, end_io_func,
>  			      NULL);
> @@ -2708,40 +2700,14 @@ static int alloc_new_bio(struct btrfs_inode *inode,
>  
>  	if (wbc) {
>  		/*
> -		 * For Zone append we need the correct block_device that we are
> -		 * going to write to set in the bio to be able to respect the
> -		 * hardware limitation.  Look it up here:
> +		 * Pick the last added device to support cgroup writeback.  For
> +		 * multi-device file systems this means blk-cgroup policies have
> +		 * to always be set on the last added/replaced device.
> +		 * This is a bit odd but has been like that for a long time.
>  		 */
> -		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> -			struct btrfs_device *dev;
> -
> -			dev = btrfs_zoned_get_device(fs_info, disk_bytenr,
> -						     fs_info->sectorsize);
> -			if (IS_ERR(dev)) {
> -				ret = PTR_ERR(dev);
> -				goto error;
> -			}
> -
> -			bio_set_dev(bio, dev->bdev);
> -		} else {
> -			/*
> -			 * Otherwise pick the last added device to support
> -			 * cgroup writeback.  For multi-device file systems this
> -			 * means blk-cgroup policies have to always be set on the
> -			 * last added/replaced device.  This is a bit odd but has
> -			 * been like that for a long time.
> -			 */
> -			bio_set_dev(bio, fs_info->fs_devices->latest_dev->bdev);
> -		}
> +		bio_set_dev(bio, fs_info->fs_devices->latest_dev->bdev);
>  		wbc_init_bio(wbc, bio);
> -	} else {
> -		ASSERT(bio_op(bio) != REQ_OP_ZONE_APPEND);
>  	}
> -	return 0;
> -error:
> -	bio_ctrl->bio = NULL;
> -	btrfs_bio_end_io(btrfs_bio(bio), errno_to_blk_status(ret));
> -	return ret;
>  }
>  
>  /*
> @@ -2767,7 +2733,6 @@ static int submit_extent_page(blk_opf_t opf,
>  			      enum btrfs_compression_type compress_type,
>  			      bool force_bio_submit)
>  {
> -	int ret = 0;
>  	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
>  	unsigned int cur = pg_offset;
>  
> @@ -2784,12 +2749,9 @@ static int submit_extent_page(blk_opf_t opf,
>  
>  		/* Allocate new bio if needed */
>  		if (!bio_ctrl->bio) {
> -			ret = alloc_new_bio(inode, bio_ctrl, wbc, opf,
> -					    end_io_func, disk_bytenr, offset,
> -					    page_offset(page) + cur,
> -					    compress_type);
> -			if (ret < 0)
> -				return ret;
> +			alloc_new_bio(inode, bio_ctrl, wbc, opf, end_io_func,
> +				      disk_bytenr, offset,
> +				      page_offset(page) + cur, compress_type);
>  		}
>  		/*
>  		 * We must go through btrfs_bio_add_page() to ensure each
> @@ -3354,10 +3316,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>  		 * find_next_dirty_byte() are all exclusive
>  		 */
>  		iosize = min(min(em_end, end + 1), dirty_range_end) - cur;
> -
> -		if (btrfs_use_zone_append(inode, em->block_start))
> -			op = REQ_OP_ZONE_APPEND;
> -
>  		free_extent_map(em);
>  		em = NULL;
>  
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9c562d36e4570..1a0bf381f2437 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7727,10 +7727,6 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>  	iomap->offset = start;
>  	iomap->bdev = fs_info->fs_devices->latest_dev->bdev;
>  	iomap->length = len;
> -
> -	if (write && btrfs_use_zone_append(BTRFS_I(inode), em->block_start))
> -		iomap->flags |= IOMAP_F_ZONE_APPEND;
> -
>  	free_extent_map(em);
>  
>  	return 0;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e497b63238189..0d828b58cc9c3 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6632,13 +6632,22 @@ struct bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
>  	return bio;
>  }
>  
> -static struct bio *btrfs_split_bio(struct bio *orig, u64 map_length)
> +static struct bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
> +				   struct bio *orig, u64 map_length,
> +				   bool use_append)
>  {
>  	struct btrfs_bio *orig_bbio = btrfs_bio(orig);
>  	struct bio *bio;
>  
> -	bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
> -			&btrfs_clone_bioset);
> +	if (use_append) {
> +		unsigned int nr_segs;
> +
> +		bio = bio_split_rw(orig, &fs_info->limits, &nr_segs,
> +				   &btrfs_clone_bioset, map_length);
> +	} else {
> +		bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
> +				&btrfs_clone_bioset);
> +	}
>  	btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode, NULL, orig_bbio);
>  
>  	btrfs_bio(bio)->file_offset = orig_bbio->file_offset;
> @@ -6970,16 +6979,10 @@ static void btrfs_submit_dev_bio(struct btrfs_device *dev, struct bio *bio)
>  	 */
>  	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
>  		u64 physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +		u64 zone_start = round_down(physical, dev->fs_info->zone_size);
>  
> -		if (btrfs_dev_is_sequential(dev, physical)) {
> -			u64 zone_start = round_down(physical,
> -						    dev->fs_info->zone_size);
> -
> -			bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
> -		} else {
> -			bio->bi_opf &= ~REQ_OP_ZONE_APPEND;
> -			bio->bi_opf |= REQ_OP_WRITE;
> -		}
> +		ASSERT(btrfs_dev_is_sequential(dev, physical));
> +		bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
>  	}
>  	btrfs_debug_in_rcu(dev->fs_info,
>  	"%s: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
> @@ -7179,9 +7182,11 @@ static bool btrfs_submit_chunk(struct btrfs_fs_info *fs_info, struct bio *bio,
>  			       int mirror_num)
>  {
>  	struct btrfs_bio *bbio = btrfs_bio(bio);
> +	struct btrfs_inode *bi = BTRFS_I(bbio->inode);
>  	u64 logical = bio->bi_iter.bi_sector << 9;
>  	u64 length = bio->bi_iter.bi_size;
>  	u64 map_length = length;
> +	bool use_append = btrfs_use_zone_append(bi, logical);
>  	struct btrfs_io_context *bioc = NULL;
>  	struct btrfs_io_stripe smap;
>  	int ret;
> @@ -7193,8 +7198,11 @@ static bool btrfs_submit_chunk(struct btrfs_fs_info *fs_info, struct bio *bio,
>  		goto fail;
>  
>  	map_length = min(map_length, length);
> +	if (use_append)
> +		map_length = min(map_length, fs_info->max_zone_append_size);
> +
>  	if (map_length < length) {
> -		bio = btrfs_split_bio(bio, map_length);
> +		bio = btrfs_split_bio(fs_info, bio, map_length, use_append);
>  		bbio = btrfs_bio(bio);
>  	}
>  
> @@ -7210,9 +7218,9 @@ static bool btrfs_submit_chunk(struct btrfs_fs_info *fs_info, struct bio *bio,
>  	}
>  
>  	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
> -		struct btrfs_inode *bi = BTRFS_I(bbio->inode);
> -
> -		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +		if (use_append) {
> +			bio->bi_opf &= ~REQ_OP_WRITE;
> +			bio->bi_opf |= REQ_OP_ZONE_APPEND;
>  			ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
>  			if (ret)
>  				goto fail_put_bio;
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 6e04fbbd76b92..988e9fc5a6b7b 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1818,26 +1818,6 @@ int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
>  	return btrfs_zoned_issue_zeroout(tgt_dev, physical_pos, length);
>  }
>  
> -struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
> -					    u64 logical, u64 length)
> -{
> -	struct btrfs_device *device;
> -	struct extent_map *em;
> -	struct map_lookup *map;
> -
> -	em = btrfs_get_chunk_map(fs_info, logical, length);
> -	if (IS_ERR(em))
> -		return ERR_CAST(em);
> -
> -	map = em->map_lookup;
> -	/* We only support single profile for now */
> -	device = map->stripes[0].dev;
> -
> -	free_extent_map(em);
> -
> -	return device;
> -}
> -
>  /**
>   * Activate block group and underlying device zones
>   *
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index 0f22b22fe359f..74153ab52169f 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -64,8 +64,6 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
>  int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
>  int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
>  				  u64 physical_start, u64 physical_pos);
> -struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
> -					    u64 logical, u64 length);
>  bool btrfs_zone_activate(struct btrfs_block_group *block_group);
>  int btrfs_zone_finish(struct btrfs_block_group *block_group);
>  bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices, u64 flags);
> @@ -209,13 +207,6 @@ static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev,
>  	return -EOPNOTSUPP;
>  }
>  
> -static inline struct btrfs_device *btrfs_zoned_get_device(
> -						  struct btrfs_fs_info *fs_info,
> -						  u64 logical, u64 length)
> -{
> -	return ERR_PTR(-EOPNOTSUPP);
> -}
> -
>  static inline bool btrfs_zone_activate(struct btrfs_block_group *block_group)
>  {
>  	return true;

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode
  2022-09-01  7:42 ` [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode Christoph Hellwig
  2022-09-01 11:28   ` Johannes Thumshirn
@ 2022-09-02  1:56   ` Damien Le Moal
  2022-09-02  1:59     ` Damien Le Moal
  2022-09-05  6:54     ` Christoph Hellwig
  1 sibling, 2 replies; 108+ messages in thread
From: Damien Le Moal @ 2022-09-02  1:56 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On 9/1/22 16:42, Christoph Hellwig wrote:
> To be able to split a write into properly sized zone append commands,
> we need a queue_limits structure that contains the least common
> denominator suitable for all devices.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/ctree.h |  4 +++-
>  fs/btrfs/zoned.c | 36 ++++++++++++++++++------------------
>  fs/btrfs/zoned.h |  1 -
>  3 files changed, 21 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5e57e3c6a1fd6..a37129363e184 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1071,8 +1071,10 @@ struct btrfs_fs_info {
>  	 */
>  	u64 zone_size;
>  
> -	/* Max size to emit ZONE_APPEND write command */
> +	/* Constraints for ZONE_APPEND commands: */
> +	struct queue_limits limits;
>  	u64 max_zone_append_size;

Can't we get rid of this one and have the code directly use
fs_info->limits.max_zone_append_sectors through a little helper doing the
conversion to bytes (a 9-bit shift)?
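
E.g. a rough sketch (untested, helper name made up):

static inline u64 btrfs_max_zone_append_size(const struct btrfs_fs_info *fs_info)
{
	/* max_zone_append_sectors is in 512-byte sectors */
	return (u64)fs_info->limits.max_zone_append_sectors << SECTOR_SHIFT;
}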

[...]
>  	/* Count zoned devices */
>  	list_for_each_entry(device, &fs_devices->devices, dev_list) {
>  		enum blk_zoned_model model;
> @@ -685,11 +677,9 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>  				ret = -EINVAL;
>  				goto out;
>  			}
> -			if (!max_zone_append_size ||
> -			    (zone_info->max_zone_append_size &&
> -			     zone_info->max_zone_append_size < max_zone_append_size))
> -				max_zone_append_size =
> -					zone_info->max_zone_append_size;
> +			blk_stack_limits(lim,
> +					 &bdev_get_queue(device->bdev)->limits,
> +					 0);

This does:

	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
                                        b->max_zone_append_sectors);

So if we are mixing zoned and non-zoned devices in a multi-dev volume,
we'll end up with max_zone_append_sectors being 0. The previous code
prevented that.

Note that I am not sure if it is allowed to mix zoned and non-zoned drives
in the same volume. Given that we have a fake zone emulation for non-zoned
drives with zoned btrfs, I do not see why it would not work. But I may be
wrong.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode
  2022-09-02  1:56   ` Damien Le Moal
@ 2022-09-02  1:59     ` Damien Le Moal
  2022-09-05  6:54     ` Christoph Hellwig
  1 sibling, 0 replies; 108+ messages in thread
From: Damien Le Moal @ 2022-09-02  1:59 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On 9/2/22 10:56, Damien Le Moal wrote:
> On 9/1/22 16:42, Christoph Hellwig wrote:
>> To be able to split a write into properly sized zone append commands,
>> we need a queue_limits structure that contains the least common
>> denominator suitable for all devices.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>>  fs/btrfs/ctree.h |  4 +++-
>>  fs/btrfs/zoned.c | 36 ++++++++++++++++++------------------
>>  fs/btrfs/zoned.h |  1 -
>>  3 files changed, 21 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 5e57e3c6a1fd6..a37129363e184 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1071,8 +1071,10 @@ struct btrfs_fs_info {
>>  	 */
>>  	u64 zone_size;
>>  
>> -	/* Max size to emit ZONE_APPEND write command */
>> +	/* Constraints for ZONE_APPEND commands: */
>> +	struct queue_limits limits;
>>  	u64 max_zone_append_size;
> 
> Can't we get rid of this one and have the code directly use
> fs_info->limits.max_zone_append_sectors through a little helper doing the
> conversion to bytes (a 9-bit shift)?

Note: Only a suggestion, not sure that would be much of a cleanup.

> 
> [...]
>>  	/* Count zoned devices */
>>  	list_for_each_entry(device, &fs_devices->devices, dev_list) {
>>  		enum blk_zoned_model model;
>> @@ -685,11 +677,9 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>>  				ret = -EINVAL;
>>  				goto out;
>>  			}
>> -			if (!max_zone_append_size ||
>> -			    (zone_info->max_zone_append_size &&
>> -			     zone_info->max_zone_append_size < max_zone_append_size))
>> -				max_zone_append_size =
>> -					zone_info->max_zone_append_size;
>> +			blk_stack_limits(lim,
>> +					 &bdev_get_queue(device->bdev)->limits,
>> +					 0);
> 
> This does:
> 
> 	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
>                                         b->max_zone_append_sectors);
> 
> So if we are mixing zoned and non-zoned devices in a multi-dev volume,
> we'll end up with max_zone_append_sectors being 0. The previous code
> prevented that.
> 
> Note that I am not sure if it is allowed to mix zoned and non-zoned drives
> in the same volume. Given that we have a fake zone emulation for non-zoned
> drives with zoned btrfs, I do not see why it would not work. But I may be
> wrong.
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (16 preceding siblings ...)
  2022-09-01  7:42 ` [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND Christoph Hellwig
@ 2022-09-02 15:18 ` Johannes Thumshirn
  2022-09-07  9:10 ` code placement for bio / storage layer code Christoph Hellwig
  2022-10-24  8:12 ` consolidate btrfs checksumming, repair and bio splitting Johannes Thumshirn
  19 siblings, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-02 15:18 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On 01.09.22 09:42, Christoph Hellwig wrote:
> Hi all,
> 
> this series moves a large amount of duplicate code below btrfs_submit_bio
> into what I call the 'storage' layer.  Instead of duplicating code to
> checksum, check checksums and repair and split bios in all the caller
> of btrfs_submit_bio (buffered I/O, direct I/O, compressed I/O, encoded
> I/O), the work is done one in a central place, often more optiomal and
> without slight changes in behavior.  Once that is done the upper layers
> also don't need to split the bios for extent boundaries, as the storage
> layer can do that itself, including splitting the bios for the zone
> append limits for zoned I/O.
> 
> The split work is inspired by an earlier series from Qu, from which it
> also reuses a few patches.
> 
> Note: this adds a fair amount of code to volumes.c, which already is
> quite large.  It might make sense to add a prep patch to move
> btrfs_submit_bio into a new bio.c file, but I only want to do that
> if we have agreement on the move as the conflicts will be painful
> when rebasing.

This series on top of misc-next passes my usual zoned null_blk fstests
setup without regressions.

Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 01/17] block: export bio_split_rw
  2022-09-01  8:54   ` Qu Wenruo
@ 2022-09-05  6:44     ` Christoph Hellwig
  2022-09-05  6:51       ` Qu Wenruo
  0 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-05  6:44 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Thu, Sep 01, 2022 at 04:54:32PM +0800, Qu Wenruo wrote:
> I found the queue_limits structure pretty scary, given that only a very
> limited set of its members is used in this case:
>
> - lim->virt_boundary_mask
>   Used in bvec_gap_to_prev()
>
> - lim->max_segments
>
> - lim->seg_boundary_mask
> - lim->max_segment_size
>   Used in bvec_split_segs()
>
> - lim->logical_block_size
>
> I'm not familiar with the block layer, so I'm wondering whether btrfs
> really needs a full queue_limits structure to call bio_split_rw().

Well, the queue_limits structure is what the block layer uses for communicating
the I/O size limitations, and thus both bio_split_rw and the stacking
layer helpers operate on it. 
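
For reference, this is the prototype exported in patch 1, which takes the
whole limits structure:

struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
		unsigned *nr_segs, struct bio_set *bs, unsigned max_bytes);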

> Or can we have a simplified wrapper?

I don't think we can simplify anything here.  The alternative would
be to open code the I/O path logic, which means a lot more code that
needs to be maintained and has a high probability of getting out of sync
with the block layer logic.  So I'd much rather share this code
between everything that stacks block devices, be that to represent
another block device on top like dm/md or for a 'direct' stacking
in the file system like btrfs does.

> IIRC inside btrfs we only need two cases for bio split:
>
> - Split for stripe boundary
>
> - Split for OE/zoned boundary

No.  For zoned devices we need all the limitations for bios, basically
all that you mentioned above.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer
  2022-09-01  9:04   ` Qu Wenruo
@ 2022-09-05  6:48     ` Christoph Hellwig
  2022-09-05  6:59       ` Qu Wenruo
  0 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-05  6:48 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Thu, Sep 01, 2022 at 05:04:34PM +0800, Qu Wenruo wrote:
> But for the verification part, I still don't like the idea of putting
> the verification code at endio context at all.

Why?

>> This is especially true when data and metadata are still doing different
>> checksum verification at different times.

Note that this does not handle the metadata checksum verification at
all.  Both because it actually works very differently and I could not
verify that we'd actually always read all data that needs to be verified
together for metadata, but also because there is zero metadata repair
coverage in xfstests, so I don't dare to touch that code.

> Can we just let the endio function to do the IO, and let the reader to
> do the verification after all needed data is read out?

What would the benefit be?  It would bring back a lot of the duplicate
(and thus inconsistent) code that is removed here, and make splitting the
bios under btrfs_submit_bio much more complicated and expensive.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 10/17] btrfs: remove stripe boundary calculation for compressed I/O
  2022-09-01  9:56   ` Johannes Thumshirn
@ 2022-09-05  6:49     ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-05  6:49 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 09:56:05AM +0000, Johannes Thumshirn wrote:
> On 01.09.22 09:43, Christoph Hellwig wrote:
> > +	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> > +		struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
> > +		struct extent_map *em;
> >  
> > -	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> > -		bio_set_dev(bio, em->map_lookup->stripes[0].dev->bdev);
> > +		em = btrfs_get_chunk_map(fs_info, disk_bytenr,
> > +					 fs_info->sectorsize);
> > +		if (IS_ERR(em)) {
> > +			bio_put(bio);
> > +			return ERR_CAST(em);
> > +		}
> 
> Please use btrfs_zoned_get_device() instead of open coding it.

I thought of that, but decided against doing it in this patch, as it is
an unrelated change, and moved it to a separate cleanup patch.  And then
I noticed that btrfs_zoned_get_device goes away later in the series
entirely, so I dropped that patch again.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND
  2022-09-02  1:38   ` Damien Le Moal
@ 2022-09-05  6:50     ` Christoph Hellwig
  2022-09-05  6:57       ` Damien Le Moal
  0 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-05  6:50 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Naohiro Aota, Johannes Thumshirn, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On Fri, Sep 02, 2022 at 10:38:50AM +0900, Damien Le Moal wrote:
> On 9/1/22 16:42, Christoph Hellwig wrote:
> > No users left now that btrfs takes REQ_OP_WRITE bios from iomap and
> > splits and converts them to REQ_OP_ZONE_APPEND internally.
> 
> Hu... I wanted to use that for zonefs for doing ZONE APPEND with AIOs...
> Need to revisit that code anyway, so fine for now.

We could resurrect it.  But I suspect that you're better off doing
what btrfs does here - let iomap submit a write bio and then split
it in the submit_bio hook.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode
  2022-09-01 11:28   ` Johannes Thumshirn
@ 2022-09-05  6:50     ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-05  6:50 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 11:28:02AM +0000, Johannes Thumshirn wrote:
> On 01.09.22 09:43, Christoph Hellwig wrote:
> > To be able to split a write into properly sized zone append commands,
> > we need a queue_limits structure that contains the least common
> > denominator suitable for all devices.
> > 
> 
> This patch conflicts with Shinichiro's patch restoring functionality 
> of the zone emulation mode.

Looks like that got into misc-next in the meantime, so I'll rebase on
top of that.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 01/17] block: export bio_split_rw
  2022-09-05  6:44     ` Christoph Hellwig
@ 2022-09-05  6:51       ` Qu Wenruo
  0 siblings, 0 replies; 108+ messages in thread
From: Qu Wenruo @ 2022-09-05  6:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel



On 2022/9/5 14:44, Christoph Hellwig wrote:
> On Thu, Sep 01, 2022 at 04:54:32PM +0800, Qu Wenruo wrote:
>> I found the queue_limits structure pretty scary, given that only a very
>> limited set of its members is used in this case:
>>
>> - lim->virt_boundary_mask
>>    Used in bvec_gap_to_prev()
>>
>> - lim->max_segments
>>
>> - lim->seg_boundary_mask
>> - lim->max_segment_size
>>    Used in bvec_split_segs()
>>
>> - lim->logical_block_size
>>
>> I'm not familiar with the block layer, so I'm wondering whether btrfs
>> really needs a full queue_limits structure to call bio_split_rw().
>
> Well, the queue_limits structure is what the block layer uses for communicating
> the I/O size limitations, and thus both bio_split_rw and the stacking
> layer helpers operate on it.
>
>> Or can we have a simplified wrapper?
>
> I don't think we can simplify anything here.  The alternative would
> be to open code the I/O path logic, which means a lot more code that
> needs to be maintained and has a high probability of getting out of sync
> with the block layer logic.  So I'd much rather share this code
> between everything that stacks block devices, be that to represent
> another block device on top like dm/md or for a 'direct' stacking
> in the file system like btrfs does.
>
>> IIRC inside btrfs we only need two cases for bio split:
>>
>> - Split for stripe boundary
>>
>> - Split for OE/zoned boundary
>
> No.  For zoned devices we need all the limitations for bios, basically
> all that you mentioned above.

OK, that explains the reason for exporting the full queue_limits.

Then it makes sense to me now.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode
  2022-09-02  1:56   ` Damien Le Moal
  2022-09-02  1:59     ` Damien Le Moal
@ 2022-09-05  6:54     ` Christoph Hellwig
  1 sibling, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-05  6:54 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Fri, Sep 02, 2022 at 10:56:40AM +0900, Damien Le Moal wrote:
> > -	/* Max size to emit ZONE_APPEND write command */
> > +	/* Constraints for ZONE_APPEND commands: */
> > +	struct queue_limits limits;
> >  	u64 max_zone_append_size;
> 
> Can't we get rid of this one and have the code directly use
> fs_info->limits.max_zone_append_sectors through a little helper doing the
> conversion to bytes (a 9-bit shift)?

Well, the helper would be a little more complicated, doing three
different shifts, a min3 and an ALIGN_DOWN.  That's why I thought
I'd rather cache the value than recalculate it on every write.  But
either way would be entirely feasible.
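
Roughly, the value being cached would be computed once at mount time like
this (sketch):

	fs_info->max_zone_append_size = ALIGN_DOWN(
		min3((u64)lim->max_zone_append_sectors << SECTOR_SHIFT,
		     (u64)lim->max_sectors << SECTOR_SHIFT,
		     (u64)lim->max_segments << PAGE_SHIFT),
		fs_info->sectorsize);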

> This does:
> 
> 	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
>                                         b->max_zone_append_sectors);
> 
> So if we are mixing zoned and non-zoned devices in a multi-dev volume,
> we'll end up with max_zone_append_sectors being 0. The previous code
> prevented that.
> 
> Note that I am not sure if it is allowed to mix zoned and non-zoned drives
> in the same volume. Given that we have a fake zone emulation for non-zoned
> drives with zoned btrfs, I do not see why it would not work. But I may be
> wrong.

Yes, this could be problematic.  If we want to support that, I wonder
whether we need to initialize max_zone_append_sectors to max_hw_sectors
by default in the block layer, with a separate flag to indicate whether
zone append is actually supported.
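
A rough sketch of a variant of that idea on the btrfs side (untested,
using min_not_zero instead of a new flag):

	/* in the per-device loop of btrfs_check_zoned_mode() */
	struct queue_limits *blim = &bdev_get_queue(device->bdev)->limits;
	unsigned int zas = blim->max_zone_append_sectors ?:
			   blim->max_hw_sectors;

	lim->max_zone_append_sectors =
		min_not_zero(lim->max_zone_append_sectors, zas);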

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio
  2022-09-02  1:46   ` Damien Le Moal
@ 2022-09-05  6:55     ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-05  6:55 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Fri, Sep 02, 2022 at 10:46:13AM +0900, Damien Le Moal wrote:
> On 9/1/22 16:42, Christoph Hellwig wrote:
> > The current btrfs zoned device support is a little cumbersome in the data
> > I/O path as it requires the callers to not support more I/O than the
> > supported ZONE_APPEND size by the underlying device.  This leads to a lot
> 
> Did you mean: "...as it requires the callers to not issue I/O larger than
> the supported ZONE_APPEND size for the underlying device." ?
> I think you do mean that :)

Yes.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND
  2022-09-05  6:50     ` Christoph Hellwig
@ 2022-09-05  6:57       ` Damien Le Moal
  0 siblings, 0 replies; 108+ messages in thread
From: Damien Le Moal @ 2022-09-05  6:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On 9/5/22 15:50, Christoph Hellwig wrote:
> On Fri, Sep 02, 2022 at 10:38:50AM +0900, Damien Le Moal wrote:
>> On 9/1/22 16:42, Christoph Hellwig wrote:
>>> No users left now that btrfs takes REQ_OP_WRITE bios from iomap and
>>> splits and converts them to REQ_OP_ZONE_APPEND internally.
>>
>> Hu... I wanted to use that for zonefs for doing ZONE APPEND with AIOs...
>> Need to revisit that code anyway, so fine for now.
> 
> We could resurrect it.  But I suspect that you're better off doing
> what btrfs does here - let iomap submit a write bio and then split
> it in the submit_bio hook.

Nope, we cannot do that for zonefs: the data mapping is implied directly
by the written offset (there is no metadata), so we cannot split an async
write into multiple zone append BIOs, since the data of a single aio write
may end up all mingled due to possible reordering of the fragment BIOs.

But the need for a split can be checked before submission and an error
returned if the aio write is too large. So this method of using the
submit_bio hook still works and will simply turn a write into a zone
append if the file was opened with O_APPEND.
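
Roughly, the pre-submission check could look like this (untested sketch,
function name made up):

static bool zonefs_fits_one_zone_append(struct super_block *sb, size_t count)
{
	unsigned int max_sectors =
		queue_max_zone_append_sectors(bdev_get_queue(sb->s_bdev));

	/* reject async writes that would have to be split */
	return count <= (u64)max_sectors << SECTOR_SHIFT;
}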

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer
  2022-09-05  6:48     ` Christoph Hellwig
@ 2022-09-05  6:59       ` Qu Wenruo
  2022-09-05 14:31         ` Christoph Hellwig
  0 siblings, 1 reply; 108+ messages in thread
From: Qu Wenruo @ 2022-09-05  6:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel



On 2022/9/5 14:48, Christoph Hellwig wrote:
> On Thu, Sep 01, 2022 at 05:04:34PM +0800, Qu Wenruo wrote:
>> But for the verification part, I still don't like the idea of putting
>> the verification code at endio context at all.
>
> Why?

Mostly due to the fact that metadata and data go separate ways for
verification.

All the verification for data happens at endio time.

Part of the verification of metadata (bytenr, csum, level,
tree-checker) happens at endio time, but the transid and the checks
against the parent are all done at btrfs_read_extent_buffer() time.

This also means the read-repair happens at a different time.

>
>> This is especially true when data and metadata are still doing different
>> checksum verification at different times.
>
> Note that this does not handle the metadata checksum verification at
> all.  Both because it actually works very different and I could not
> verify that we'd actually always read all data that needs to be verified
> together for metadata, but also because there is zero metadata repair
> coverage in xfstests, so I don't dare to touch that code.
>
>> Can we just let the endio function to do the IO, and let the reader to
>> do the verification after all needed data is read out?
>
> What would the benefit be?  It will lead to a lot of duplicate (and thus
> inconsistent) code that is removed here, and make splitting the bios
> under btrfs_submit_bio much more complicated and expensive.

You're right, my initial suggestion is not good at all.


But what about putting all the needed metadata info (first key, level,
transid etc.) into the bbio as well (using a union that takes the same
space as the data csums), so that all verification and read repair can
happen at endio time, the same as for data?

Although this may force us to submit metadata reads for every tree
block separately, I guess it's more or less feasible, since metadata
reads are more like random reads than sequential ones.

This way we can also eliminate the duplicated read-repair code between
metadata and data.
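
Something like this in struct btrfs_bio (rough sketch, field names made
up):

	union {
		/* data reads: csums to verify at endio time */
		u8 *csum;
		/* metadata reads: expected values to verify at endio time */
		struct {
			struct btrfs_key first_key;
			u64 transid;
			u8 level;
		};
	};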

Thanks,
Qu

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio
  2022-09-01  7:42 ` [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio Christoph Hellwig
  2022-09-02  1:46   ` Damien Le Moal
@ 2022-09-05 13:15   ` Johannes Thumshirn
  2022-09-05 14:25     ` Christoph Hellwig
  2022-09-07 21:17   ` Josef Bacik
  2 siblings, 1 reply; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-05 13:15 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On 01.09.22 09:43, Christoph Hellwig wrote:
> -		if (btrfs_dev_is_sequential(dev, physical)) {
> -			u64 zone_start = round_down(physical,
> -						    dev->fs_info->zone_size);
> -
> -			bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
> -		} else {
> -			bio->bi_opf &= ~REQ_OP_ZONE_APPEND;
> -			bio->bi_opf |= REQ_OP_WRITE;
> -		}
> +		ASSERT(btrfs_dev_is_sequential(dev, physical));
> +		bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;

That ASSERT() will trigger on conventional zones, won't it?

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio
  2022-09-05 13:15   ` Johannes Thumshirn
@ 2022-09-05 14:25     ` Christoph Hellwig
  2022-09-05 14:31       ` Johannes Thumshirn
  0 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-05 14:25 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On Mon, Sep 05, 2022 at 01:15:16PM +0000, Johannes Thumshirn wrote:
> On 01.09.22 09:43, Christoph Hellwig wrote:
> > +		ASSERT(btrfs_dev_is_sequential(dev, physical));
> > +		bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
> 
> That ASSERT() will trigger on conventional zones, won't it?

The assert is inside a

	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {

btrfs_submit_chunk only sets the op to REQ_OP_ZONE_APPEND when
btrfs_use_zone_append returns true, which excludes conventional zones.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer
  2022-09-05  6:59       ` Qu Wenruo
@ 2022-09-05 14:31         ` Christoph Hellwig
  2022-09-05 22:34           ` Qu Wenruo
  0 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-05 14:31 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Mon, Sep 05, 2022 at 02:59:33PM +0800, Qu Wenruo wrote:
>> Mostly due to the fact that metadata and data go separate ways for
>> verification.
>
> All the verification for data happens at endio time.

Yes.

>> Part of the verification of metadata (bytenr, csum, level,
>> tree-checker) happens at endio time, but the transid and the checks
>> against the parent are all done at btrfs_read_extent_buffer() time.
>
>> This also means the read-repair happens at a different time.

Yes.  Read-repair for metadata currently is very different from that
for data.  But that is something that already exists and is not new
in this series.

>> But what about putting all the needed metadata info (first key, level,
>> transid etc.) into the bbio as well (using a union that takes the same
>> space as the data csums), so that all verification and read repair can
>> happen at endio time, the same as for data?

I thought about that.  And I suspect it probably is the right thing
to do.  I mostly stayed away from it because it doesn't really
help with the goal of this series, and I also don't have good enough
test coverage to feel comfortable touching the metadata checksum
handling and repair.  I can offer this sneaky deal: if someone
helps create good metadata repair coverage in xfstests, I will look
into this next.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio
  2022-09-05 14:25     ` Christoph Hellwig
@ 2022-09-05 14:31       ` Johannes Thumshirn
  2022-09-05 14:39         ` Christoph Hellwig
  0 siblings, 1 reply; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-05 14:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On 05.09.22 16:25, Christoph Hellwig wrote:
> On Mon, Sep 05, 2022 at 01:15:16PM +0000, Johannes Thumshirn wrote:
>> On 01.09.22 09:43, Christoph Hellwig wrote:
>>> +		ASSERT(btrfs_dev_is_sequential(dev, physical));
>>> +		bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
>>
>> That ASSERT() will trigger on conventional zones, won't it?
> 
> The assert is inside a
> 
> 	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> 
> btrfs_submit_chunk only sets the op to REQ_OP_ZONE_APPEND when
> btrfs_use_zone_append returns true, which excludes conventional zones.
> 


hmm I got that one triggered with fsx:

+ /home/johannes/src/fstests/ltp/fsx -d /mnt/test/test                                                                                                                                
Seed set to 1                                                   
main: filesystem does not support fallocate mode 0, disabling!                  
main: filesystem does not support fallocate mode FALLOC_FL_KEEP_SIZE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_ZERO_RANGE, disabling!                                                                                                                                                                                                                                                                                           
main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!      
main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!        
1 mapwrite      0x27d31 thru    0x3171f (0x99ef bytes)                          
[    2.399348] assertion failed: btrfs_dev_is_sequential(dev, physical), in fs/btrfs/volumes.c:7034
[    2.400881] ------------[ cut here ]------------                             
[    2.401677] kernel BUG at fs/btrfs/ctree.h:3772!                                                                                                                                                                                                                                                                                                                         
[    2.402463] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI                                 
[    2.402943] CPU: 0 PID: 233 Comm: fsx Not tainted 6.0.0-rc3-raid-stripe-tree-bio-split #313
[    2.402943] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[    2.402943] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]                                                                                                                     
[    2.402943] Code: 83 ff ff 48 89 d9 48 89 ea 48 c7 c6 48 d9 1d a0 eb e5 89 f1 48 c7 c2 68 51 1d a0 48 89 fe 48 c7 c7 b0 d9 1d a0 e8 83 b0 4f e1 <0f> 0b be bf 16 00 00 48 c7 c7 d8 d9 1d a0 e8 d5 ff ff ff 49 8b 85
[    2.402943] RSP: 0018:ffffc9000015f8a8 EFLAGS: 00010286         
[    2.402943] RAX: 0000000000000054 RBX: ffff888103f35428 RCX: 0000000000000000
[    2.402943] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00000000ffffffff                                                                                                      
[    2.402943] RBP: ffff88811ad33148 R08: 00000000ffffefff R09: 00000000ffffefff                                                                                                      
[    2.402943] R10: ffffffff8203cf80 R11: ffffffff8203cf80 R12: ffff88811ad330c0
[    2.402943] R13: 0000000000000002 R14: 0000000000000002 R15: ffff8881004957e8           
[    2.402943] FS:  00007f87c5d23740(0000) GS:ffff888627c00000(0000) knlGS:0000000000000000                                                                                                                                                                                                                                                                                 
[    2.402943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033          
[    2.402943] CR2: 00007f87c5ca0000 CR3: 0000000103fa8000 CR4: 00000000000006b0
[    2.402943] Call Trace:                                                                                                                                                            
[    2.402943]  <TASK>                                                          
[    2.402943]  btrfs_submit_dev_bio.cold+0x11/0x11 [btrfs]                                                                                                                           
[    2.402943]  __btrfs_submit_bio+0x8e/0x150 [btrfs]                           
[    2.402943]  btrfs_submit_chunk+0x12e/0x450 [btrfs]                                                                                                                                
[    2.402943]  btrfs_submit_bio+0x1e/0x30 [btrfs]                   
[    2.402943]  submit_one_bio+0x89/0xc0 [btrfs]                                                                                                                                      
[    2.402943]  extent_write_locked_range+0x1d9/0x1f0 [btrfs]            
[    2.402943]  run_delalloc_zoned+0x74/0x160 [btrfs]                                      
[    2.402943]  btrfs_run_delalloc_range+0x16f/0x5e0 [btrfs]                               
[    2.402943]  ? find_lock_delalloc_range+0x27b/0x290 [btrfs]             
[    2.402943]  writepage_delalloc+0xb9/0x180 [btrfs]                
[    2.402943]  __extent_writepage+0x17f/0x340 [btrfs]                
[    2.402943]  extent_write_cache_pages+0x193/0x410 [btrfs]      
[    2.402943]  ? rt_mutex_trylock+0x2b/0x90
[    2.402943]  extent_writepages+0x60/0xe0 [btrfs]
[    2.402943]  do_writepages+0xac/0x180                                                                                                                                              
[    2.402943]  ? balance_dirty_pages_ratelimited_flags+0xcd/0xb10   
[    2.402943]  ? btrfs_inode_rsv_release+0x52/0xe0 [btrfs]                 
[    2.402943]  ? preempt_count_add+0x4e/0xb0                                 
[    2.402943]  filemap_fdatawrite_range+0x76/0x80
[    2.402943]  start_ordered_ops.constprop.0+0x37/0x80 [btrfs]              
[    2.402943]  btrfs_sync_file+0xb7/0x500 [btrfs]                    
[    2.402943]  __do_sys_msync+0x1dd/0x310                                  
[    2.402943]  do_syscall_64+0x42/0x90                                                    
[    2.402943]  entry_SYSCALL_64_after_hwframe+0x63/0xcd               
[    2.402943] RIP: 0033:0x7f87c5e31197           
[    2.402943] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 1a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[    2.402943] RSP: 002b:00007fff06ca0788 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
[    2.402943] RAX: ffffffffffffffda RBX: 0000000000000d31 RCX: 00007f87c5e31197
[    2.402943] RDX: 0000000000000004 RSI: 000000000000a720 RDI: 00007f87c5c96000
[    2.402943] RBP: 0000000000027d31 R08: 0000000000000000 R09: 0000000000027000
[    2.402943] R10: 00007f87c5d33578 R11: 0000000000000246 R12: 00000000000099ef           
[    2.402943] R13: 000000000000a720 R14: 00007f87c5c96000 R15: 00007f87c5f6b000
[    2.402943]  </TASK>
[    2.402943] Modules linked in: btrfs blake2b_generic xor lzo_compress zlib_deflate raid6_pq zstd_decompress zstd_compress xxhash null_blk
[    2.402943] Dumping ftrace buffer:
[    2.402943]    (ftrace buffer empty)

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio
  2022-09-05 14:31       ` Johannes Thumshirn
@ 2022-09-05 14:39         ` Christoph Hellwig
  2022-09-05 14:43           ` Johannes Thumshirn
  2022-09-05 15:30           ` Johannes Thumshirn
  0 siblings, 2 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-05 14:39 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On Mon, Sep 05, 2022 at 02:31:53PM +0000, Johannes Thumshirn wrote:
> hmm I got that one triggered with fsx:
> 
> + /home/johannes/src/fstests/ltp/fsx -d /mnt/test/test

Odd.  Is this a raid stripe tree setup where one copy is using zone
append and another isn't?  Because without that I can't see how
this would happen.  If not, can you send me the reproducer including
the mkfs line?


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio
  2022-09-05 14:39         ` Christoph Hellwig
@ 2022-09-05 14:43           ` Johannes Thumshirn
  2022-09-05 15:30           ` Johannes Thumshirn
  1 sibling, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-05 14:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On 05.09.22 16:40, Christoph Hellwig wrote:
> On Mon, Sep 05, 2022 at 02:31:53PM +0000, Johannes Thumshirn wrote:
>> hmm I got that one triggered with fsx:
>>
>> + /home/johannes/src/fstests/ltp/fsx -d /mnt/test/test
> 
> Odd.  Is this a raid stripe tree setup where one copy is using zone
> append and another isn't?  Because without that I can't see how
> this would happen.  If not, can you send me the reproducer including
> the mkfs line?
> 
> 

The stripe tree doesn't touch anything before endio, but to be safe I'm
retesting without my patches applied.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio
  2022-09-05 14:39         ` Christoph Hellwig
  2022-09-05 14:43           ` Johannes Thumshirn
@ 2022-09-05 15:30           ` Johannes Thumshirn
  1 sibling, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-05 15:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On 05.09.22 16:40, Christoph Hellwig wrote:
> On Mon, Sep 05, 2022 at 02:31:53PM +0000, Johannes Thumshirn wrote:
>> hmm I got that one triggered with fsx:
>>
>> + /home/johannes/src/fstests/ltp/fsx -d /mnt/test/test
> 
> Odd.  Is this a raid stripe tree setup where one copy is using zone
> append and another isn't?  Because without that I can't see how
> this would happen.  If not, can you send me the reproducer, including
> the mkfs line?
> 
> 

OK, it seems totally unrelated to your patchset and was introduced by my
patches.  Odd, but not your problem; sorry for the noise.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer
  2022-09-05 14:31         ` Christoph Hellwig
@ 2022-09-05 22:34           ` Qu Wenruo
  2022-09-06  4:34             ` Christoph Hellwig
  0 siblings, 1 reply; 108+ messages in thread
From: Qu Wenruo @ 2022-09-05 22:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel



On 2022/9/5 22:31, Christoph Hellwig wrote:
> On Mon, Sep 05, 2022 at 02:59:33PM +0800, Qu Wenruo wrote:
>> Mostly due to the fact that metadata and data go split ways for
>> verification.
>>
>> All the verification for data happens at endio time.
>
> Yes.
>
>> While part of the verification of metadata (bytenr, csum, level,
>> tree-checker) goes at endio, but transid, checks against parent are all
>> done at btrfs_read_extent_buffer() time.
>>
>> This also means, the read-repair happens at different timing.
>
> Yes.  Read repair for metadata currently is very different from that
> for data.  But that is something that exists already and is not new
> in this series.
>
>> But what about putting all the needed metadata info (first key, level,
>> transid etc) also into bbio (using a union to take the same space of
>> data csum), so that all verification and read repair can happen at endio
>> time, the same timing as data?
>
> I thought about that.  And I suspect it probably is the right thing
> to do.  I've mostly stayed away from it because it doesn't really
> help with the goal of this series, and I also don't have good
> code coverage to feel comfortable touching the metadata checksum
> handling and repair.  I can offer this sneaky deal:  if someone
> helps create good metadata repair coverage in xfstests, I will look
> into this next.

Then may I take on this work? It's mostly independent, so you can
continue your existing work without being distracted.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 108+ messages in thread
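
For illustration, the bbio union Qu proposes above might look roughly
like the following sketch; the metadata-side field names are
hypothetical, not code from this series:

struct btrfs_bio {
        /* ... existing fields ... */
        union {
                /* Data reads: checksums preloaded from the csum tree. */
                struct {
                        u8 *csum;
                        u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
                };
                /* Metadata reads: context for endio-time verification
                 * (hypothetical field names). */
                struct {
                        struct btrfs_key first_key; /* expected first key */
                        u64 transid;                /* expected generation */
                        int level;                  /* expected tree level */
                };
        };
        /* ... */
};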

* Re: [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer
  2022-09-05 22:34           ` Qu Wenruo
@ 2022-09-06  4:34             ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-06  4:34 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Tue, Sep 06, 2022 at 06:34:40AM +0800, Qu Wenruo wrote:
>> I thought about that.  And I suspect it probably is the right thing
>> to do.  I've mostly stayed away from it because it doesn't really
>> help with the goal of this series, and I also don't have good
>> code coverage to feel comfortable touching the metadata checksum
>> handling and repair.  I can offer this sneaky deal:  if someone
>> helps create good metadata repair coverage in xfstests, I will look
>> into this next.
>
> Then may I take on this work? It's mostly independent, so you can
> continue your existing work without being distracted.

Fine with me as well.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* code placement for bio / storage layer code
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (17 preceding siblings ...)
  2022-09-02 15:18 ` consolidate btrfs checksumming, repair and bio splitting Johannes Thumshirn
@ 2022-09-07  9:10 ` Christoph Hellwig
  2022-09-07  9:46   ` Johannes Thumshirn
                     ` (2 more replies)
  2022-10-24  8:12 ` consolidate btrfs checksumming, repair and bio splitting Johannes Thumshirn
  19 siblings, 3 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-07  9:10 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, Johannes Thumshirn, Qu Wenruo, linux-btrfs

Hi all,

On Thu, Sep 01, 2022 at 10:41:59AM +0300, Christoph Hellwig wrote:
> Note: this adds a fair amount of code to volumes.c, which already is
> quite large.  It might make sense to add a prep patch to move
> btrfs_submit_bio into a new bio.c file, but I only want to do that
> if we have agreement on the move as the conflicts will be painful
> when rebasing.

any comments on this question?  Should I just keep adding this code
to volumes.c?  Or create a new bio.c?  If so I could send out a
small prep series to do the move of the existing code ASAP.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: code placement for bio / storage layer code
  2022-09-07  9:10 ` code placement for bio / storage layer code Christoph Hellwig
@ 2022-09-07  9:46   ` Johannes Thumshirn
  2022-09-07 10:28   ` Qu Wenruo
  2022-10-10  8:01   ` Johannes Thumshirn
  2 siblings, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-09-07  9:46 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, Qu Wenruo, linux-btrfs

On 07.09.22 11:11, Christoph Hellwig wrote:
> Hi all,
> 
> On Thu, Sep 01, 2022 at 10:41:59AM +0300, Christoph Hellwig wrote:
>> Note: this adds a fair amount of code to volumes.c, which already is
>> quite large.  It might make sense to add a prep patch to move
>> btrfs_submit_bio into a new bio.c file, but I only want to do that
>> if we have agreement on the move as the conflicts will be painful
>> when rebasing.
> 
> any comments on this question?  Should I just keep adding this code
> to volumes.c?  Or create a new bio.c?  If so I could send out a
> small prep series to do the move of the existing code ASAP.
> 

I personally am in favor of creating a bio.c file. This would make the
code easier to follow (both in volumes.c and in the then-new bio.c).

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: code placement for bio / storage layer code
  2022-09-07  9:10 ` code placement for bio / storage layer code Christoph Hellwig
  2022-09-07  9:46   ` Johannes Thumshirn
@ 2022-09-07 10:28   ` Qu Wenruo
  2022-09-07 11:10     ` Christoph Hellwig
  2022-10-10  8:01   ` Johannes Thumshirn
  2 siblings, 1 reply; 108+ messages in thread
From: Qu Wenruo @ 2022-09-07 10:28 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, Johannes Thumshirn, Qu Wenruo, linux-btrfs



On 2022/9/7 17:10, Christoph Hellwig wrote:
> Hi all,
>
> On Thu, Sep 01, 2022 at 10:41:59AM +0300, Christoph Hellwig wrote:
>> Note: this adds a fair amount of code to volumes.c, which already is
>> quite large.  It might make sense to add a prep patch to move
>> btrfs_submit_bio into a new bio.c file, but I only want to do that
>> if we have agreement on the move as the conflicts will be painful
>> when rebasing.
>
> any comments on this question?  Should I just keep adding this code
> to volumes.c?  Or create a new bio.c?  If so I could send out a
> small prep series to do the move of the existing code ASAP.

I'm pretty happy with a new file.

But before that, I'd like some more guidelines on what to put into
the two files.

To me, the old volumes.c should really only contain the chunk tree related
code (read, add, delete a chunk), so it may be better renamed to
something like chunks.c?

Then the storage layer code should be the lower level code mostly
touching the bio.

BTW, we may also want to extract a lot of code from extent_io.c to that
new storage layer file.


But I'm not sure if bio.c is really the best name.
What about storage.c?

Thanks,
Qu

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: code placement for bio / storage layer code
  2022-09-07 10:28   ` Qu Wenruo
@ 2022-09-07 11:10     ` Christoph Hellwig
  2022-09-07 11:27       ` Qu Wenruo
  0 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-07 11:10 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Naohiro Aota, Johannes Thumshirn, Qu Wenruo, linux-btrfs

On Wed, Sep 07, 2022 at 06:28:05PM +0800, Qu Wenruo wrote:
> To me, the old volumes.c should really only contain the chunk tree related
> code (read, add, delete a chunk), so it may be better renamed to
> something like chunks.c?

I'll leave that question to folks who know that area of code much better.

> Then the storage layer code should be the lower level code mostly
> touching the bio.

For the initial version just doing the move, this would be

 - btrfs_submit_bio
 - btrfs_submit_mirrored_bio
 - btrfs_submit_dev_bio
 - btrfs_clone_write_end_io
 - btrfs_orig_write_end_io
 - btrfs_raid56_end_io
 - btrfs_simple_end_io
 - btrfs_end_bio_work
 - btrfs_end_io_wq
 - btrfs_log_dev_io_error
 - btrfs_bio_clone_partial
 - btrfs_bio_alloc
 - btrfs_bio_init
 - btrfs_bioset_init
 - btrfs_bioset_exit

> BTW, we may also want to extract a lot of code from extent_io.c to that
> new storage layer file.

Yes, this series moves a fair chunk to volumes.c that should go into
the new file instead, and there might be a few more bits.

> But I'm not sure if bio.c is really the best name.
> What about storage.c?

I'm fine either way with a slight preference for bio.c.

^ permalink raw reply	[flat|nested] 108+ messages in thread
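
For concreteness, the mechanical part of the move discussed above might
look like this sketch; the split has not actually been done yet, so the
file layout and comments are illustrative only:

/* fs/btrfs/bio.c - sketch of a new home for the bio submission code */
#include "ctree.h"
#include "volumes.h"

/* The biosets and the failed_bio mempool move here from volumes.c. */
static struct bio_set btrfs_bioset;
static struct bio_set btrfs_clone_bioset;
static struct bio_set btrfs_repair_bioset;
static mempool_t btrfs_failed_bio_pool;

/*
 * btrfs_bio_init, btrfs_bio_alloc, btrfs_bio_clone_partial, the end_io
 * handlers and btrfs_submit_bio from the list above would move here
 * verbatim, with no functional change, plus a matching
 * "btrfs-y += bio.o" line in fs/btrfs/Makefile.
 */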

* Re: code placement for bio / storage layer code
  2022-09-07 11:10     ` Christoph Hellwig
@ 2022-09-07 11:27       ` Qu Wenruo
  2022-09-07 11:35         ` Christoph Hellwig
  0 siblings, 1 reply; 108+ messages in thread
From: Qu Wenruo @ 2022-09-07 11:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, linux-btrfs



On 2022/9/7 19:10, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 06:28:05PM +0800, Qu Wenruo wrote:
>> To me, the old volumes.c should really only contain the chunk tree related
>> code (read, add, delete a chunk), so it may be better renamed to
>> something like chunks.c?
>
> I'll leave that question to folks who know that area of code much better.
>
>> Then the storage layer code should be the lower level code mostly
>> touching the bio.
>
> For the initial version just doing the move, this would be
>
>   - btrfs_submit_bio
>   - btrfs_submit_mirrored_bio
>   - btrfs_submit_dev_bio
>   - btrfs_clone_write_end_io
>   - btrfs_orig_write_end_io
>   - btrfs_raid56_end_io

This is scrub-only usage; I guess we may find a better way to determine
if it should go there.

>   - btrfs_simple_end_io
>   - btrfs_end_bio_work
>   - btrfs_end_io_wq
>   - btrfs_log_dev_io_error
>   - btrfs_bio_clone_partial
>   - btrfs_bio_alloc
>   - btrfs_bio_init
>   - btrfs_bioset_init
>   - btrfs_bioset_exit

Otherwise looks pretty good to me.

Thanks,
Qu
>
>> BTW, we may also want to extract a lot of code from extent_io.c to that
>> new storage layer file.
>
> Yes, this series moves a fair chunk to volumes.c that should go into
> the new file instead, and there might be a few more bits.
>
>> But I'm not sure if bio.c is really the best name.
>> What about storage.c?
>
> I'm fine either way with a slight preference for bio.c.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: code placement for bio / storage layer code
  2022-09-07 11:27       ` Qu Wenruo
@ 2022-09-07 11:35         ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-07 11:35 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Naohiro Aota, Johannes Thumshirn, Qu Wenruo, linux-btrfs

On Wed, Sep 07, 2022 at 07:27:48PM +0800, Qu Wenruo wrote:
>>   - btrfs_submit_bio
>>   - btrfs_submit_mirrored_bio
>>   - btrfs_submit_dev_bio
>>   - btrfs_clone_write_end_io
>>   - btrfs_orig_write_end_io
>>   - btrfs_raid56_end_io
>
> This is scrub-only usage; I guess we may find a better way to determine
> if it should go there.

None of this is scrub only.  All of the above are called by
btrfs_submit_bio, or set up as end_io handlers by them.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 01/17] block: export bio_split_rw
  2022-09-01  7:42 ` [PATCH 01/17] block: export bio_split_rw Christoph Hellwig
  2022-09-01  8:02   ` Johannes Thumshirn
  2022-09-01  8:54   ` Qu Wenruo
@ 2022-09-07 17:51   ` Josef Bacik
  2 siblings, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 17:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:00AM +0300, Christoph Hellwig wrote:
> bio_split_rw can be used by file systems to split an incoming write
> bio into multiple bios fitting the hardware limit for use as ZONE_APPEND
> bios.  Export it for initial use in btrfs.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread
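
The export itself is presumably little more than a one-liner; a sketch,
with the v6.1-era prototype shown for reference (treat the details as
illustrative rather than authoritative):

/* include/linux/bio.h */
struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
                         unsigned int *nr_segs, struct bio_set *bs,
                         unsigned int max_bytes);

/* block/blk-merge.c, after the existing definition */
EXPORT_SYMBOL_GPL(bio_split_rw);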

* Re: [PATCH 02/17] btrfs: stop tracking failed reads in the I/O tree
  2022-09-01  7:42 ` [PATCH 02/17] btrfs: stop tracking failed reads in the I/O tree Christoph Hellwig
  2022-09-01  8:55   ` Qu Wenruo
@ 2022-09-07 17:52   ` Josef Bacik
  1 sibling, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 17:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:01AM +0300, Christoph Hellwig wrote:
> There is a separate I/O failure tree to track the failed reads, so remove
> the extra EXTENT_DAMAGED bit in the I/O tree.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 03/17] btrfs: move repair_io_failure to volumes.c
  2022-09-01  7:42 ` [PATCH 03/17] btrfs: move repair_io_failure to volumes.c Christoph Hellwig
@ 2022-09-07 17:54   ` Josef Bacik
  0 siblings, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 17:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:02AM +0300, Christoph Hellwig wrote:
> repair_io_failure ties directly into all the gory low-level details of
> mapping a bio with a logical address to the actual physical location.
> Move it right below btrfs_submit_bio to keep all the related logic
> together.
> 
> Also move btrfs_repair_eb_io_failure to its caller in disk-io.c now that
> repair_io_failure is available in a header.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer
  2022-09-01  7:42 ` [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer Christoph Hellwig
  2022-09-01  9:04   ` Qu Wenruo
@ 2022-09-07 18:15   ` Josef Bacik
  2022-09-12 13:57     ` Christoph Hellwig
  1 sibling, 1 reply; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 18:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:03AM +0300, Christoph Hellwig wrote:
> Currently btrfs handles checksum validation and repair in the end I/O
> handler for the btrfs_bio.  This leads to a lot of duplicate code
> plus issues with varying semantics or bugs, e.g.
> 
>  - the until recently completely broken repair for compressed extents
>  - the fact that encoded reads validate the checksums but do not kick
>    off read repair
>  - the inconsistent checking of the BTRFS_FS_STATE_NO_CSUMS flag
> 
> This commit revamps the checksum validation and repair code to instead
> work below the btrfs_submit_bio interfaces.  For this to work we need
> to make sure an inode is available, so that is added as a parameter
> to btrfs_bio_alloc.  With that btrfs_submit_bio can preload
> btrfs_bio.csum from the csum tree without help from the upper layers,
> and the low-level I/O completion can iterate over the bio and verify
> the checksums.
> 
> In case of a checksum failure (or a plain old I/O error), the repair
> is now kicked off before the upper level ->end_io handler is invoked.
> Tracking of the repair status is massively simplified by just keeping
> a small failed_bio structure per bio with failed sectors and otherwise
> using the information in the repair bio.  The per-inode I/O failure
> tree can be entirely removed.
> 
> The saved bvec_iter in the btrfs_bio is now completely managed by
> btrfs_submit_bio and must not be accessed by the callers.
> 
> There is one significant behavior change here:  If repair fails or
> is impossible to start with, the whole bio will be failed to the
> upper layer.  This is the behavior that all I/O submitters except
> for buffered I/O already emulated in their end_io handler.  For
> buffered I/O this now means that a large readahead request can
> fail due to a single bad sector, but as readahead errors are ignored,
> the subsequent readpage will still be able to read the sector if it
> is actually accessed.  This also matches the I/O failure handling
> in other file systems.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Generally the change itself is fine, but there are several whitespace errors.
Additionally, this is sort of massive; I would prefer if you added the
functionality, removing the various calls to the old io failure rec stuff, and
then had a follow-up patch to remove the old io failure code.  That makes it
easier for reviewers to parse what is important to pay attention to and what can
easily be ignored.  Clearly I've already reviewed it, but if you rework it
beyond fixing the whitespace issues it would be nice to split the changes into
two.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 05/17] btrfs: handle checksum generation in the storage layer
  2022-09-01  7:42 ` [PATCH 05/17] btrfs: handle checksum generation in " Christoph Hellwig
@ 2022-09-07 20:33   ` Josef Bacik
  0 siblings, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 20:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:04AM +0300, Christoph Hellwig wrote:
> Instead of letting the callers of btrfs_submit_bio deal with checksumming
> the (meta)data in the bio and making decisions on when to offload the
> checksumming to a workqueue, leave that to btrfs_submit_bio.  To do so, the
> existing btrfs_submit_bio function is split into an upper and a lower
> half, so that the lower half can be offloaded to a workqueue.
> 
> The driver-private REQ_DRV flag is used to indicate the special 'bio must
> be contained in a single ordered extent case' that is used by the
> compressed write case instead of passing a new flag all the way down the
> stack.
> 
> Note that this changes the behavior for direct writes to raid56 volumes so
> that async checksum offloading is not skipped when more I/O is expected.
> This runs counter to the argument explaining why it was done, although I
> can't measure any effects of the change.  Commits later in this series
> will make sure the entire direct write is offloaded to the workqueue
> at once and thus make sure it is sent to the raid56 code from a single
> thread.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/compression.c |  13 +--
>  fs/btrfs/ctree.h       |   4 +-
>  fs/btrfs/disk-io.c     | 170 ++-------------------------------
>  fs/btrfs/disk-io.h     |   5 -
>  fs/btrfs/extent_io.h   |   3 -
>  fs/btrfs/file-item.c   |  25 ++---
>  fs/btrfs/inode.c       |  89 +-----------------
>  fs/btrfs/volumes.c     | 208 ++++++++++++++++++++++++++++++++++++-----
>  fs/btrfs/volumes.h     |   7 +-
>  9 files changed, 215 insertions(+), 309 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index f932415a4f1df..53f9e123712b0 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -351,9 +351,9 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>  	u64 cur_disk_bytenr = disk_start;
>  	u64 next_stripe_start;
>  	blk_status_t ret = BLK_STS_OK;
> -	int skip_sum = inode->flags & BTRFS_INODE_NODATASUM;
>  	const bool use_append = btrfs_use_zone_append(inode, disk_start);
> -	const enum req_op bio_op = use_append ? REQ_OP_ZONE_APPEND : REQ_OP_WRITE;
> +	const enum req_op bio_op = REQ_BTRFS_ONE_ORDERED |
> +		(use_append ? REQ_OP_ZONE_APPEND : REQ_OP_WRITE);
>  

I'd rather see this as a separate change.  Keeping logical changes to themselves
makes it easier to figure out what was going on when we look back at the
history.  Other than that you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread
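
The REQ_DRV aliasing mentioned in the commit message amounts to a
one-line define, presumably along these lines:

/*
 * Reuse the driver-private flag bit to mark bios that must stay within
 * a single ordered extent, instead of threading a new flag through the
 * whole block stack (sketch of what the commit message describes).
 */
#define REQ_BTRFS_ONE_ORDERED   REQ_DRV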

* Re: [PATCH 06/17] btrfs: handle recording of zoned writes in the storage layer
  2022-09-01  7:42 ` [PATCH 06/17] btrfs: handle recording of zoned writes " Christoph Hellwig
  2022-09-01  9:44   ` Johannes Thumshirn
@ 2022-09-07 20:36   ` Josef Bacik
  2022-09-12  6:11   ` Naohiro Aota
  2 siblings, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 20:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:05AM +0300, Christoph Hellwig wrote:
> Move the code that splits the ordered extents and records the physical
> location for them to the storage layer so that the higher level consumers
> don't have to care about physical block numbers at all.  This will also
> eventually allow removing the accounting for the zone append write sizes
> in the upper layer with a little bit more block layer work.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios
  2022-09-01  7:42 ` [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios Christoph Hellwig
  2022-09-01  9:47   ` Johannes Thumshirn
@ 2022-09-07 20:55   ` Josef Bacik
  2022-09-12 13:58     ` Christoph Hellwig
  2022-09-12  0:20   ` Qu Wenruo
  2 siblings, 1 reply; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 20:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel, osandov

On Thu, Sep 01, 2022 at 10:42:06AM +0300, Christoph Hellwig wrote:
> Currently the I/O submitters have to split bios according to the
> chunk stripe boundaries.  This leads to extra lookups in the extent
> trees and a lot of boilerplate code.
> 
> To drop this requirement, split the bio when __btrfs_map_block
> returns a mapping that is smaller than the requested size and
> keep a count of pending bios in the original btrfs_bio so that
> the upper level completion is only invoked when all clones have
> completed.
> 
> Based on a patch from Qu Wenruo.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/volumes.c | 106 +++++++++++++++++++++++++++++++++++++--------
>  fs/btrfs/volumes.h |   1 +
>  2 files changed, 90 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 5c6535e10085d..0a2d144c20604 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -35,6 +35,7 @@
>  #include "zoned.h"
>  
>  static struct bio_set btrfs_bioset;
> +static struct bio_set btrfs_clone_bioset;
>  static struct bio_set btrfs_repair_bioset;
>  static mempool_t btrfs_failed_bio_pool;
>  
> @@ -6661,6 +6662,7 @@ static void btrfs_bio_init(struct btrfs_bio *bbio, struct inode *inode,
>  	bbio->inode = inode;
>  	bbio->end_io = end_io;
>  	bbio->private = private;
> +	atomic_set(&bbio->pending_ios, 1);
>  }
>  
>  /*
> @@ -6698,6 +6700,57 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
>  	return bio;
>  }
>  
> +static struct bio *btrfs_split_bio(struct bio *orig, u64 map_length)
> +{
> +	struct btrfs_bio *orig_bbio = btrfs_bio(orig);
> +	struct bio *bio;
> +
> +	bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
> +			&btrfs_clone_bioset);
> +	btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode, NULL, orig_bbio);
> +
> +	btrfs_bio(bio)->file_offset = orig_bbio->file_offset;
> +	orig_bbio->file_offset += map_length;

I'm worried about this for the ONE_ORDERED case.  We specifically used the
ONE_ORDERED thing because our file_offset was the start, but our length could go
past the range of the ordered extent, and then we wouldn't find our ordered
extent and things would go quite wrong.

Instead we should do something like

if (!(orig->bi_opf & REQ_BTRFS_ONE_ORDERED))
	orig_bbio->file_offset += map_length;

I've cc'ed Omar since he's the one who added this and I'm a little confused
about how this can happen.

> +
> +	atomic_inc(&orig_bbio->pending_ios);
> +	return bio;
> +}
> +
> +static void btrfs_orig_write_end_io(struct bio *bio);
> +static void btrfs_bbio_propagate_error(struct btrfs_bio *bbio,
> +				       struct btrfs_bio *orig_bbio)
> +{
> +	/*
> +	 * For writes btrfs tolerates nr_mirrors - 1 write failures, so we
> +	 * can't just blindly propagate a write failure here.
> +	 * Instead increment the error count in the original I/O context so
> +	 * that it is guaranteed to be larger than the error tolerance.
> +	 */
> +	if (bbio->bio.bi_end_io == &btrfs_orig_write_end_io) {
> +		struct btrfs_io_stripe *orig_stripe = orig_bbio->bio.bi_private;
> +		struct btrfs_io_context *orig_bioc = orig_stripe->bioc;
> +		

Whitespace error here.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread
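
Folding Josef's suggestion into the split helper from the quoted diff
would give roughly the following; whether this guard is the right fix
is exactly what is under discussion, so take it as a sketch:

static struct bio *btrfs_split_bio(struct bio *orig, u64 map_length)
{
        struct btrfs_bio *orig_bbio = btrfs_bio(orig);
        struct bio *bio;

        bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
                        &btrfs_clone_bioset);
        btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode, NULL, orig_bbio);

        btrfs_bio(bio)->file_offset = orig_bbio->file_offset;
        /* Josef's proposed guard: a ONE_ORDERED bio keeps its original
         * file_offset so the ordered extent lookup still succeeds. */
        if (!(orig->bi_opf & REQ_BTRFS_ONE_ORDERED))
                orig_bbio->file_offset += map_length;

        atomic_inc(&orig_bbio->pending_ios);
        return bio;
}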

* Re: [PATCH 08/17] btrfs: pass the iomap bio to btrfs_submit_bio
  2022-09-01  7:42 ` [PATCH 08/17] btrfs: pass the iomap bio to btrfs_submit_bio Christoph Hellwig
@ 2022-09-07 21:00   ` Josef Bacik
  0 siblings, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 21:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:07AM +0300, Christoph Hellwig wrote:
> Now that btrfs_submit_bio splits the bio when crossing stripe boundaries,
> there is no need for the higher level code to do that manually.
> 
> For direct I/O this is really helpful, as btrfs_submit_io can now simply
> take the bio allocated by iomap and send it on to btrfs_submit_bio
> instead of allocating clones.
> 
> For that to work, the bio embedded into struct btrfs_dio_private needs to
> become a full btrfs_bio as expected by btrfs_submit_bio.
> 
> With this change there is a single work item to offload the entire iomap
> bio so the heuristics to skip async processing for bios that were split
> isn't needed anymore either.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 09/17] btrfs: remove stripe boundary calculation for buffered I/O
  2022-09-01  7:42 ` [PATCH 09/17] btrfs: remove stripe boundary calculation for buffered I/O Christoph Hellwig
@ 2022-09-07 21:04   ` Josef Bacik
  0 siblings, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 21:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:08AM +0300, Christoph Hellwig wrote:
> From: Qu Wenruo <wqu@suse.com>
> 
> Remove btrfs_bio_ctrl::len_to_stripe_boundary, so that buffered
> I/O will no longer limit its bio size according to stripe length
> now that btrfs_submit_bio can split bios at stripe boundaries.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> [hch: simplify calc_bio_boundaries a little more]
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 10/17] btrfs: remove stripe boundary calculation for compressed I/O
  2022-09-01  7:42 ` [PATCH 10/17] btrfs: remove stripe boundary calculation for compressed I/O Christoph Hellwig
  2022-09-01  9:56   ` Johannes Thumshirn
@ 2022-09-07 21:07   ` Josef Bacik
  1 sibling, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 21:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:09AM +0300, Christoph Hellwig wrote:
> From: Qu Wenruo <wqu@suse.com>
> 
> Stop looking at the stripe boundary in alloc_compressed_bio() now that
> btrfs_submit_bio can split bios, open code the now trivial code
> from alloc_compressed_bio() in btrfs_submit_compressed_read and stop
> maintaining the pending_ios count for reads as there is always just
> a single bio now.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> [hch: remove more cruft in btrfs_submit_compressed_read]
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 11/17] btrfs: remove stripe boundary calculation for encoded I/O
  2022-09-01  7:42 ` [PATCH 11/17] btrfs: remove stripe boundary calculation for encoded I/O Christoph Hellwig
  2022-09-01  9:58   ` Johannes Thumshirn
@ 2022-09-07 21:08   ` Josef Bacik
  1 sibling, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 21:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:10AM +0300, Christoph Hellwig wrote:
> From: Qu Wenruo <wqu@suse.com>
> 
> Stop looking at the stripe boundary in
> btrfs_encoded_read_regular_fill_pages() now that btrfs_submit_bio
> can split bios.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 12/17] btrfs: remove struct btrfs_io_geometry
  2022-09-01  7:42 ` [PATCH 12/17] btrfs: remove struct btrfs_io_geometry Christoph Hellwig
@ 2022-09-07 21:10   ` Josef Bacik
  0 siblings, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 21:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:11AM +0300, Christoph Hellwig wrote:
> Now that btrfs_get_io_geometry has a single caller, we can massage it
> into a form that is more suitable for that caller and remove the
> marshalling into and out of struct btrfs_io_geometry.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 13/17] btrfs: remove submit_encoded_read_bio
  2022-09-01  7:42 ` [PATCH 13/17] btrfs: remove submit_encoded_read_bio Christoph Hellwig
  2022-09-01 10:02   ` Johannes Thumshirn
@ 2022-09-07 21:11   ` Josef Bacik
  1 sibling, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 21:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:12AM +0300, Christoph Hellwig wrote:
> Just open code the functionality in the only caller and remove the
> now superfluous error handling there.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 14/17] btrfs: remove now spurious bio submission helpers
  2022-09-01  7:42 ` [PATCH 14/17] btrfs: remove now spurious bio submission helpers Christoph Hellwig
  2022-09-01 10:14   ` Johannes Thumshirn
@ 2022-09-07 21:12   ` Josef Bacik
  1 sibling, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 21:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:13AM +0300, Christoph Hellwig wrote:
> Just call btrfs_submit_bio and btrfs_submit_compressed_read directly from
> submit_one_bio now that all additional functionality has moved into
> btrfs_submit_bio.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio
  2022-09-01  7:42 ` [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio Christoph Hellwig
  2022-09-02  1:46   ` Damien Le Moal
  2022-09-05 13:15   ` Johannes Thumshirn
@ 2022-09-07 21:17   ` Josef Bacik
  2 siblings, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 21:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:15AM +0300, Christoph Hellwig wrote:
> The current btrfs zoned device support is a little cumbersome in the data
> I/O path as it requires the callers to not submit more I/O than the
> ZONE_APPEND size supported by the underlying device.  This leads to a lot
> of extra accounting.  Instead change btrfs_submit_bio so that it can take
> write bios of arbitrary size and form from the upper layers, and just
> split them internally to the ZONE_APPEND queue limits.  Then remove all
> the upper layer warts catering to limited write sizes on zoned devices,
> including the extra refcount in the compressed_bio.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

though I'd trust the zoned guys' reviews over mine here.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread
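
The internal split described in the commit message boils down to
capping each mapping at the device's zone append limit and reusing the
generic split path from patch 07.  A sketch, with a hypothetical helper
name and assuming the existing fs_info->max_zone_append_size field:

static u64 btrfs_zone_append_map_length(struct btrfs_fs_info *fs_info,
                                        struct bio *bio, u64 map_length)
{
        /* Cap the mapping so the resulting split bio fits the device's
         * ZONE_APPEND limit; btrfs_split_bio() then does the rest. */
        if (bio_op(bio) == REQ_OP_ZONE_APPEND)
                map_length = min(map_length, fs_info->max_zone_append_size);
        return map_length;
}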

* Re: [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND
  2022-09-01  7:42 ` [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND Christoph Hellwig
  2022-09-01 10:46   ` Johannes Thumshirn
  2022-09-02  1:38   ` Damien Le Moal
@ 2022-09-07 21:18   ` Josef Bacik
  2 siblings, 0 replies; 108+ messages in thread
From: Josef Bacik @ 2022-09-07 21:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:16AM +0300, Christoph Hellwig wrote:
> No users left now that btrfs takes REQ_OP_WRITE bios from iomap and
> splits and converts them to REQ_OP_ZONE_APPEND internally.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios
  2022-09-01  7:42 ` [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios Christoph Hellwig
  2022-09-01  9:47   ` Johannes Thumshirn
  2022-09-07 20:55   ` Josef Bacik
@ 2022-09-12  0:20   ` Qu Wenruo
  2022-09-12 13:55     ` Christoph Hellwig
  2 siblings, 1 reply; 108+ messages in thread
From: Qu Wenruo @ 2022-09-12  0:20 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel



On 2022/9/1 15:42, Christoph Hellwig wrote:
> Currently the I/O submitters have to split bios according to the
> chunk stripe boundaries.  This leads to extra lookups in the extent
> trees and a lot of boilerplate code.
>
> To drop this requirement, split the bio when __btrfs_map_block
> returns a mapping that is smaller than the requested size and
> keep a count of pending bios in the original btrfs_bio so that
> the upper level completion is only invoked when all clones have
> completed.

Sorry for the late reply, but I still have a question related to the
chained bio approach.

Since we went with the chained method, if we hit an error for a
split bio, the whole bio will be marked as failed.

Especially for read bios, that can be a problem (currently only for
RAID10 though), which can affect the read repair behavior.

E.g. we have a 4-disks RAID10 looks like this:

Disk 1 (unreliable): Mirror 1 of logical range [X, X + 64K)
Disk 2 (reliable):   Mirror 2 of logical range [X, X + 64K)
Disk 3 (reliable):   Mirror 1 of logical range [X + 64K, X + 128K)
Disk 4 (unreliable): Mirror 2 of logical range [X + 64K, X + 128K)

And we submit a read for range [X, X + 128K)

The first 64K will use mirror 1, thus reading from Disk 1.
The second 64K will also use mirror 1, thus reading from Disk 3.

But the first 64K read fails for whatever reason, so we mark the
whole range as failed and it needs to go through the repair code.

Note that the original bio used mirror 1, so for read repair we
can only read from mirror 2.

But in that case Disk 4 is also unreliable; if at read-repair time we
don't try all mirrors, including the failed one (whose mirror_num is no
longer reliable for all ranges), we may fail to repair some ranges.

Does the read-repair code now have something to compensate for the
chained behavior?

Thanks,
Qu

>
> Based on a patch from Qu Wenruo.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/btrfs/volumes.c | 106 +++++++++++++++++++++++++++++++++++++--------
>   fs/btrfs/volumes.h |   1 +
>   2 files changed, 90 insertions(+), 17 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 5c6535e10085d..0a2d144c20604 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -35,6 +35,7 @@
>   #include "zoned.h"
>
>   static struct bio_set btrfs_bioset;
> +static struct bio_set btrfs_clone_bioset;
>   static struct bio_set btrfs_repair_bioset;
>   static mempool_t btrfs_failed_bio_pool;
>
> @@ -6661,6 +6662,7 @@ static void btrfs_bio_init(struct btrfs_bio *bbio, struct inode *inode,
>   	bbio->inode = inode;
>   	bbio->end_io = end_io;
>   	bbio->private = private;
> +	atomic_set(&bbio->pending_ios, 1);
>   }
>
>   /*
> @@ -6698,6 +6700,57 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
>   	return bio;
>   }
>
> +static struct bio *btrfs_split_bio(struct bio *orig, u64 map_length)
> +{
> +	struct btrfs_bio *orig_bbio = btrfs_bio(orig);
> +	struct bio *bio;
> +
> +	bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
> +			&btrfs_clone_bioset);
> +	btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode, NULL, orig_bbio);
> +
> +	btrfs_bio(bio)->file_offset = orig_bbio->file_offset;
> +	orig_bbio->file_offset += map_length;
> +
> +	atomic_inc(&orig_bbio->pending_ios);
> +	return bio;
> +}
> +
> +static void btrfs_orig_write_end_io(struct bio *bio);
> +static void btrfs_bbio_propagate_error(struct btrfs_bio *bbio,
> +				       struct btrfs_bio *orig_bbio)
> +{
> +	/*
> +	 * For writes btrfs tolerates nr_mirrors - 1 write failures, so we
> +	 * can't just blindly propagate a write failure here.
> +	 * Instead increment the error count in the original I/O context so
> +	 * that it is guaranteed to be larger than the error tolerance.
> +	 */
> +	if (bbio->bio.bi_end_io == &btrfs_orig_write_end_io) {
> +		struct btrfs_io_stripe *orig_stripe = orig_bbio->bio.bi_private;
> +		struct btrfs_io_context *orig_bioc = orig_stripe->bioc;
> +
> +		atomic_add(orig_bioc->max_errors, &orig_bioc->error);
> +	} else {
> +		orig_bbio->bio.bi_status = bbio->bio.bi_status;
> +	}
> +}
> +
> +static void btrfs_orig_bbio_end_io(struct btrfs_bio *bbio)
> +{
> +	if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
> +		struct btrfs_bio *orig_bbio = bbio->private;
> +
> +		if (bbio->bio.bi_status)
> +			btrfs_bbio_propagate_error(bbio, orig_bbio);
> +		bio_put(&bbio->bio);
> +		bbio = orig_bbio;
> +	}
> +
> +	if (atomic_dec_and_test(&bbio->pending_ios))
> +		bbio->end_io(bbio);
> +}
> +
>   static int next_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
>   {
>   	if (cur_mirror == fbio->num_copies)
> @@ -6715,7 +6768,7 @@ static int prev_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
>   static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
>   {
>   	if (atomic_dec_and_test(&fbio->repair_count)) {
> -		fbio->bbio->end_io(fbio->bbio);
> +		btrfs_orig_bbio_end_io(fbio->bbio);
>   		mempool_free(fbio, &btrfs_failed_bio_pool);
>   	}
>   }
> @@ -6857,7 +6910,7 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio,
>   	if (unlikely(fbio))
>   		btrfs_repair_done(fbio);
>   	else
> -		bbio->end_io(bbio);
> +		btrfs_orig_bbio_end_io(bbio);
>   }
>
>   static void btrfs_log_dev_io_error(struct bio *bio, struct btrfs_device *dev)
> @@ -6908,7 +6961,7 @@ static void btrfs_simple_end_io(struct bio *bio)
>   	} else {
>   		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
>   			btrfs_record_physical_zoned(bbio);
> -		bbio->end_io(bbio);
> +		btrfs_orig_bbio_end_io(bbio);
>   	}
>   }
>
> @@ -6922,7 +6975,7 @@ static void btrfs_raid56_end_io(struct bio *bio)
>   	if (bio_op(bio) == REQ_OP_READ)
>   		btrfs_check_read_bio(bbio, NULL);
>   	else
> -		bbio->end_io(bbio);
> +		btrfs_orig_bbio_end_io(bbio);
>
>   	btrfs_put_bioc(bioc);
>   }
> @@ -6949,7 +7002,7 @@ static void btrfs_orig_write_end_io(struct bio *bio)
>   	else
>   		bio->bi_status = BLK_STS_OK;
>
> -	bbio->end_io(bbio);
> +	btrfs_orig_bbio_end_io(bbio);
>   	btrfs_put_bioc(bioc);
>   }
>
> @@ -7190,8 +7243,8 @@ static bool btrfs_wq_submit_bio(struct btrfs_bio *bbio,
>   	return true;
>   }
>
> -void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
> -		      int mirror_num)
> +static bool btrfs_submit_chunk(struct btrfs_fs_info *fs_info, struct bio *bio,
> +			       int mirror_num)
>   {
>   	struct btrfs_bio *bbio = btrfs_bio(bio);
>   	u64 logical = bio->bi_iter.bi_sector << 9;
> @@ -7207,11 +7260,10 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
>   	if (ret)
>   		goto fail;
>
> +	map_length = min(map_length, length);
>   	if (map_length < length) {
> -		btrfs_crit(fs_info,
> -			   "mapping failed logical %llu bio len %llu len %llu",
> -			   logical, length, map_length);
> -		BUG();
> +		bio = btrfs_split_bio(bio, map_length);
> +		bbio = btrfs_bio(bio);
>   	}
>
>   	/*
> @@ -7222,7 +7274,7 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
>   		bbio->saved_iter = bio->bi_iter;
>   		ret = btrfs_lookup_bio_sums(bbio);
>   		if (ret)
> -			goto fail;
> +			goto fail_put_bio;
>   	}
>
>   	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
> @@ -7231,7 +7283,7 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
>   		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
>   			ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
>   			if (ret)
> -				goto fail;
> +				goto fail_put_bio;
>   		}
>
>   		/*
> @@ -7243,22 +7295,36 @@ void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
>   		    !btrfs_is_data_reloc_root(bi->root)) {
>   			if (should_async_write(bbio) &&
>   			    btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num))
> -				return;
> +				goto done;
>
>   			if (bio->bi_opf & REQ_META)
>   				ret = btree_csum_one_bio(bbio);
>   			else
>   				ret = btrfs_csum_one_bio(bbio);
>   			if (ret)
> -				goto fail;
> +				goto fail_put_bio;
>   		}
>   	}
>
>   	__btrfs_submit_bio(bio, bioc, &smap, mirror_num);
> -	return;
> +done:
> +	return map_length == length;
> +
> +fail_put_bio:
> +	if (map_length < length)
> +		bio_put(bio);
>   fail:
>   	btrfs_bio_counter_dec(fs_info);
>   	btrfs_bio_end_io(bbio, errno_to_blk_status(ret));
> +	/* Do not submit another chunk */
> +	return true;
> +}
> +
> +void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
> +		      int mirror_num)
> +{
> +	while (!btrfs_submit_chunk(fs_info, bio, mirror_num))
> +		;
>   }
>
>   /*
> @@ -8858,10 +8924,13 @@ int __init btrfs_bioset_init(void)
>   			offsetof(struct btrfs_bio, bio),
>   			BIOSET_NEED_BVECS))
>   		return -ENOMEM;
> +	if (bioset_init(&btrfs_clone_bioset, BIO_POOL_SIZE,
> +			offsetof(struct btrfs_bio, bio), 0))
> +		goto out_free_bioset;
>   	if (bioset_init(&btrfs_repair_bioset, BIO_POOL_SIZE,
>   			offsetof(struct btrfs_bio, bio),
>   			BIOSET_NEED_BVECS))
> -		goto out_free_bioset;
> +		goto out_free_clone_bioset;
>   	if (mempool_init_kmalloc_pool(&btrfs_failed_bio_pool, BIO_POOL_SIZE,
>   				      sizeof(struct btrfs_failed_bio)))
>   		goto out_free_repair_bioset;
> @@ -8869,6 +8938,8 @@ int __init btrfs_bioset_init(void)
>
>   out_free_repair_bioset:
>   	bioset_exit(&btrfs_repair_bioset);
> +out_free_clone_bioset:
> +	bioset_exit(&btrfs_clone_bioset);
>   out_free_bioset:
>   	bioset_exit(&btrfs_bioset);
>   	return -ENOMEM;
> @@ -8878,5 +8949,6 @@ void __cold btrfs_bioset_exit(void)
>   {
>   	mempool_exit(&btrfs_failed_bio_pool);
>   	bioset_exit(&btrfs_repair_bioset);
> +	bioset_exit(&btrfs_clone_bioset);
>   	bioset_exit(&btrfs_bioset);
>   }
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 8b248c9bd602b..97877184d0db1 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -386,6 +386,7 @@ struct btrfs_bio {
>
>   	/* For internal use in read end I/O handling */
>   	unsigned int mirror_num;
> +	atomic_t pending_ios;
>   	struct work_struct end_io_work;
>
>   	/*

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 06/17] btrfs: handle recording of zoned writes in the storage layer
  2022-09-01  7:42 ` [PATCH 06/17] btrfs: handle recording of zoned writes " Christoph Hellwig
  2022-09-01  9:44   ` Johannes Thumshirn
  2022-09-07 20:36   ` Josef Bacik
@ 2022-09-12  6:11   ` Naohiro Aota
  2 siblings, 0 replies; 108+ messages in thread
From: Naohiro Aota @ 2022-09-12  6:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Johannes Thumshirn, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Thu, Sep 01, 2022 at 10:42:05AM +0300, Christoph Hellwig wrote:
> Move the code that splits the ordered extents and records the physical
> location for them to the storage layer so that the higher level consumers
> don't have to care about physical block numbers at all.  This will also
> eventually allow removing the accounting for the zone append write sizes
> in the upper layer with a little bit more block layer work.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good to me.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios
  2022-09-12  0:20   ` Qu Wenruo
@ 2022-09-12 13:55     ` Christoph Hellwig
  2022-09-12 22:23       ` Qu Wenruo
  0 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-12 13:55 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Mon, Sep 12, 2022 at 08:20:37AM +0800, Qu Wenruo wrote:
> Sorry for the late reply, but I still have a question related to the
> chained bio approach.
>
> Since we went with the chained method, if we hit an error for a
> split bio, the whole bio will be marked as failed.

The only chained bios in the sense of using bio chaining are the
writes to the multiple legs of mirrored volumes.

> Especially for read bios, that can be a problem (currently only for
> RAID10 though), which can affect the read repair behavior.
>
> E.g. we have a 4-disks RAID10 looks like this:
>
> Disk 1 (unreliable): Mirror 1 of logical range [X, X + 64K)
> Disk 2 (reliable):   Mirror 2 of logical range [X, X + 64K)
> Disk 3 (reliable):   Mirror 1 of logical range [X + 64K, X + 128K)
> Disk 4 (unreliable): Mirror 2 of logical range [X + 64K, X + 128K)
>
> And we submit a read for range [X, X + 128K)
>
> The first 64K will use mirror 1, thus reading from Disk 1.
> The second 64K will also use mirror 1, thus reading from Disk 3.
>
> But the first 64K read fails for whatever reason, so we mark the
> whole range as failed and it needs to go through the repair code.

With the code posted in this series that is not what happens.  Instead
the checksum validation and then repair happen when the read from
mirror 1 / disk 1 completes, but before the results are propagated
up.  That was the prime reason why I had to move the repair code
below btrfs_submit_bio (that it happened to remove code and consolidate
the exact behavior is a nice side effect).

> Does the read-repair code now have something to compensate for the
> chained behavior?

It doesn't compensate for it, but it is invoked at a low enough level
that this problem does not happen.

^ permalink raw reply	[flat|nested] 108+ messages in thread
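
Condensed from the diffs quoted earlier in this thread, the point is
that each split bio runs verification in its own completion handler
before the result is folded back into the original bbio, so a bad
sector in one split only triggers repair for that split's range
(sketch; the device argument and the zone append handling are
simplified from the real function):

static void btrfs_simple_end_io(struct bio *bio)
{
        struct btrfs_bio *bbio = btrfs_bio(bio);
        struct btrfs_device *dev = bio->bi_private;

        if (bio_op(bio) == REQ_OP_READ) {
                /* Verify csums and kick off repair for this split only;
                 * btrfs_orig_bbio_end_io() runs once repair completes. */
                btrfs_check_read_bio(bbio, dev);
        } else {
                btrfs_orig_bbio_end_io(bbio);
        }
}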

* Re: [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer
  2022-09-07 18:15   ` Josef Bacik
@ 2022-09-12 13:57     ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-12 13:57 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, Chris Mason, David Sterba, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On Wed, Sep 07, 2022 at 02:15:22PM -0400, Josef Bacik wrote:
> Additionally this is sort of massive, I would prefer if you added the
> functionality, removing the various calls to the old io failure rec
> stuff, and then had a follow up patch to remove the old io failure code.

Hmm.  To do that I'd have to add a new temporary member to btrfs_bio
to signal that the low-level repair code should be used.  If that is
ok with the maintainers I can give it a try.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios
  2022-09-07 20:55   ` Josef Bacik
@ 2022-09-12 13:58     ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-09-12 13:58 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, Chris Mason, David Sterba, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel,
	osandov

On Wed, Sep 07, 2022 at 04:55:45PM -0400, Josef Bacik wrote:
> I'm worried about this for the ONE_ORDERED case.  We specifically used the
> ONE_ORDERED thing because our file_offset was the start, but our length could go
> past the range of the ordered extent, and then we wouldn't find our ordered
> extent and things would go quite wrong.
> 
> Instead we should do something like
> 
> if (!(orig->bi_opf & REQ_BTRFS_ONE_ORDERED))
> 	orig_bbio->file_offset += map_length;
> 
> I've cc'ed Omar since he's the one who added this and I'm a little confused
> about how this can happen.

I have to say I found the logic quite confusing as well, and when I
broke it during development of this series xfstests did not complain
either.  So shedding some more light on the flag would be really
helpful.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios
  2022-09-12 13:55     ` Christoph Hellwig
@ 2022-09-12 22:23       ` Qu Wenruo
  0 siblings, 0 replies; 108+ messages in thread
From: Qu Wenruo @ 2022-09-12 22:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel



On 2022/9/12 21:55, Christoph Hellwig wrote:
> On Mon, Sep 12, 2022 at 08:20:37AM +0800, Qu Wenruo wrote:
>> Sorry for the late reply, but I still have a question related to the
>> chained bio approach.
>>
>> Since we went with the chained method, if we hit an error for a
>> split bio, the whole bio will be marked as failed.
>
> The only chained bios in the sense of using bio chaining are the
> writes to the multiple legs of mirrored volumes.
>
>> Especially for read bios, that can be a problem (currently only for
>> RAID10 though), which can affect the read repair behavior.
>>
>> E.g. we have a 4-disks RAID10 looks like this:
>>
>> Disk 1 (unreliable): Mirror 1 of logical range [X, X + 64K)
>> Disk 2 (reliable):   Mirror 2 of logical range [X, X + 64K)
>> Disk 3 (reliable):   Mirror 1 of logical range [X + 64K, X + 128K)
>> Disk 4 (unreliable): Mirror 2 of logical range [X + 64K, X + 128K)
>>
>> And we submit a read for range [X, X + 128K)
>>
>> The first 64K will use mirror 1, thus reading from Disk 1.
>> The second 64K will also use mirror 1, thus reading from Disk 3.
>>
>> But the first 64K read fails for whatever reason, so we mark the
>> whole range as failed and it needs to go through the repair code.
>
> With the code posted in this series that is not what happens.  Instead
> the checksum validation and then repair happen when the read from
> mirror 1 / disk 1 completes, but before the results are propagated
> up.  That was the prime reason why I had to move the repair code
> below btrfs_submit_bio (that it happened to remove code and consolidate
> the exact behavior is a nice side effect).
>
>> Does the read-repair code now have something to compensate for the
>> chained behavior?
>
> It doesn't compensate for it, but it is invoked at a low enough level
> that this problem does not happen.

You're completely right; it's the 4th patch that puts the verification
code into the endio function, so the verification is still done
per split bio.

I really should review the whole series in one go...

Then it looks pretty good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: code placement for bio / storage layer code
  2022-09-07  9:10 ` code placement for bio / storage layer code Christoph Hellwig
  2022-09-07  9:46   ` Johannes Thumshirn
  2022-09-07 10:28   ` Qu Wenruo
@ 2022-10-10  8:01   ` Johannes Thumshirn
  2 siblings, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-10-10  8:01 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, Qu Wenruo, linux-btrfs

On 07.09.22 11:11, Christoph Hellwig wrote:
> Hi all,
> 
> On Thu, Sep 01, 2022 at 10:41:59AM +0300, Christoph Hellwig wrote:
>> Note: this adds a fair amount of code to volumes.c, which already is
>> quite large.  It might make sense to add a prep patch to move
>> btrfs_submit_bio into a new bio.c file, but I only want to do that
>> if we have agreement on the move as the conflicts will be painful
>> when rebasing.
> 
> any comments on this question?  Should I just keep adding this code
> to volumes.c?  Or create a new bio.c?  If so I could send out a
> small prep series to do the move of the existing code ASAP.
> 


Are there any plans on how to proceed with this patchset?

The 6.1 PR got pulled by Linus, so 6.2 development should start soon
with -rc1.

	Johannes

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
                   ` (18 preceding siblings ...)
  2022-09-07  9:10 ` code placement for bio / storage layer code Christoph Hellwig
@ 2022-10-24  8:12 ` Johannes Thumshirn
  2022-10-24  8:20   ` Qu Wenruo
  2022-10-24 14:44   ` Christoph Hellwig
  19 siblings, 2 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-10-24  8:12 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

David, what's your plan to progress with this series?


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-24  8:12 ` consolidate btrfs checksumming, repair and bio splitting Johannes Thumshirn
@ 2022-10-24  8:20   ` Qu Wenruo
  2022-10-24  9:07     ` Johannes Thumshirn
  2022-10-24 14:44   ` Christoph Hellwig
  1 sibling, 1 reply; 108+ messages in thread
From: Qu Wenruo @ 2022-10-24  8:20 UTC (permalink / raw)
  To: Johannes Thumshirn, Christoph Hellwig, Chris Mason, Josef Bacik,
	David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel



On 2022/10/24 16:12, Johannes Thumshirn wrote:
> David, what's your plan to progress with this series?
>

Initially David wanted me to do some fixups in my spare time, but I know
your RST feature depends on this.

If this series is urgent for you, I guess I can give it higher priority.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-24  8:20   ` Qu Wenruo
@ 2022-10-24  9:07     ` Johannes Thumshirn
  2022-10-24  9:18       ` Qu Wenruo
  0 siblings, 1 reply; 108+ messages in thread
From: Johannes Thumshirn @ 2022-10-24  9:07 UTC (permalink / raw)
  To: Qu Wenruo, Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On 24.10.22 10:20, Qu Wenruo wrote:
> 
> 
> On 2022/10/24 16:12, Johannes Thumshirn wrote:
>> David, what's your plan to progress with this series?
>>
> 
> Initially David wanted me to do some fixups in my spare time, but I know
> your RST feature depends on this.
> 
> If this series is urgent for you, I guess I can give it higher priority.

What are the fixups needed there? I haven't seen a mail from David about it.

I've quickly skimmed over the comments and it seems like Josef is mostly fine
with it.

I can continue working on it as well, but as this series contains code from both
you and Christoph I don't think I should be the 3rd person working on it.

But if it's needed, I can of course do it.

Byte,
	Johannes

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-24  9:07     ` Johannes Thumshirn
@ 2022-10-24  9:18       ` Qu Wenruo
  2022-10-24 10:21         ` Johannes Thumshirn
  0 siblings, 1 reply; 108+ messages in thread
From: Qu Wenruo @ 2022-10-24  9:18 UTC (permalink / raw)
  To: Johannes Thumshirn, Christoph Hellwig, Chris Mason, Josef Bacik,
	David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel



On 2022/10/24 17:07, Johannes Thumshirn wrote:
> On 24.10.22 10:20, Qu Wenruo wrote:
>>
>>
>> On 2022/10/24 16:12, Johannes Thumshirn wrote:
>>> David, what's your plan to progress with this series?
>>>
>>
>> Initially David wanted me to do some fixups in my spare time, but I know
>> your RST feature depends on this.
>>
>> If this series is urgent for you, I guess I can give it higher priority.
>
> What are the fixups needed there? I haven't seen a mail from David about it.

Mostly to fix up some comments/commit messages.

Sorry, that was an off-list discussion, thus not on the mailing list.

>
> I've quickly skimmed over the comments and it seems like Josef is mostly fine
> with it.
>
> I can continue working on it as well, but as this series contains code from both
> you and Christoph I don't think I should be the 3rd person working on it.
>
> But if it's needed, I can of course do it.

That would be very kind, as I'm still fighting with raid56 code, and
won't be able to work on this series immediately.

Thanks,
Qu

>
> Byte,
> 	Johannes

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-24  9:18       ` Qu Wenruo
@ 2022-10-24 10:21         ` Johannes Thumshirn
  0 siblings, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-10-24 10:21 UTC (permalink / raw)
  To: Qu Wenruo, Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On 24.10.22 11:18, Qu Wenruo wrote:
> 
> 
> On 2022/10/24 17:07, Johannes Thumshirn wrote:
>> On 24.10.22 10:20, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/10/24 16:12, Johannes Thumshirn wrote:
>>>> David, what's your plan to progress with this series?
>>>>
>>>
>>> Initially David wanted me to do some fixups in my spare time, but I know
>>> your RST feature depends on this.
>>>
>>> If this series is urgent for you, I guess I can give it higher priority.
>>
>> What are the fixups needed there? I haven't seen a mail from David about it.
> 
> Mostly to fixup some comments/commit messages.
> 
> Sorry, that's an off list talk, thus not in the mailing list.
> 
>>
>> I've quickly skimmed over the comments and it seems like Josef is mostly fine
>> with it.
>>
>> I can continue working on it as well, but as this series contains code from both
>> you and Christoph I don't think I should be the 3rd person working on it.
>>
>> But if it's needed, I can of course do it.
> 
> That would be very kind, as I'm still fighting with raid56 code, and
> won't be able to work on this series immediately.

OK, I'll go over my latest RST series and incorporate Josef's comments there,
and afterwards I'll take care of this one.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-24  8:12 ` consolidate btrfs checksumming, repair and bio splitting Johannes Thumshirn
  2022-10-24  8:20   ` Qu Wenruo
@ 2022-10-24 14:44   ` Christoph Hellwig
  2022-10-24 15:25     ` Chris Mason
  1 sibling, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2022-10-24 14:44 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On Mon, Oct 24, 2022 at 08:12:29AM +0000, Johannes Thumshirn wrote:
> David, what's your plan to progress with this series?

FYI, I object to merging any of my code into btrfs without a proper
copyright notice, and I also need to find some time to remove my
previous significant changes given that the btrfs maintainer
refuses to take the proper and legally required copyright notice.

So don't waste any of your time on this.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-24 14:44   ` Christoph Hellwig
@ 2022-10-24 15:25     ` Chris Mason
  2022-10-24 17:10       ` David Sterba
  0 siblings, 1 reply; 108+ messages in thread
From: Chris Mason @ 2022-10-24 15:25 UTC (permalink / raw)
  To: Christoph Hellwig, Johannes Thumshirn
  Cc: Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On 10/24/22 10:44 AM, Christoph Hellwig wrote:
> On Mon, Oct 24, 2022 at 08:12:29AM +0000, Johannes Thumshirn wrote:
>> David, what's your plan to progress with this series?
> 
> FYI, I object to merging any of my code into btrfs without a proper
> copyright notice, and I also need to find some time to remove my
> previous significant changes given that the btrfs maintainer
> refuses to take the proper and legally required copyright notice.
> 
> So don't waste any of your time on this.

Christoph's request is well within the norms for the kernel, given that 
he's making substantial changes to these files.  I talked this over with 
GregKH, who pointed me at:

https://www.linuxfoundation.org/blog/blog/copyright-notices-in-open-source-software-projects

Even if we'd taken up some of the other policies suggested by this doc, 
I'd still defer to preferences of developers who have made significant 
changes.

-chris


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-24 15:25     ` Chris Mason
@ 2022-10-24 17:10       ` David Sterba
  2022-10-24 17:34         ` Chris Mason
  2022-10-26  7:36         ` Johannes Thumshirn
  0 siblings, 2 replies; 108+ messages in thread
From: David Sterba @ 2022-10-24 17:10 UTC (permalink / raw)
  To: Chris Mason
  Cc: Christoph Hellwig, Johannes Thumshirn, Chris Mason, Josef Bacik,
	David Sterba, Damien Le Moal, Naohiro Aota, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Mon, Oct 24, 2022 at 11:25:04AM -0400, Chris Mason wrote:
> On 10/24/22 10:44 AM, Christoph Hellwig wrote:
> > On Mon, Oct 24, 2022 at 08:12:29AM +0000, Johannes Thumshirn wrote:
> >> David, what's your plan to progress with this series?
> > 
> > FYI, I object to merging any of my code into btrfs without a proper
> > copyright notice, and I also need to find some time to remove my
> > previous significant changes given that the btrfs maintainer
> > refuses to take the proper and legally required copyright notice.
> > 
> > So don't waste any of your time on this.
> 
> Christoph's request is well within the norms for the kernel, given that 
> he's making substantial changes to these files.  I talked this over with 
> GregKH, who pointed me at:
> 
> https://www.linuxfoundation.org/blog/blog/copyright-notices-in-open-source-software-projects
> 
> Even if we'd taken up some of the other policies suggested by this doc, 
> I'd still defer to preferences of developers who have made significant 
> changes.

I've asked for recommendations or best practice similar to the SPDX
process. Something that TAB can acknowledge and that is perhaps also
consulted with lawyers. And understood within the linux project,
not just that some dudes have an argument because it's all clear as mud
and people are used to do things differently.

The link from linux foundation blog is nice but unless this is codified
into the process it's just somebody's blog post. Also there's a paragraph
about "Why not list every copyright holder?" that covers several points
why I don't want to do that.

But, if TAB says so I will do, perhaps spending hours of unproductive
time looking up the whole history of contributors and adding year, name,
company whatever to files.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-24 17:10       ` David Sterba
@ 2022-10-24 17:34         ` Chris Mason
  2022-10-24 22:18           ` Damien Le Moal
  2022-10-26  7:36         ` Johannes Thumshirn
  1 sibling, 1 reply; 108+ messages in thread
From: Chris Mason @ 2022-10-24 17:34 UTC (permalink / raw)
  To: dsterba
  Cc: Christoph Hellwig, Johannes Thumshirn, Chris Mason, Josef Bacik,
	David Sterba, Damien Le Moal, Naohiro Aota, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On 10/24/22 1:10 PM, David Sterba wrote:
> On Mon, Oct 24, 2022 at 11:25:04AM -0400, Chris Mason wrote:
>> On 10/24/22 10:44 AM, Christoph Hellwig wrote:
>>> On Mon, Oct 24, 2022 at 08:12:29AM +0000, Johannes Thumshirn wrote:
>>>> David, what's your plan to progress with this series?
>>>
>>> FYI, I object to merging any of my code into btrfs without a proper
>>> copyright notice, and I also need to find some time to remove my
>>> previous significant changes given that the btrfs maintainer
>>> refuses to take the proper and legally required copyright notice.
>>>
>>> So don't waste any of your time on this.
>>
>> Christoph's request is well within the norms for the kernel, given that
>> he's making substantial changes to these files.  I talked this over with
>> GregKH, who pointed me at:
>>
>> https://www.linuxfoundation.org/blog/blog/copyright-notices-in-open-source-software-projects
>>
>> Even if we'd taken up some of the other policies suggested by this doc,
>> I'd still defer to preferences of developers who have made significant
>> changes.
> 
> I've asked for recommendations or best practice similar to the SPDX
> process. Something that TAB can acknowledge and that is perhaps also
> consulted with lawyers. And understood within the linux project,
> not just that some dudes have an argument because it's all clear as mud
> and people are used to do things differently.

The LF in general doesn't give legal advice, but the link above does 
help describe common practices.

It's up to us to bring in our own lawyers and make decisions about the 
kinds of changes we're willing to accept.  We could ask the TAB (btw, 
I'm no longer on the TAB) to weigh in, but I think we'll find the normal 
variety of answers based on subsystem.

It's also up to contributors to decide on what kinds of requirements 
they want to place on continued participation.  Individuals and 
corporations have their own preferences based on advice from their 
lawyers, and as long as the change is significant, I think we can and 
should honor their wishes.

Does this mean going through and retroactively adding copyright lines? 
I'd really rather not.  If a major contributor comes in and shows a long 
list of commits and asks for a copyright line, I personally would say yes.

> 
> The link from linux foundation blog is nice but unless this is codified
> into the process it's just somebody's blog post. Also there's a paragraph
> about "Why not list every copyright holder?" that covers several points
> why I don't want to do that.

I'm also happy to gather advice about following the suggestions in the 
LF post.  I understand your concerns about listing every copyright 
holder, but I don't think this has been a major problem in the kernel in 
general.

> 
> But, if TAB says so I will do, perhaps spending hours of unproductive
> time looking up the whole history of contributors and adding year, name,
> company whatever to files.

I can't imagine anyone asking you to spend time this way.

-chris


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-24 17:34         ` Chris Mason
@ 2022-10-24 22:18           ` Damien Le Moal
  0 siblings, 0 replies; 108+ messages in thread
From: Damien Le Moal @ 2022-10-24 22:18 UTC (permalink / raw)
  To: Chris Mason, dsterba
  Cc: Christoph Hellwig, Johannes Thumshirn, Chris Mason, Josef Bacik,
	David Sterba, Damien Le Moal, Naohiro Aota, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On 10/25/22 02:34, Chris Mason wrote:
> On 10/24/22 1:10 PM, David Sterba wrote:
>> On Mon, Oct 24, 2022 at 11:25:04AM -0400, Chris Mason wrote:
>>> On 10/24/22 10:44 AM, Christoph Hellwig wrote:
>>>> On Mon, Oct 24, 2022 at 08:12:29AM +0000, Johannes Thumshirn wrote:
>>>>> David, what's your plan to progress with this series?
>>>>
>>>> FYI, I object to merging any of my code into btrfs without a proper
>>>> copyright notice, and I also need to find some time to remove my
>>>> previous significant changes given that the btrfs maintainer
>>>> refuses to take the proper and legally required copyright notice.
>>>>
>>>> So don't waste any of your time on this.
>>>
>>> Christoph's request is well within the norms for the kernel, given that
>>> he's making substantial changes to these files.  I talked this over with
>>> GregKH, who pointed me at:
>>>
>>> https://www.linuxfoundation.org/blog/blog/copyright-notices-in-open-source-software-projects
>>>
>>> Even if we'd taken up some of the other policies suggested by this doc,
>>> I'd still defer to preferences of developers who have made significant
>>> changes.
>>
>> I've asked for recommendations or best practice similar to the SPDX
>> process. Something that TAB can acknowledge and that is perhaps also
>> consulted with lawyers. And understood within the linux project,
>> not just that some dudes have an argument because it's all clear as mud
>> and people are used to do things differently.
> 
> The LF in general doesn't give legal advice, but the link above does 
> help describe common practices.
> 
> It's up to us to bring in our own lawyers and make decisions about the 
> kinds of changes we're willing to accept.  We could ask the TAB (btw, 
> I'm no longer on the TAB) to weigh in, but I think we'll find the normal 
> variety of answers based on subsystem.
> 
> It's also up to contributors to decide on what kinds of requirements 
> they want to place on continued participation.  Individuals and 
> corporations have their own preferences based on advice from their 
> lawyers, and as long as the change is significant, I think we can and 
> should honor their wishes.
> 
> Does this mean going through and retroactively adding copyright lines? 
> I'd really rather not.  If a major contributor comes in and shows a long 
> list of commits and asks for a copyright line, I personally would say yes.

I am not aware of any long list of copyright holders in kernel source code
files. I personally thought that the most common practice is to add a
copyright notice for the creator (or his/her employer) of a new source
file, or for someone who almost completely rewrites a file. That is, I
think, perfectly acceptable, as adding a new file generally means that the
contribution is substantial.

> 
>>
>> The link from linux foundation blog is nice but unless this is codified
>> into the process it's just somebody's blog post. Also there's a paragraph
>> about "Why not list every copyright holder?" that covers several points
>> why I don't want to do that.
> 
> I'm also happy to gather advice about following the suggestions in the 
> LF post.  I understand your concerns about listing every copyright 
> holder, but I don't think this has been a major problem in the kernel in 
> general.
> 
>>
>> But, if TAB says so I will do, perhaps spending hours of unproductive
>> time looking up the whole history of contributors and adding year, name,
>> company whatever to files.
> 
> I can't imagine anyone asking you to spend time this way.
> 
> -chris
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-24 17:10       ` David Sterba
  2022-10-24 17:34         ` Chris Mason
@ 2022-10-26  7:36         ` Johannes Thumshirn
  2022-10-26 11:41           ` Steven Rostedt
  1 sibling, 1 reply; 108+ messages in thread
From: Johannes Thumshirn @ 2022-10-26  7:36 UTC (permalink / raw)
  To: dsterba, Chris Mason, rostedt
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

[+Cc Steven ]

Steven, you're on the TAB, can you help with this issue?
Or bring it up with other TAB members?

Thanks :)

Full quote below for reference:

On 24.10.22 19:11, David Sterba wrote:
> On Mon, Oct 24, 2022 at 11:25:04AM -0400, Chris Mason wrote:
>> On 10/24/22 10:44 AM, Christoph Hellwig wrote:
>>> On Mon, Oct 24, 2022 at 08:12:29AM +0000, Johannes Thumshirn wrote:
>>>> David, what's your plan to progress with this series?
>>>
>>> FYI, I object to merging any of my code into btrfs without a proper
>>> copyright notice, and I also need to find some time to remove my
>>> previous significant changes given that the btrfs maintainer
>>> refuses to take the proper and legally required copyright notice.
>>>
>>> So don't waste any of your time on this.
>>
>> Christoph's request is well within the norms for the kernel, given that 
>> he's making substantial changes to these files.  I talked this over with 
>> GregKH, who pointed me at:
>>
>> https://www.linuxfoundation.org/blog/blog/copyright-notices-in-open-source-software-projects
>>
>> Even if we'd taken up some of the other policies suggested by this doc, 
>> I'd still defer to preferences of developers who have made significant 
>> changes.
> 
> I've asked for recommendations or best practice similar to the SPDX
> process. Something that TAB can acknowledge and that is perhaps also
> consulted with lawyers. And understood within the linux project,
> not just that some dudes have an argument because it's all clear as mud
> and people are used to do things differently.
> 
> The link from linux foundation blog is nice but unless this is codified
> into the process it's just somebody's blog post. Also there's a paragraph
> about "Why not list every copyright holder?" that covers several points
> why I don't want to do that.
> 
> But, if TAB says so I will do, perhaps spending hours of unproductive
> time looking up the whole history of contributors and adding year, name,
> company whatever to files.
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-26  7:36         ` Johannes Thumshirn
@ 2022-10-26 11:41           ` Steven Rostedt
  2022-10-27 13:54             ` Johannes Thumshirn
                               ` (2 more replies)
  0 siblings, 3 replies; 108+ messages in thread
From: Steven Rostedt @ 2022-10-26 11:41 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: dsterba, Chris Mason, Christoph Hellwig, Chris Mason,
	Josef Bacik, David Sterba, Damien Le Moal, Naohiro Aota,
	Qu Wenruo, Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Wed, 26 Oct 2022 07:36:45 +0000
Johannes Thumshirn <Johannes.Thumshirn@wdc.com> wrote:

> [+Cc Steven ]
> 
> Steven, you're on the TAB, can you help with this issue?
> Or bring it up with other TAB members?
> 

Well, Chris Mason was recently the TAB chair.

> Thanks :)
> 
> Full quote below for reference:
> 
> On 24.10.22 19:11, David Sterba wrote:
> > On Mon, Oct 24, 2022 at 11:25:04AM -0400, Chris Mason wrote:  
> >> On 10/24/22 10:44 AM, Christoph Hellwig wrote:  
> >>> On Mon, Oct 24, 2022 at 08:12:29AM +0000, Johannes Thumshirn wrote:  
> >>>> David, what's your plan to progress with this series?  
> >>>
> >>> FYI, I object to merging any of my code into btrfs without a proper
> >>> copyright notice, and I also need to find some time to remove my
> >>> previous significant changes given that the btrfs maintainer
> >>> refuses to take the proper and legally required copyright notice.
> >>>
> >>> So don't waste any of your time on this.  
> >>
> >> Christoph's request is well within the norms for the kernel, given that 
> >> he's making substantial changes to these files.  I talked this over with 
> >> GregKH, who pointed me at:
> >>
> >> https://www.linuxfoundation.org/blog/blog/copyright-notices-in-open-source-software-projects
> >>
> >> Even if we'd taken up some of the other policies suggested by this doc, 
> >> I'd still defer to preferences of developers who have made significant 
> >> changes.  
> > 
> > I've asked for recommendations or best practice similar to the SPDX
> > process. Something that TAB can acknowledge and that is perhaps also
> > consulted with lawyers. And understood within the linux project,
> > not just that some dudes have an argument because it's all clear as mud
> > and people are used to do things differently.
> > 
> > The link from linux foundation blog is nice but unless this is codified
> > into the process it's just somebody's blog post. Also there's a paragraph
> > about "Why not list every copyright holder?" that covers several points
> > why I don't want to do that.
> > 
> > But, if TAB says so I will do, perhaps spending hours of unproductive
> > time looking up the whole history of contributors and adding year, name,
> > company whatever to files.

There's no requirement to list every copyright holder, as most developers do
not require it for acceptance. The issue I see here is that there's someone
that does require it for you to accept their code.

The policy is simple. If someone requires a copyright notice for their
code, you simply add it, or do not take their code. You can be specific
about what that code is that is copyrighted. Perhaps just around the code in
question or a description at the top.

Looking over the thread, I'm still confused at what the issue is. Is it
that if you add one copyright notice you must do it for everyone else? Is
everyone else asking for it? If not, just add the one and be done with it.

-- Steve



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-26 11:41           ` Steven Rostedt
@ 2022-10-27 13:54             ` Johannes Thumshirn
  2022-10-31 12:19             ` David Sterba
  2022-11-11 17:57             ` David Sterba
  2 siblings, 0 replies; 108+ messages in thread
From: Johannes Thumshirn @ 2022-10-27 13:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: dsterba, Chris Mason, Christoph Hellwig, Chris Mason,
	Josef Bacik, David Sterba, Damien Le Moal, Naohiro Aota,
	Qu Wenruo, Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On 26.10.22 13:41, Steven Rostedt wrote:
> On Wed, 26 Oct 2022 07:36:45 +0000
> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> wrote:
> 
>> [+Cc Steven ]
>>
>> Steven, you're on the TAB, can you help with this issue?
>> Or bring it up with other TAB members?
>>
> 
> Well, Chris Mason was recently the TAB chair.
> 
>> Thanks :)
>>
>> Full quote below for reference:
>>
>> On 24.10.22 19:11, David Sterba wrote:
>>> On Mon, Oct 24, 2022 at 11:25:04AM -0400, Chris Mason wrote:  
>>>> On 10/24/22 10:44 AM, Christoph Hellwig wrote:  
>>>>> On Mon, Oct 24, 2022 at 08:12:29AM +0000, Johannes Thumshirn wrote:  
>>>>>> David, what's your plan to progress with this series?  
>>>>>
>>>>> FYI, I object to merging any of my code into btrfs without a proper
>>>>> copyright notice, and I also need to find some time to remove my
>>>>> previous significant changes given that the btrfs maintainer
>>>>> refuses to take the proper and legally required copyright notice.
>>>>>
>>>>> So don't waste any of your time on this.  
>>>>
>>>> Christoph's request is well within the norms for the kernel, given that 
>>>> he's making substantial changes to these files.  I talked this over with 
>>>> GregKH, who pointed me at:
>>>>
>>>> https://www.linuxfoundation.org/blog/blog/copyright-notices-in-open-source-software-projects
>>>>
>>>> Even if we'd taken up some of the other policies suggested by this doc, 
>>>> I'd still defer to preferences of developers who have made significant 
>>>> changes.  
>>>
>>> I've asked for recommendations or best practice similar to the SPDX
>>> process. Something that TAB can acknowledge and that is perhaps also
>>> consulted with lawyers. And understood within the linux project,
>>> not just that some dudes have an argument because it's all clear as mud
>>> and people are used to do things differently.
>>>
>>> The link from linux foundation blog is nice but unless this is codified
>>> into the process it's just somebody's blog post. Also there's a paragraph
>>> about "Why not list every copyright holder?" that covers several points
>>> why I don't want to do that.
>>>
>>> But, if TAB says so I will do, perhaps spending hours of unproductive
>>> time looking up the whole history of contributors and adding year, name,
>>> company whatever to files.
> 
> There's no requirement to list every copyright holder, as most developers do
> not require it for acceptance. The issue I see here is that there's someone
> that does require it for you to accept their code.
> 
> The policy is simple. If someone requires a copyright notice for their
> code, you simply add it, or do not take their code. You can be specific
> about what that code is that is copyrighted. Perhaps just around the code in
> question or a description at the top.
> 
> Looking over the thread, I'm still confused at what the issue is. Is it
> that if you add one copyright notice you must do it for everyone else? Is
> everyone else asking for it? If not, just add the one and be done with it.

Thanks a lot Steve.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-26 11:41           ` Steven Rostedt
  2022-10-27 13:54             ` Johannes Thumshirn
@ 2022-10-31 12:19             ` David Sterba
  2022-10-31 16:06               ` Chris Mason
                                 ` (2 more replies)
  2022-11-11 17:57             ` David Sterba
  2 siblings, 3 replies; 108+ messages in thread
From: David Sterba @ 2022-10-31 12:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Johannes Thumshirn, dsterba, Chris Mason, Christoph Hellwig,
	Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Wed, Oct 26, 2022 at 07:41:45AM -0400, Steven Rostedt wrote:
> On Wed, 26 Oct 2022 07:36:45 +0000
> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> wrote:
> > On 24.10.22 19:11, David Sterba wrote:
> > > On Mon, Oct 24, 2022 at 11:25:04AM -0400, Chris Mason wrote:  
> > >> On 10/24/22 10:44 AM, Christoph Hellwig wrote:  
> > >>> On Mon, Oct 24, 2022 at 08:12:29AM +0000, Johannes Thumshirn wrote:  
> > >>>> David, what's your plan to progress with this series?  
> > >>>
> > >>> FYI, I object to merging any of my code into btrfs without a proper
> > >>> copyright notice, and I also need to find some time to remove my
> > >>> previous significant changes given that the btrfs maintainer
> > >>> refuses to take the proper and legally required copyright notice.
> > >>>
> > >>> So don't waste any of your time on this.  
> > >>
> > >> Christoph's request is well within the norms for the kernel, given that 
> > >> he's making substantial changes to these files.  I talked this over with 
> > >> GregKH, who pointed me at:
> > >>
> > >> https://www.linuxfoundation.org/blog/blog/copyright-notices-in-open-source-software-projects
> > >>
> > >> Even if we'd taken up some of the other policies suggested by this doc, 
> > >> I'd still defer to preferences of developers who have made significant 
> > >> changes.  
> > > 
> > > I've asked for recommendations or best practice similar to the SPDX
> > > process. Something that TAB can acknowledge and that is perhaps also
> > > consulted with lawyers. And understood within the linux project,
> > > not just that some dudes have an argument because it's all clear as mud
> > > and people are used to do things differently.
> > > 
> > > The link from linux foundation blog is nice but unless this is codified
> > > into the process it's just somebody's blog post. Also there's a paragraph
> > > about "Why not list every copyright holder?" that covers several points
> > > why I don't want to do that.
> > > 
> > > But, if TAB says so I will do, perhaps spending hours of unproductive
> > > time looking up the whole history of contributors and adding year, name,
> > > company whatever to files.
> 
> There's no requirement to list every copyright holder, as most developers do
> not require it for acceptance. The issue I see here is that there's someone
> that does require it for you to accept their code.

That it is a hard requirement this time is a first for me while
acting as maintainer. In past years we had new code where I asked if the
notice needed to be there and asked for a resend without it. The reason is
that we have git and the complete change history, but that is apparently not
sufficient for everybody.

> The policy is simple. If someone requires a copyright notice for their
> code, you simply add it, or do not take their code. You can be specific
> about what that code is that is copyrighted. Perhaps just around the code in
> question or a description at the top.

Let's say it's OK for a substantial amount of code. What if somebody
moves existing code that he did not write to a new file and adds a
copyright notice? We got stuck there; both sides have a different answer.
I see it at minimum as unfair to the original code authors, if not
completely wrong, because it could appear as "stealing" ownership.

> Looking over the thread, I'm still confused at what the issue is. Is it
> that if you add one copyright notice you must do it for everyone else? Is
> everyone else asking for it? If not, just add the one and be done with it.

My motivation is to be fair to all contributors and stick to the project
standards (ideally defined in process). Adding a copyright notice after
several years of not taking them would rightfully raise questions from
past and current contributors about who deserves to be mentioned as a
copyright holder.

This leaves me with 'all or nothing', where 'all' means to add the
notices where applicable and we can continue perhaps with more
contributions in the future. But that'll cost time and require working
out how to do it so that everybody is satisfied with the result.

You may have missed the start of the discussions, https://lore.kernel.org/all/20220909101521.GS32411@twin.jikos.cz/ ,
Bradley Kuhn's reply https://lore.kernel.org/all/YyfNMcUM+OHn5qi8@ebb.org/ ,
and the documented position on the notices https://btrfs.wiki.kernel.org/index.php/Developer%27s_FAQ#Copyright_notices_in_files.2C_SPDX .

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-31 12:19             ` David Sterba
@ 2022-10-31 16:06               ` Chris Mason
  2022-11-02  4:00               ` Steven Rostedt
  2022-11-03  2:54               ` Theodore Ts'o
  2 siblings, 0 replies; 108+ messages in thread
From: Chris Mason @ 2022-10-31 16:06 UTC (permalink / raw)
  To: dsterba, Steven Rostedt
  Cc: Johannes Thumshirn, Christoph Hellwig, Chris Mason, Josef Bacik,
	David Sterba, Damien Le Moal, Naohiro Aota, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On 10/31/22 8:19 AM, David Sterba wrote:
> On Wed, Oct 26, 2022 at 07:41:45AM -0400, Steven Rostedt wrote:
>> On Wed, 26 Oct 2022 07:36:45 +0000
>> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> wrote:
>>> On 24.10.22 19:11, David Sterba wrote:
>>>> On Mon, Oct 24, 2022 at 11:25:04AM -0400, Chris Mason wrote:
>>>>> On 10/24/22 10:44 AM, Christoph Hellwig wrote:
>>>>>> On Mon, Oct 24, 2022 at 08:12:29AM +0000, Johannes Thumshirn wrote:
>>>>>>> David, what's your plan to progress with this series?
>>>>>>
>>>>>> FYI, I object to merging any of my code into btrfs without a proper
>>>>>> copyright notice, and I also need to find some time to remove my
>>>>>> previous significant changes given that the btrfs maintainer
>>>>>> refuses to take the proper and legally required copyright notice.
>>>>>>
>>>>>> So don't waste any of your time on this.
>>>>>
>>>>> Christoph's request is well within the norms for the kernel, given that
>>>>> he's making substantial changes to these files.  I talked this over with
>>>>> GregKH, who pointed me at:
>>>>>
>>>>> https://www.linuxfoundation.org/blog/blog/copyright-notices-in-open-source-software-projects
>>>>> Even if we'd taken up some of the other policies suggested by this doc,
>>>>> I'd still defer to preferences of developers who have made significant
>>>>> changes.
>>>>
>>>> I've asked for recommendations or best practice similar to the SPDX
>>>> process. Something that TAB can acknowledge and that is perhaps also
>>>> consulted with lawyers. And understood within the linux project,
>>>> not just that some dudes have an argument because it's all clear as mud
>>>> and people are used to do things differently.
>>>>
>>>> The link from linux foundation blog is nice but unless this is codified
>>>> into the process it's just somebody's blog post. Also there's a paragraph
>>>> about "Why not list every copyright holder?" that covers several points
>>>> why I don't want to do that.
>>>>
>>>> But, if TAB says so I will do, perhaps spending hours of unproductive
>>>> time looking up the whole history of contributors and adding year, name,
>>>> company whatever to files.
>>
>> There's no requirement to list every copyright holder, as most developers do
>> not require it for acceptance. The issue I see here is that there's someone
>> that does require it for you to accept their code.
> 
> That it is a hard requirement this time is a first for me while
> acting as maintainer. In past years we had new code where I asked if the
> notice needed to be there and asked for a resend without it. The reason is
> that we have git and the complete change history, but that is apparently not
> sufficient for everybody.
> 
>> The policy is simple. If someone requires a copyright notice for their
>> code, you simply add it, or do not take their code. You can be specific
>> about what that code is that is copyrighted. Perhaps just around the code in
>> question or a description at the top.
> 
> Let's say it's OK for a substantial amount of code. What if somebody
> moves existing code that he did not write to a new file and adds a
> copyright notice? We got stuck there; both sides have a different answer.
> I see it at minimum as unfair to the original code authors, if not
> completely wrong, because it could appear as "stealing" ownership.

One option is to add a copyright line as suggested by the
LF blog post "Copyright The Btrfs Contributors", and to make it clear
the original authors of the old file are welcome to send patches
if they feel it is required.

> 
>> Looking over the thread, I'm still confused at what the issue is. Is it
>> that if you add one copyright notice you must do it for everyone else? Is
>> everyone else asking for it? If not, just add the one and be done with it.
> 
> My motivation is to be fair to all contributors and stick to the project
> standards (ideally defined in process). Adding a copyright notice after
> several years of not taking them would rightfully raise questions from
> past and current contributors about who deserves to be mentioned as a
> copyright holder.
> 
> This leaves me with 'all or nothing', where 'all' means to add the
> notices where applicable and we can continue perhaps with more
> contributions in the future. But that'll cost time and require working
> out how to do it so that everybody is satisfied with the result.

Everyone understands that you're trying to be fair, and I'm sure our
major contributors are happy to help.  I think the most reasonable
path forward is to add the blanket Btrfs Contributor copyright line
above and be open to additional lines for major changes (past or
present).

I'm definitely not suggesting that you (or anyone else) sit down with
git history and try to determine the perfect set of copyright lines for
past contributions.  It's just not required at all.

-chris


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-31 12:19             ` David Sterba
  2022-10-31 16:06               ` Chris Mason
@ 2022-11-02  4:00               ` Steven Rostedt
  2022-11-02  6:29                 ` Christoph Hellwig
  2022-11-03  2:54               ` Theodore Ts'o
  2 siblings, 1 reply; 108+ messages in thread
From: Steven Rostedt @ 2022-11-02  4:00 UTC (permalink / raw)
  To: David Sterba
  Cc: Johannes Thumshirn, Chris Mason, Christoph Hellwig, Chris Mason,
	Josef Bacik, David Sterba, Damien Le Moal, Naohiro Aota,
	Qu Wenruo, Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Mon, 31 Oct 2022 13:19:12 +0100
David Sterba <dsterba@suse.cz> wrote:

> > The policy is simple. If someone requires a copyright notice for their
> > code, you simply add it, or do not take their code. You can be specific
> > about what that code is that is copyrighted. Perhaps just around the code in
> > question or a description at the top.  
> 
> Let's say it's OK for a substantial amount of code. What if somebody
> moves existing code that he did not write to a new file and adds a
> copyright notice? We got stuck there; both sides have a different answer.
> I see it at minimum as unfair to the original code authors, if not
> completely wrong, because it could appear as "stealing" ownership.

Add the commit SHAs to the copyright, which will explicitly show the
actual code involved. As has been pointed out in other places, the git
commits themselves do not actually state who the copyright owner is.
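
Concretely, something like this at the top of the file would do
(illustrative only; the name, year and SHAs below are placeholders, not
real commits):

	/*
	 * Copyright (C) 2022 Jane Developer <jane@example.com>
	 *
	 * Covers the bio splitting and checksum code moved here from
	 * volumes.c in commits 123abc4 ("btrfs: example one") and
	 * 567def8 ("btrfs: example two").
	 */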

> 
> > Looking over the thread, I'm still confused at what the issue is. Is it
> > that if you add one copyright notice you must do it for everyone else? Is
> > everyone else asking for it? If not, just add the one and be done with it.  
> 
> My motivation is to be fair to all contributors and stick to the project
> standards (ideally defined in process). Adding a copyright notice after
> several years of not taking them would rightfully raise questions from
> past and current contributors what would deserve to be mentioned as
> copyright holders.

As I stated: "If someone requires a copyright notice for their code,
you simply add it, or do not take their code."

No one is forcing you to add the copyright. You have an alternative.
Don't take the code. If your subsystem's policy is that of not adding
copyright notices, then the submitters should honor it. I see Christoph
as being OK with his code not being accepted because of this policy.

Just like I will not submit to projects that require me to hand over my
copyright. It's their right to have that policy. It's my right not to
submit code to them. Or if I do submit, refuse to conform to their
policy, and have my code rejected because of it.

It really comes down to how badly do you want Christoph's code?

-- Steve

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-11-02  4:00               ` Steven Rostedt
@ 2022-11-02  6:29                 ` Christoph Hellwig
  2022-11-02 14:00                   ` Chris Mason
                                     ` (2 more replies)
  0 siblings, 3 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-11-02  6:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Sterba, Johannes Thumshirn, Chris Mason, Christoph Hellwig,
	Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Wed, Nov 02, 2022 at 12:00:22AM -0400, Steven Rostedt wrote:
> It really comes down to how badly do you want Christoph's code?

Well, Dave has made it clear implicitly that he doesn't seem to care about
it at all through all this.  The painful part is that I need to come up
with a series to revert all the code that he refused to add the notice
for, which is quite involved and includes various bug fixes.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-11-02  6:29                 ` Christoph Hellwig
@ 2022-11-02 14:00                   ` Chris Mason
  2022-11-02 14:05                   ` Josef Bacik
  2022-11-02 20:20                   ` Andreas Dilger
  2 siblings, 0 replies; 108+ messages in thread
From: Chris Mason @ 2022-11-02 14:00 UTC (permalink / raw)
  To: Christoph Hellwig, Steven Rostedt
  Cc: David Sterba, Johannes Thumshirn, Chris Mason, Josef Bacik,
	David Sterba, Damien Le Moal, Naohiro Aota, Qu Wenruo,
	Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On 11/2/22 2:29 AM, Christoph Hellwig wrote:
> On Wed, Nov 02, 2022 at 12:00:22AM -0400, Steven Rostedt wrote:
>> It really comes down to how badly do you want Christoph's code?
> 
> Well, Dave has made it clear implicitly that he doesn't seem to care about
> it at all through all this.  The painful part is that I need to come up
> with a series to revert all the code that he refused to add the notice
> for, which is quite involved and includes various bug fixes.

I think he's mostly focused on finding a solution that's fair to the
rest of the contributors.  I'll keep working with Dave on ways to
get the lines in.

-chris


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-11-02  6:29                 ` Christoph Hellwig
  2022-11-02 14:00                   ` Chris Mason
@ 2022-11-02 14:05                   ` Josef Bacik
  2022-11-02 14:06                     ` Christoph Hellwig
  2022-11-02 20:20                   ` Andreas Dilger
  2 siblings, 1 reply; 108+ messages in thread
From: Josef Bacik @ 2022-11-02 14:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Steven Rostedt, David Sterba, Johannes Thumshirn, Chris Mason,
	Chris Mason, David Sterba, Damien Le Moal, Naohiro Aota,
	Qu Wenruo, Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On Wed, Nov 02, 2022 at 07:29:07AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 02, 2022 at 12:00:22AM -0400, Steven Rostedt wrote:
> > It really comes down to how badly do you want Christoph's code?
> 
> Well, Dave has made it clear implicitly that he doesn't seem to care about
> it at all through all this.  The painful part is that I need to come up
> with a series to revert all the code that he refused to add the notice
> for, which is quite involved and includes various bug fixes.

Except he hasn't; he's clearly been trying to figure out what the best path
forward is by asking other people and pulling in the TAB.  I don't understand
why you're still being so hostile; clearly we're all trying to work on a
solution so we don't have to have this discussion in the future.  If you don't
want to contribute anymore then that's your choice, but Dave is clearly trying
to work towards a solution that works for everybody, and that includes taking
your copyright notices for your pending contributions.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-11-02 14:05                   ` Josef Bacik
@ 2022-11-02 14:06                     ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-11-02 14:06 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, Steven Rostedt, David Sterba,
	Johannes Thumshirn, Chris Mason, Chris Mason, David Sterba,
	Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On Wed, Nov 02, 2022 at 10:05:17AM -0400, Josef Bacik wrote:
> Except he hasn't; he's clearly been trying to figure out what the best path
> forward is by asking other people and pulling in the TAB.  I don't understand
> why you're still being so hostile; clearly we're all trying to work on a
> solution so we don't have to have this discussion in the future.  If you don't
> want to contribute anymore then that's your choice, but Dave is clearly trying
> to work towards a solution that works for everybody, and that includes taking
> your copyright notices for your pending contributions.  Thanks,

Because that is not my impression.  To me it very much looks like he is
looking for more and more escapes to say no after the initial one did
not work out.  Which is really frustrating, as btrfs has been making up
completely random rules with no precedent at all and then keeps going.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-11-02  6:29                 ` Christoph Hellwig
  2022-11-02 14:00                   ` Chris Mason
  2022-11-02 14:05                   ` Josef Bacik
@ 2022-11-02 20:20                   ` Andreas Dilger
  2022-11-02 22:07                     ` Chris Mason
  2 siblings, 1 reply; 108+ messages in thread
From: Andreas Dilger @ 2022-11-02 20:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Steven Rostedt, David Sterba, Johannes Thumshirn, Chris Mason,
	Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Nov 2, 2022, at 12:29 AM, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Wed, Nov 02, 2022 at 12:00:22AM -0400, Steven Rostedt wrote:
>> It really comes down to how badly do you want Christoph's code?
> 
> Well, Dave has made it clear implicitly that he doesn't seem to care about
> it at all through all this.  The painful part is that I need to come up
> with a series to revert all the code that he refused to add the notice
> for, which is quite involved and includes various bug fixes.

This may be an unpopular opinion for some, but since all of these previous
contributions to the kernel are under GPL, there is no "taking back the
older commits" from the btrfs code.  There is also no basis to prevent the
use/merge/rework or other modifications to GPL code, whether it is part of
btrfs or anywhere else in the kernel.  That is one of the strengths of the
GPL: you can't "take it back" after code has been released.  I don't
think anything David has done has violated the terms of the GPL itself.

David, as btrfs maintainer, doesn't even *have* to accept the patches to
revert changes to the btrfs code branch.  The only real option for Christoph
would be to choose not to contribute new fixes to btrfs in the future.


That said, it doesn't make sense to get into a pissing fight about this.
The best solution here is for Christoph and David to come to an amicable
agreement on what copyright notices David might accept into the btrfs
code.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-11-02 20:20                   ` Andreas Dilger
@ 2022-11-02 22:07                     ` Chris Mason
  2022-11-03  8:49                       ` Christoph Hellwig
  0 siblings, 1 reply; 108+ messages in thread
From: Chris Mason @ 2022-11-02 22:07 UTC (permalink / raw)
  To: Andreas Dilger, Christoph Hellwig
  Cc: Steven Rostedt, David Sterba, Johannes Thumshirn, Chris Mason,
	Josef Bacik, David Sterba, Damien Le Moal, Naohiro Aota,
	Qu Wenruo, Jens Axboe, Darrick J. Wong, linux-block, linux-btrfs,
	linux-fsdevel

On 11/2/22 4:20 PM, Andreas Dilger wrote:
> On Nov 2, 2022, at 12:29 AM, Christoph Hellwig <hch@lst.de> wrote:
>>
>> On Wed, Nov 02, 2022 at 12:00:22AM -0400, Steven Rostedt wrote:
>>> It really comes down to how badly do you want Christoph's code?
>>
>> Well, Dave has made it clear implicitly that he doesn't seem to care about
>> it at all through all this.  The painful part is that I need to come up
>> with a series to revert all the code that he refused to add the notice
>> for, which is quite involved and includes various bug fixes.
> 
> This may be an unpopular opinion for some, but since all of these previous
> contributions to the kernel are under GPL, there is no "taking back the
> older commits" from the btrfs code.  There is also no basis to prevent the
> use/merge/rework or other modifications to GPL code, whether it is part of
> btrfs or anywhere else in the kernel.  That is one of the strengths of the
> GPL: you can't "take it back" after code has been released.  I don't
> think anything David has done has violated the terms of the GPL itself.
> 
> David, as btrfs maintainer, doesn't even *have* to accept the patches to
> revert changes to the btrfs code branch.  The only real option for Christoph
> would be to choose not to contribute new fixes to btrfs in the future.
> 

This is all true, but it's definitely not the direction Sterba or any
of the other btrfs maintainers are going.  If it happened that way, I
wouldn't even blame someone for avoiding us in the future.

We'll never get that far because we've known each other a long time and
I know both Sterba and Christoph are working with good intent here.

> 
> That said, it doesn't make sense to get into a pissing fight about this.
> The best solution here is for Christoph and David to come to an amicable
> agreement on what copyright notices David might accept into the btrfs
> code.

We talked about this at the btrfs meeting today and I'm sure it'll get
resolved soon.

-chris

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-31 12:19             ` David Sterba
  2022-10-31 16:06               ` Chris Mason
  2022-11-02  4:00               ` Steven Rostedt
@ 2022-11-03  2:54               ` Theodore Ts'o
  2 siblings, 0 replies; 108+ messages in thread
From: Theodore Ts'o @ 2022-11-03  2:54 UTC (permalink / raw)
  To: David Sterba
  Cc: Steven Rostedt, Johannes Thumshirn, Chris Mason,
	Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On Mon, Oct 31, 2022 at 01:19:12PM +0100, David Sterba wrote:
> > The policy is simple. If someone requires a copyright notice for their
> > code, you simply add it, or do not take their code. You can be specific
> > about what that code is that is copyrighted. Perhaps just around the code in
> > question or a description at the top.
> 
> Let's say it's OK for a substantial amount of code. What if somebody
> moves existing code that he did not write to a new file and adds a
> copyright notice? We got stuck there; both sides have a different answer.
> I see it at minimum as unfair to the original code authors, if not
> completely wrong, because it could appear as "stealing" ownership.

So for whatever it's worth, my personal opinion is that it should be
the Maintainer's call, subject to override by Linus.  I don't think
it's really worthwhile to try to come up with a formal policy which
would need to define "substantial amount of code".  This is an area
where I think it makes sense to assume that Maintainers will be
reasonable and make decisions that make sense.  Ultimately, I think
Chris's phrasing is the one that makes sense.  How much do you want
the contribution?  If you want the contribution, and the contributor
requests that they want an explicit copyright notification --- make a
choice.  You can either tell Christoph no, and revert the changes, or
accept his request to include a copyright statement.

I understand your concern about "fairness" to other contributors ---
is this a hypothetical, or actual concern?  Are there other
significant contributors who have explicitly told you that they will
be mortally offended if Christoph's request is honored, and their code
contribution was not so recognized?  It's unclear to me whether this is a
theoretical or a practical concern.

If it is a practical concern, how many contributors have contributed
more than Christoph who have asked, and how many extra lines of
copyright statements are we talking about?   2?  3?  10?  100?

Remember, if someone sends you whitespace fixups or doubled word fixes
found using checkpatch, and demands a copyright acknowledgement,
you're free to reject the patch.  (Heck, some maintainers reject
checkpatch --file fixups on general principles.  :-) So this is why I
think the overall "Linux project standard" should be: "maintainer
discretion".

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-11-02 22:07                     ` Chris Mason
@ 2022-11-03  8:49                       ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2022-11-03  8:49 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andreas Dilger, Christoph Hellwig, Steven Rostedt, David Sterba,
	Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba,
	Damien Le Moal, Naohiro Aota, Qu Wenruo, Jens Axboe,
	Darrick J. Wong, linux-block, linux-btrfs, linux-fsdevel

On Wed, Nov 02, 2022 at 06:07:27PM -0400, Chris Mason wrote:
> We talked about this at the btrfs meeting today and I'm sure it'll get
> resolved soon.

Thanks.  That wasn't my impression so far, but I'm glad I was wrong.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: consolidate btrfs checksumming, repair and bio splitting
  2022-10-26 11:41           ` Steven Rostedt
  2022-10-27 13:54             ` Johannes Thumshirn
  2022-10-31 12:19             ` David Sterba
@ 2022-11-11 17:57             ` David Sterba
  2 siblings, 0 replies; 108+ messages in thread
From: David Sterba @ 2022-11-11 17:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Johannes Thumshirn, dsterba, Chris Mason, Christoph Hellwig,
	Chris Mason, Josef Bacik, David Sterba, Damien Le Moal,
	Naohiro Aota, Qu Wenruo, Jens Axboe, Darrick J. Wong,
	linux-block, linux-btrfs, linux-fsdevel

On Wed, Oct 26, 2022 at 07:41:45AM -0400, Steven Rostedt wrote:
> On Wed, 26 Oct 2022 07:36:45 +0000
> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> wrote:
> 
> > [+Cc Steven ]
> > 
> > Steven, you're on the TAB, can you help with this issue?
> > Or bring it up with other TAB members?

Let me reply here in summary, based on what was discussed in the btrfs
developer group:

I got an answer from the LF that I will use for contributions: 'add
copyright or reject patches'. Thank you, Steve. Btrfs stays open to
contributions; valid copyright notices will be added on request. I'll
update the wiki to reflect the status. Patches from Christoph are in
the backlog and are planned for merge.
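
For illustration, and assuming the usual kernel conventions, such a
notice is a single extra line in a file's top-of-file comment, e.g.:

    // SPDX-License-Identifier: GPL-2.0
    /*
     * Copyright (C) 2007 Oracle.  All rights reserved.
     * Copyright (C) 2022 Christoph Hellwig.
     */

The existing notices stay in place and the requested one is added
alongside them.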

^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread

Thread overview: 108+ messages
2022-09-01  7:41 consolidate btrfs checksumming, repair and bio splitting Christoph Hellwig
2022-09-01  7:42 ` [PATCH 01/17] block: export bio_split_rw Christoph Hellwig
2022-09-01  8:02   ` Johannes Thumshirn
2022-09-01  8:54   ` Qu Wenruo
2022-09-05  6:44     ` Christoph Hellwig
2022-09-05  6:51       ` Qu Wenruo
2022-09-07 17:51   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 02/17] btrfs: stop tracking failed reads in the I/O tree Christoph Hellwig
2022-09-01  8:55   ` Qu Wenruo
2022-09-07 17:52   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 03/17] btrfs: move repair_io_failure to volumes.c Christoph Hellwig
2022-09-07 17:54   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 04/17] btrfs: handle checksum validation and repair at the storage layer Christoph Hellwig
2022-09-01  9:04   ` Qu Wenruo
2022-09-05  6:48     ` Christoph Hellwig
2022-09-05  6:59       ` Qu Wenruo
2022-09-05 14:31         ` Christoph Hellwig
2022-09-05 22:34           ` Qu Wenruo
2022-09-06  4:34             ` Christoph Hellwig
2022-09-07 18:15   ` Josef Bacik
2022-09-12 13:57     ` Christoph Hellwig
2022-09-01  7:42 ` [PATCH 05/17] btrfs: handle checksum generation in " Christoph Hellwig
2022-09-07 20:33   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 06/17] btrfs: handle recording of zoned writes " Christoph Hellwig
2022-09-01  9:44   ` Johannes Thumshirn
2022-09-07 20:36   ` Josef Bacik
2022-09-12  6:11   ` Naohiro Aota
2022-09-01  7:42 ` [PATCH 07/17] btrfs: allow btrfs_submit_bio to split bios Christoph Hellwig
2022-09-01  9:47   ` Johannes Thumshirn
2022-09-07 20:55   ` Josef Bacik
2022-09-12 13:58     ` Christoph Hellwig
2022-09-12  0:20   ` Qu Wenruo
2022-09-12 13:55     ` Christoph Hellwig
2022-09-12 22:23       ` Qu Wenruo
2022-09-01  7:42 ` [PATCH 08/17] btrfs: pass the iomap bio to btrfs_submit_bio Christoph Hellwig
2022-09-07 21:00   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 09/17] btrfs: remove stripe boundary calculation for buffered I/O Christoph Hellwig
2022-09-07 21:04   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 10/17] btrfs: remove stripe boundary calculation for compressed I/O Christoph Hellwig
2022-09-01  9:56   ` Johannes Thumshirn
2022-09-05  6:49     ` Christoph Hellwig
2022-09-07 21:07   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 11/17] btrfs: remove stripe boundary calculation for encoded I/O Christoph Hellwig
2022-09-01  9:58   ` Johannes Thumshirn
2022-09-07 21:08   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 12/17] btrfs: remove struct btrfs_io_geometry Christoph Hellwig
2022-09-07 21:10   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 13/17] btrfs: remove submit_encoded_read_bio Christoph Hellwig
2022-09-01 10:02   ` Johannes Thumshirn
2022-09-07 21:11   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 14/17] btrfs: remove now spurious bio submission helpers Christoph Hellwig
2022-09-01 10:14   ` Johannes Thumshirn
2022-09-07 21:12   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 15/17] btrfs: calculate file system wide queue limit for zoned mode Christoph Hellwig
2022-09-01 11:28   ` Johannes Thumshirn
2022-09-05  6:50     ` Christoph Hellwig
2022-09-02  1:56   ` Damien Le Moal
2022-09-02  1:59     ` Damien Le Moal
2022-09-05  6:54     ` Christoph Hellwig
2022-09-01  7:42 ` [PATCH 16/17] btrfs: split zone append bios in btrfs_submit_bio Christoph Hellwig
2022-09-02  1:46   ` Damien Le Moal
2022-09-05  6:55     ` Christoph Hellwig
2022-09-05 13:15   ` Johannes Thumshirn
2022-09-05 14:25     ` Christoph Hellwig
2022-09-05 14:31       ` Johannes Thumshirn
2022-09-05 14:39         ` Christoph Hellwig
2022-09-05 14:43           ` Johannes Thumshirn
2022-09-05 15:30           ` Johannes Thumshirn
2022-09-07 21:17   ` Josef Bacik
2022-09-01  7:42 ` [PATCH 17/17] iomap: remove IOMAP_F_ZONE_APPEND Christoph Hellwig
2022-09-01 10:46   ` Johannes Thumshirn
2022-09-02  1:38   ` Damien Le Moal
2022-09-05  6:50     ` Christoph Hellwig
2022-09-05  6:57       ` Damien Le Moal
2022-09-07 21:18   ` Josef Bacik
2022-09-02 15:18 ` consolidate btrfs checksumming, repair and bio splitting Johannes Thumshirn
2022-09-07  9:10 ` code placement for bio / storage layer code Christoph Hellwig
2022-09-07  9:46   ` Johannes Thumshirn
2022-09-07 10:28   ` Qu Wenruo
2022-09-07 11:10     ` Christoph Hellwig
2022-09-07 11:27       ` Qu Wenruo
2022-09-07 11:35         ` Christoph Hellwig
2022-10-10  8:01   ` Johannes Thumshirn
2022-10-24  8:12 ` consolidate btrfs checksumming, repair and bio splitting Johannes Thumshirn
2022-10-24  8:20   ` Qu Wenruo
2022-10-24  9:07     ` Johannes Thumshirn
2022-10-24  9:18       ` Qu Wenruo
2022-10-24 10:21         ` Johannes Thumshirn
2022-10-24 14:44   ` Christoph Hellwig
2022-10-24 15:25     ` Chris Mason
2022-10-24 17:10       ` David Sterba
2022-10-24 17:34         ` Chris Mason
2022-10-24 22:18           ` Damien Le Moal
2022-10-26  7:36         ` Johannes Thumshirn
2022-10-26 11:41           ` Steven Rostedt
2022-10-27 13:54             ` Johannes Thumshirn
2022-10-31 12:19             ` David Sterba
2022-10-31 16:06               ` Chris Mason
2022-11-02  4:00               ` Steven Rostedt
2022-11-02  6:29                 ` Christoph Hellwig
2022-11-02 14:00                   ` Chris Mason
2022-11-02 14:05                   ` Josef Bacik
2022-11-02 14:06                     ` Christoph Hellwig
2022-11-02 20:20                   ` Andreas Dilger
2022-11-02 22:07                     ` Chris Mason
2022-11-03  8:49                       ` Christoph Hellwig
2022-11-03  2:54               ` Theodore Ts'o
2022-11-11 17:57             ` David Sterba
