* don't split ordered_extents for zoned writes at I/O submission time
@ 2023-05-24 15:03 Christoph Hellwig
  2023-05-24 15:03 ` [PATCH 01/14] btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED Christoph Hellwig
                   ` (15 more replies)
  0 siblings, 16 replies; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

Hi all,

this series removes the unconditional splitting of ordered extents for
zoned writes at I/O submission time, and instead splits them only when
actually needed in the I/O completion handler.

This is something I've been wanting to do for a while, as it is a lot
more efficient, but it is also really needed to make the ordered_extent
pointer in the btrfs_bio work for zoned file systems, which made it a bit
more urgent.  This series also borrows two patches from the series that
adds the ordered_extent pointer to the btrfs_bio.

Currently, due to the submission-time splitting, there is one extra
lookup in both the ordered extent tree and the inode extent tree for each
bio submission, an extra ordered extent lookup in the bio completion
handler, and two extent tree lookups per bio / split ordered extent in
the ordered extent completion handler.  With this series this is reduced
to a single inode extent tree lookup in the best case, with an extra
inode extent tree and ordered extent lookup only when a split actually
needs to occur due to reordering inside an ordered_extent.

Diffstat:
 bio.c          |   20 -----
 bio.h          |   17 +++-
 btrfs_inode.h  |    2 
 extent_map.c   |  101 ++++++++++++++++++++++++++--
 extent_map.h   |    6 -
 file-item.c    |   15 +++-
 fs.h           |    2 
 inode.c        |  138 +++++++-------------------------------
 ordered-data.c |  205 +++++++++++++++++++++++++++++++++++----------------------
 ordered-data.h |   22 ++----
 relocation.c   |    4 -
 tree-log.c     |   12 +--
 zoned.c        |   94 ++++++++++++++------------
 zoned.h        |    6 -
 14 files changed, 352 insertions(+), 292 deletions(-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 01/14] btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25  8:33   ` Johannes Thumshirn
  2023-05-24 15:03 ` [PATCH 02/14] btrfs: don't call btrfs_record_physical_zoned for failed append Christoph Hellwig
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

Add an IS_ENABLED check for CONFIG_BLK_DEV_ZONED in addition to the
run-time check for the zone size.  This allows the compiler to eliminate
dead code guarded by btrfs_is_zoned, and, for example, to provide just a
dangling prototype for a function instead of adding a stub.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 840e4def18b519..5dd24c2916a1f3 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -853,7 +853,7 @@ static inline u64 btrfs_calc_metadata_size(const struct btrfs_fs_info *fs_info,
 
 static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
 {
-	return fs_info->zone_size > 0;
+	return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && fs_info->zone_size > 0;
 }
 
 /*
-- 
2.39.2



* [PATCH 02/14] btrfs: don't call btrfs_record_physical_zoned for failed append
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
  2023-05-24 15:03 ` [PATCH 01/14] btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25  8:33   ` Johannes Thumshirn
  2023-05-24 15:03 ` [PATCH 03/14] btrfs: mark the len field in struct btrfs_ordered_sum as unsigned Christoph Hellwig
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

When a zoned append command fails there is no written address reported,
so don't try to record it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 4ecc5f31c0190e..5fad6e032e6c76 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -348,7 +348,7 @@ static void btrfs_simple_end_io(struct bio *bio)
 		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
 		queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
 	} else {
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND && !bio->bi_status)
 			btrfs_record_physical_zoned(bbio);
 		btrfs_orig_bbio_end_io(bbio);
 	}
-- 
2.39.2



* [PATCH 03/14] btrfs: mark the len field in struct btrfs_ordered_sum as unsigned
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
  2023-05-24 15:03 ` [PATCH 01/14] btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED Christoph Hellwig
  2023-05-24 15:03 ` [PATCH 02/14] btrfs: don't call btrfs_record_physical_zoned for failed append Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25  8:33   ` Johannes Thumshirn
  2023-05-30 16:45   ` David Sterba
  2023-05-24 15:03 ` [PATCH 04/14] btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical Christoph Hellwig
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

len can't ever be negative, so mark it as a u32 instead of an int.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ordered-data.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index f0f1138d23c331..2e54820a5e6ff7 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -20,7 +20,7 @@ struct btrfs_ordered_sum {
 	/*
 	 * this is the length in bytes covered by the sums array below.
 	 */
-	int len;
+	u32 len;
 	struct list_head list;
 	/* last field is a variable length array of csums */
 	u8 sums[];
-- 
2.39.2



* [PATCH 04/14] btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (2 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 03/14] btrfs: mark the len field in struct btrfs_ordered_sum as unsigned Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25  8:33   ` Johannes Thumshirn
  2023-05-24 15:03 ` [PATCH 05/14] btrfs: optimize the logical to physical mapping for zoned writes Christoph Hellwig
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

->bytenr in struct btrfs_ordered_sum stores a logical address.  Make that
clear by renaming it to ->logical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/file-item.c    |  8 ++++----
 fs/btrfs/inode.c        |  2 +-
 fs/btrfs/ordered-data.h |  8 ++++----
 fs/btrfs/relocation.c   |  4 ++--
 fs/btrfs/tree-log.c     | 12 ++++++------
 fs/btrfs/zoned.c        |  4 ++--
 6 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index e74b9804bcdec8..ca321126816e94 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -560,7 +560,7 @@ int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
 				goto fail;
 			}
 
-			sums->bytenr = start;
+			sums->logical = start;
 			sums->len = (int)size;
 
 			offset = bytes_to_csum_size(fs_info, start - key.offset);
@@ -749,7 +749,7 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_bio *bbio)
 	sums->len = bio->bi_iter.bi_size;
 	INIT_LIST_HEAD(&sums->list);
 
-	sums->bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	index = 0;
 
 	shash->tfm = fs_info->csum_shash;
@@ -799,7 +799,7 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_bio *bbio)
 				ordered = btrfs_lookup_ordered_extent(inode,
 								offset);
 				ASSERT(ordered); /* Logic error */
-				sums->bytenr = (bio->bi_iter.bi_sector << SECTOR_SHIFT)
+				sums->logical = (bio->bi_iter.bi_sector << SECTOR_SHIFT)
 					+ total_bytes;
 				index = 0;
 			}
@@ -1086,7 +1086,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 again:
 	next_offset = (u64)-1;
 	found_next = 0;
-	bytenr = sums->bytenr + total_bytes;
+	bytenr = sums->logical + total_bytes;
 	file_key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
 	file_key.offset = bytenr;
 	file_key.type = BTRFS_EXTENT_CSUM_KEY;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2e83fb22505261..253ad6206179ce 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2850,7 +2850,7 @@ static int add_pending_csums(struct btrfs_trans_handle *trans,
 		trans->adding_csums = true;
 		if (!csum_root)
 			csum_root = btrfs_csum_root(trans->fs_info,
-						    sum->bytenr);
+						    sum->logical);
 		ret = btrfs_csum_file_blocks(trans, csum_root, sum);
 		trans->adding_csums = false;
 		if (ret)
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 2e54820a5e6ff7..ebc980ac967ad4 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -14,13 +14,13 @@ struct btrfs_ordered_inode_tree {
 };
 
 struct btrfs_ordered_sum {
-	/* bytenr is the start of this extent on disk */
-	u64 bytenr;
-
 	/*
-	 * this is the length in bytes covered by the sums array below.
+	 * Logical start address and length of the blocks covered by
+	 * the sums array.
 	 */
+	u64 logical;
 	u32 len;
+
 	struct list_head list;
 	/* last field is a variable length array of csums */
 	u8 sums[];
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 38cfbd38a81990..770a7ffaf22425 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4379,8 +4379,8 @@ int btrfs_reloc_clone_csums(struct btrfs_inode *inode, u64 file_pos, u64 len)
 		 * disk_len vs real len like with real inodes since it's all
 		 * disk length.
 		 */
-		new_bytenr = ordered->disk_bytenr + sums->bytenr - disk_bytenr;
-		sums->bytenr = new_bytenr;
+		new_bytenr = ordered->disk_bytenr + sums->logical - disk_bytenr;
+		sums->logical = new_bytenr;
 
 		btrfs_add_ordered_sum(ordered, sums);
 	}
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index ecb73da5d25aa3..f91a6175fd117e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -859,10 +859,10 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 						struct btrfs_ordered_sum,
 						list);
 				csum_root = btrfs_csum_root(fs_info,
-							    sums->bytenr);
+							    sums->logical);
 				if (!ret)
 					ret = btrfs_del_csums(trans, csum_root,
-							      sums->bytenr,
+							      sums->logical,
 							      sums->len);
 				if (!ret)
 					ret = btrfs_csum_file_blocks(trans,
@@ -4221,7 +4221,7 @@ static int log_csums(struct btrfs_trans_handle *trans,
 		     struct btrfs_root *log_root,
 		     struct btrfs_ordered_sum *sums)
 {
-	const u64 lock_end = sums->bytenr + sums->len - 1;
+	const u64 lock_end = sums->logical + sums->len - 1;
 	struct extent_state *cached_state = NULL;
 	int ret;
 
@@ -4239,7 +4239,7 @@ static int log_csums(struct btrfs_trans_handle *trans,
 	 * file which happens to refer to the same extent as well. Such races
 	 * can leave checksum items in the log with overlapping ranges.
 	 */
-	ret = lock_extent(&log_root->log_csum_range, sums->bytenr, lock_end,
+	ret = lock_extent(&log_root->log_csum_range, sums->logical, lock_end,
 			  &cached_state);
 	if (ret)
 		return ret;
@@ -4252,11 +4252,11 @@ static int log_csums(struct btrfs_trans_handle *trans,
 	 * some checksums missing in the fs/subvolume tree. So just delete (or
 	 * trim and adjust) any existing csum items in the log for this range.
 	 */
-	ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len);
+	ret = btrfs_del_csums(trans, log_root, sums->logical, sums->len);
 	if (!ret)
 		ret = btrfs_csum_file_blocks(trans, log_root, sums);
 
-	unlock_extent(&log_root->log_csum_range, sums->bytenr, lock_end,
+	unlock_extent(&log_root->log_csum_range, sums->logical, lock_end,
 		      &cached_state);
 
 	return ret;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index bda301a55cbe3b..76411d96fa7afc 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1711,9 +1711,9 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
 
 	list_for_each_entry(sum, &ordered->list, list) {
 		if (logical < orig_logical)
-			sum->bytenr -= orig_logical - logical;
+			sum->logical -= orig_logical - logical;
 		else
-			sum->bytenr += logical - orig_logical;
+			sum->logical += logical - orig_logical;
 	}
 }
 
-- 
2.39.2



* [PATCH 05/14] btrfs: optimize the logical to physical mapping for zoned writes
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (3 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 04/14] btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25 10:56   ` Johannes Thumshirn
  2023-08-18 14:03   ` Naohiro Aota
  2023-05-24 15:03 ` [PATCH 06/14] btrfs: move split_extent_map to extent_map.c Christoph Hellwig
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

The current code to store the final logical to physical mapping for a
zone append write in the extent tree is rather inefficient.  It first has
to split the ordered extent so that there is one ordered extent per bio,
which allows it to look up the ordered extent on I/O completion in
btrfs_record_physical_zoned and store the physical LBA returned by the
block device driver in the ordered extent.

btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to
see what physical address the logical address for this bio / ordered
extent is mapped to, and then rewrite it in the extent tree.

To optimize this process, we can store the physical address assigned in
the chunk tree for the original logical address, together with a pointer
to the btrfs_ordered_sum structure, in the btrfs_bio structure, and then
use this information to rewrite the logical address in the
btrfs_ordered_sum structure directly at I/O completion time in
btrfs_record_physical_zoned.  btrfs_rewrite_logical_zoned then simply
updates the logical address in the extent tree and the ordered_extent
itself.

The code in btrfs_rewrite_logical_zoned now runs for all data I/O
completions in zoned file systems, which is fine as there is no remapping
to do for non-append writes to conventional zones or for relocation, and
the overhead for quickly breaking out of the loop is very low.

Note that the btrfs_bio doesn't grow, as the new fields are placed into
a union that is so far not used for data writes and has plenty of space
left in it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/bio.c          |  1 +
 fs/btrfs/bio.h          | 17 +++++++++++---
 fs/btrfs/file-item.c    |  7 ++++++
 fs/btrfs/inode.c        |  6 +----
 fs/btrfs/ordered-data.c |  1 -
 fs/btrfs/ordered-data.h |  6 -----
 fs/btrfs/zoned.c        | 51 ++++++++---------------------------------
 7 files changed, 33 insertions(+), 56 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 5fad6e032e6c76..8a4d3b707dd1b2 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -431,6 +431,7 @@ static void btrfs_submit_dev_bio(struct btrfs_device *dev, struct bio *bio)
 		u64 zone_start = round_down(physical, dev->fs_info->zone_size);
 
 		ASSERT(btrfs_dev_is_sequential(dev, physical));
+		btrfs_bio(bio)->orig_physical = physical;
 		bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
 	}
 	btrfs_debug_in_rcu(dev->fs_info,
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index a8eca3a6567320..8a29980159b404 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -39,8 +39,8 @@ struct btrfs_bio {
 
 	union {
 		/*
-		 * Data checksumming and original I/O information for internal
-		 * use in the btrfs_submit_bio machinery.
+		 * For data reads: checksumming and original I/O information.
+		 * (for internal use in the btrfs_submit_bio machinery only)
 		 */
 		struct {
 			u8 *csum;
@@ -48,7 +48,18 @@ struct btrfs_bio {
 			struct bvec_iter saved_iter;
 		};
 
-		/* For metadata parentness verification. */
+		/*
+		 * For data writes:
+		 * - pointer to the checksums for this bio
+		 * - original physical address from the allocator
+		 *   (for zone append only)
+		 */
+		struct {
+			struct btrfs_ordered_sum *sums;
+			u64 orig_physical;
+		};
+
+		/* For metadata reads: parentness verification. */
 		struct btrfs_tree_parent_check parent_check;
 	};
 
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index ca321126816e94..96b54d73ba376d 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -818,6 +818,13 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_bio *bbio)
 
 	}
 	this_sum_bytes = 0;
+
+	/*
+	 * The ->sums assignment is for zoned writes, where a bio never spans
+	 * ordered extents and is only done unconditionally because that's cheaper
+	 * than a branch.
+	 */
+	bbio->sums = sums;
 	btrfs_add_ordered_sum(ordered, sums);
 	btrfs_put_ordered_extent(ordered);
 	return 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 253ad6206179ce..2eee57d07d3632 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3302,14 +3302,10 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
-	/* A valid ->physical implies a write on a sequential zone. */
-	if (ordered_extent->physical != (u64)-1) {
+	if (btrfs_is_zoned(fs_info)) {
 		btrfs_rewrite_logical_zoned(ordered_extent);
 		btrfs_zone_finish_endio(fs_info, ordered_extent->disk_bytenr,
 					ordered_extent->disk_num_bytes);
-	} else if (btrfs_is_data_reloc_root(inode->root)) {
-		btrfs_zone_finish_endio(fs_info, ordered_extent->disk_bytenr,
-					ordered_extent->disk_num_bytes);
 	}
 
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a9778a91511e19..324a5a8c844a72 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -209,7 +209,6 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
 	entry->qgroup_rsv = ret;
-	entry->physical = (u64)-1;
 
 	ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
 	entry->flags = flags;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index ebc980ac967ad4..dc700aa515b58b 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -151,12 +151,6 @@ struct btrfs_ordered_extent {
 	struct completion completion;
 	struct btrfs_work flush_work;
 	struct list_head work_list;
-
-	/*
-	 * Used to reverse-map physical address returned from ZONE_APPEND write
-	 * command in a workqueue context
-	 */
-	u64 physical;
 };
 
 static inline void
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 76411d96fa7afc..e838c2037634c2 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1657,64 +1657,33 @@ bool btrfs_use_zone_append(struct btrfs_bio *bbio)
 void btrfs_record_physical_zoned(struct btrfs_bio *bbio)
 {
 	const u64 physical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
-	struct btrfs_ordered_extent *ordered;
+	struct btrfs_ordered_sum *sum = bbio->sums;
 
-	ordered = btrfs_lookup_ordered_extent(bbio->inode, bbio->file_offset);
-	if (WARN_ON(!ordered))
-		return;
-
-	ordered->physical = physical;
-	btrfs_put_ordered_extent(ordered);
+	if (physical < bbio->orig_physical)
+		sum->logical -= bbio->orig_physical - physical;
+	else
+		sum->logical += physical - bbio->orig_physical;
 }
 
 void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
 {
-	struct btrfs_inode *inode = BTRFS_I(ordered->inode);
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct extent_map_tree *em_tree;
+	struct extent_map_tree *em_tree = &BTRFS_I(ordered->inode)->extent_tree;
 	struct extent_map *em;
-	struct btrfs_ordered_sum *sum;
-	u64 orig_logical = ordered->disk_bytenr;
-	struct map_lookup *map;
-	u64 physical = ordered->physical;
-	u64 chunk_start_phys;
-	u64 logical;
-
-	em = btrfs_get_chunk_map(fs_info, orig_logical, 1);
-	if (IS_ERR(em))
-		return;
-	map = em->map_lookup;
-	chunk_start_phys = map->stripes[0].physical;
+	struct btrfs_ordered_sum *sum =
+		list_first_entry(&ordered->list, typeof(*sum), list);
+	u64 logical = sum->logical;
 
-	if (WARN_ON_ONCE(map->num_stripes > 1) ||
-	    WARN_ON_ONCE((map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) != 0) ||
-	    WARN_ON_ONCE(physical < chunk_start_phys) ||
-	    WARN_ON_ONCE(physical > chunk_start_phys + em->orig_block_len)) {
-		free_extent_map(em);
-		return;
-	}
-	logical = em->start + (physical - map->stripes[0].physical);
-	free_extent_map(em);
-
-	if (orig_logical == logical)
+	if (ordered->disk_bytenr == logical)
 		return;
 
 	ordered->disk_bytenr = logical;
 
-	em_tree = &inode->extent_tree;
 	write_lock(&em_tree->lock);
 	em = search_extent_mapping(em_tree, ordered->file_offset,
 				   ordered->num_bytes);
 	em->block_start = logical;
 	free_extent_map(em);
 	write_unlock(&em_tree->lock);
-
-	list_for_each_entry(sum, &ordered->list, list) {
-		if (logical < orig_logical)
-			sum->logical -= orig_logical - logical;
-		else
-			sum->logical += logical - orig_logical;
-	}
 }
 
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
-- 
2.39.2



* [PATCH 06/14] btrfs: move split_extent_map to extent_map.c
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (4 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 05/14] btrfs: optimize the logical to physical mapping for zoned writes Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25 10:58   ` Johannes Thumshirn
  2023-05-24 15:03 ` [PATCH 07/14] btrfs: reorder btrfs_extract_ordered_extent Christoph Hellwig
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

split_extent_map doesn't have anything to do with the other code in
inode.c, so move it to extent_map.c.

This also allows marking replace_extent_mapping static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_map.c | 99 +++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/extent_map.h |  5 +--
 fs/btrfs/inode.c      | 90 ---------------------------------------
 3 files changed, 96 insertions(+), 98 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 138afa955370bc..21cc4991360221 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -502,10 +502,10 @@ void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
 	RB_CLEAR_NODE(&em->rb_node);
 }
 
-void replace_extent_mapping(struct extent_map_tree *tree,
-			    struct extent_map *cur,
-			    struct extent_map *new,
-			    int modified)
+static void replace_extent_mapping(struct extent_map_tree *tree,
+				   struct extent_map *cur,
+				   struct extent_map *new,
+				   int modified)
 {
 	lockdep_assert_held_write(&tree->lock);
 
@@ -959,3 +959,94 @@ int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
 
 	return ret;
 }
+
+/*
+ * Split off the first pre bytes from the extent_map at [start, start + len]
+ *
+ * This function is used when an ordered_extent needs to be split.
+ */
+int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre)
+{
+	struct extent_map_tree *em_tree = &inode->extent_tree;
+	struct extent_map *em;
+	struct extent_map *split_pre = NULL;
+	struct extent_map *split_mid = NULL;
+	int ret = 0;
+	unsigned long flags;
+
+	ASSERT(pre != 0);
+	ASSERT(pre < len);
+
+	split_pre = alloc_extent_map();
+	if (!split_pre)
+		return -ENOMEM;
+	split_mid = alloc_extent_map();
+	if (!split_mid) {
+		ret = -ENOMEM;
+		goto out_free_pre;
+	}
+
+	lock_extent(&inode->io_tree, start, start + len - 1, NULL);
+	write_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, start, len);
+	if (!em) {
+		ret = -EIO;
+		goto out_unlock;
+	}
+
+	ASSERT(em->len == len);
+	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
+	ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
+	ASSERT(test_bit(EXTENT_FLAG_PINNED, &em->flags));
+	ASSERT(!test_bit(EXTENT_FLAG_LOGGING, &em->flags));
+	ASSERT(!list_empty(&em->list));
+
+	flags = em->flags;
+	clear_bit(EXTENT_FLAG_PINNED, &em->flags);
+
+	/* First, replace the em with a new extent_map starting from * em->start */
+	split_pre->start = em->start;
+	split_pre->len = pre;
+	split_pre->orig_start = split_pre->start;
+	split_pre->block_start = em->block_start;
+	split_pre->block_len = split_pre->len;
+	split_pre->orig_block_len = split_pre->block_len;
+	split_pre->ram_bytes = split_pre->len;
+	split_pre->flags = flags;
+	split_pre->compress_type = em->compress_type;
+	split_pre->generation = em->generation;
+
+	replace_extent_mapping(em_tree, em, split_pre, 1);
+
+	/*
+	 * Now we only have an extent_map at:
+	 *     [em->start, em->start + pre]
+	 */
+
+	/* Insert the middle extent_map. */
+	split_mid->start = em->start + pre;
+	split_mid->len = em->len - pre;
+	split_mid->orig_start = split_mid->start;
+	split_mid->block_start = em->block_start + pre;
+	split_mid->block_len = split_mid->len;
+	split_mid->orig_block_len = split_mid->block_len;
+	split_mid->ram_bytes = split_mid->len;
+	split_mid->flags = flags;
+	split_mid->compress_type = em->compress_type;
+	split_mid->generation = em->generation;
+	add_extent_mapping(em_tree, split_mid, 1);
+
+	/* Once for us */
+	free_extent_map(em);
+	/* Once for the tree */
+	free_extent_map(em);
+
+out_unlock:
+	write_unlock(&em_tree->lock);
+	unlock_extent(&inode->io_tree, start, start + len - 1, NULL);
+	free_extent_map(split_mid);
+out_free_pre:
+	free_extent_map(split_pre);
+	return ret;
+}
+
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index ad311864272a00..7df39112388dde 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -90,10 +90,7 @@ struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
 int add_extent_mapping(struct extent_map_tree *tree,
 		       struct extent_map *em, int modified);
 void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em);
-void replace_extent_mapping(struct extent_map_tree *tree,
-			    struct extent_map *cur,
-			    struct extent_map *new,
-			    int modified);
+int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre);
 
 struct extent_map *alloc_extent_map(void);
 void free_extent_map(struct extent_map *em);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2eee57d07d3632..781bd0d48f02ce 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2714,96 +2714,6 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 	}
 }
 
-/*
- * Split off the first pre bytes from the extent_map at [start, start + len]
- *
- * This function is intended to be used only for extract_ordered_extent().
- */
-static int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre)
-{
-	struct extent_map_tree *em_tree = &inode->extent_tree;
-	struct extent_map *em;
-	struct extent_map *split_pre = NULL;
-	struct extent_map *split_mid = NULL;
-	int ret = 0;
-	unsigned long flags;
-
-	ASSERT(pre != 0);
-	ASSERT(pre < len);
-
-	split_pre = alloc_extent_map();
-	if (!split_pre)
-		return -ENOMEM;
-	split_mid = alloc_extent_map();
-	if (!split_mid) {
-		ret = -ENOMEM;
-		goto out_free_pre;
-	}
-
-	lock_extent(&inode->io_tree, start, start + len - 1, NULL);
-	write_lock(&em_tree->lock);
-	em = lookup_extent_mapping(em_tree, start, len);
-	if (!em) {
-		ret = -EIO;
-		goto out_unlock;
-	}
-
-	ASSERT(em->len == len);
-	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
-	ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
-	ASSERT(test_bit(EXTENT_FLAG_PINNED, &em->flags));
-	ASSERT(!test_bit(EXTENT_FLAG_LOGGING, &em->flags));
-	ASSERT(!list_empty(&em->list));
-
-	flags = em->flags;
-	clear_bit(EXTENT_FLAG_PINNED, &em->flags);
-
-	/* First, replace the em with a new extent_map starting from * em->start */
-	split_pre->start = em->start;
-	split_pre->len = pre;
-	split_pre->orig_start = split_pre->start;
-	split_pre->block_start = em->block_start;
-	split_pre->block_len = split_pre->len;
-	split_pre->orig_block_len = split_pre->block_len;
-	split_pre->ram_bytes = split_pre->len;
-	split_pre->flags = flags;
-	split_pre->compress_type = em->compress_type;
-	split_pre->generation = em->generation;
-
-	replace_extent_mapping(em_tree, em, split_pre, 1);
-
-	/*
-	 * Now we only have an extent_map at:
-	 *     [em->start, em->start + pre]
-	 */
-
-	/* Insert the middle extent_map. */
-	split_mid->start = em->start + pre;
-	split_mid->len = em->len - pre;
-	split_mid->orig_start = split_mid->start;
-	split_mid->block_start = em->block_start + pre;
-	split_mid->block_len = split_mid->len;
-	split_mid->orig_block_len = split_mid->block_len;
-	split_mid->ram_bytes = split_mid->len;
-	split_mid->flags = flags;
-	split_mid->compress_type = em->compress_type;
-	split_mid->generation = em->generation;
-	add_extent_mapping(em_tree, split_mid, 1);
-
-	/* Once for us */
-	free_extent_map(em);
-	/* Once for the tree */
-	free_extent_map(em);
-
-out_unlock:
-	write_unlock(&em_tree->lock);
-	unlock_extent(&inode->io_tree, start, start + len - 1, NULL);
-	free_extent_map(split_mid);
-out_free_pre:
-	free_extent_map(split_pre);
-	return ret;
-}
-
 int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
 				 struct btrfs_ordered_extent *ordered)
 {
-- 
2.39.2



* [PATCH 07/14] btrfs: reorder btrfs_extract_ordered_extent
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (5 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 06/14] btrfs: move split_extent_map to extent_map.c Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-24 15:03 ` [PATCH 08/14] btrfs: return the new ordered_extent from btrfs_split_ordered_extent Christoph Hellwig
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

There is no good reason for doing one before the other in terms of
failure implications, but doing the extent_map split first will
simplify some upcoming refactoring.
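
For illustration only (not part of the patch): a userspace sketch of the
reordered control flow in btrfs_extract_ordered_extent, with stub
counters standing in for the real split helpers:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative userspace sketch of btrfs_extract_ordered_extent after
 * this patch: the extent_map split (skipped for NOCOW writes) now
 * happens before the ordered extent split.  The stubs and counters
 * below are stand-ins, not btrfs code.
 */
static int map_splits, ordered_splits;

static int split_extent_map_stub(void) { map_splits++; return 0; }
static int split_ordered_stub(void) { ordered_splits++; return 0; }

static int extract_ordered_extent(bool nocow, unsigned long long len,
				  unsigned long long disk_num_bytes)
{
	int ret;

	/* Nothing to do if the bio covers the whole ordered extent. */
	if (disk_num_bytes == len)
		return 0;

	/*
	 * Don't split the extent_map for NOCOW extents, as we're
	 * writing into a pre-existing one.
	 */
	if (!nocow) {
		ret = split_extent_map_stub();
		if (ret)
			return ret;
	}
	return split_ordered_stub();
}
```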

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/inode.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 781bd0d48f02ce..5de029ef0b399c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2719,9 +2719,7 @@ int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
 {
 	u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 len = bbio->bio.bi_iter.bi_size;
-	struct btrfs_inode *inode = bbio->inode;
-	u64 ordered_len = ordered->num_bytes;
-	int ret = 0;
+	int ret;
 
 	/* Must always be called for the beginning of an ordered extent. */
 	if (WARN_ON_ONCE(start != ordered->disk_bytenr))
@@ -2731,18 +2729,18 @@ int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
 	if (ordered->disk_num_bytes == len)
 		return 0;
 
-	ret = btrfs_split_ordered_extent(ordered, len);
-	if (ret)
-		return ret;
-
 	/*
 	 * Don't split the extent_map for NOCOW extents, as we're writing into
 	 * a pre-existing one.
 	 */
-	if (test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
-		return 0;
+	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) {
+		ret = split_extent_map(bbio->inode, bbio->file_offset,
+				       ordered->num_bytes, len);
+		if (ret)
+			return ret;
+	}
 
-	return split_extent_map(inode, bbio->file_offset, ordered_len, len);
+	return btrfs_split_ordered_extent(ordered, len);
 }
 
 /*
-- 
2.39.2


* [PATCH 08/14] btrfs: return the new ordered_extent from btrfs_split_ordered_extent
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (6 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 07/14] btrfs: reorder btrfs_extract_ordered_extent Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-24 15:03 ` [PATCH 09/14] btrfs: return void from btrfs_finish_ordered_io Christoph Hellwig
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

Return the new ordered_extent split out of the passed-in one.  This
will be needed to be able to store an ordered_extent in the btrfs_bio.
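
As an aside for readers less familiar with the convention: the new
return type uses the kernel's ERR_PTR encoding, so a single pointer
carries either the split extent or a negative errno.  A self-contained
userspace model (the helpers and the split() function below are
re-implementations for illustration, not the kernel's headers):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Userspace model of the kernel's ERR_PTR convention: small negative
 * errno values are encoded in the top of the pointer range, so a
 * single return value carries either a valid object or an error.
 */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct ordered_extent { unsigned long long len; };

/* Hypothetical splitter: reject invalid lengths via ERR_PTR. */
static struct ordered_extent *split(unsigned long long total,
				    unsigned long long len)
{
	struct ordered_extent *new;

	if (len == 0 || len >= total)
		return ERR_PTR(-EINVAL);
	new = malloc(sizeof(*new));
	if (!new)
		return ERR_PTR(-ENOMEM);
	new->len = len;
	return new;
}
```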

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/inode.c        |  7 ++++++-
 fs/btrfs/ordered-data.c | 19 ++++++++++---------
 fs/btrfs/ordered-data.h |  3 ++-
 3 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5de029ef0b399c..31124954ec6ecf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2719,6 +2719,7 @@ int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
 {
 	u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 len = bbio->bio.bi_iter.bi_size;
+	struct btrfs_ordered_extent *new;
 	int ret;
 
 	/* Must always be called for the beginning of an ordered extent. */
@@ -2740,7 +2741,11 @@ int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
 			return ret;
 	}
 
-	return btrfs_split_ordered_extent(ordered, len);
+	new = btrfs_split_ordered_extent(ordered, len);
+	if (IS_ERR(new))
+		return PTR_ERR(new);
+	btrfs_put_ordered_extent(new);
+	return 0;
 }
 
 /*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 324a5a8c844a72..997d79046d9b62 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1116,7 +1116,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 }
 
 /* Split out a new ordered extent for this first @len bytes of @ordered. */
-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
+struct btrfs_ordered_extent *
+btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 {
 	struct inode *inode = ordered->inode;
 	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
@@ -1135,16 +1136,16 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 	 * reduce the original extent to a zero length either.
 	 */
 	if (WARN_ON_ONCE(len >= ordered->num_bytes))
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 	/* We cannot split once ordered extent is past end_bio. */
 	if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes))
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 	/* We cannot split a compressed ordered extent. */
 	if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes))
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 	/* Checksum list should be empty. */
 	if (WARN_ON_ONCE(!list_empty(&ordered->list)))
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 
 	spin_lock_irq(&tree->lock);
 	/* Remove from tree once */
@@ -1171,13 +1172,13 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 
 	/*
 	 * The splitting extent is already counted and will be added again in
-	 * btrfs_add_ordered_extent(). Subtract len to avoid double counting.
+	 * btrfs_alloc_ordered_extent(). Subtract len to avoid double counting.
 	 */
 	percpu_counter_add_batch(&fs_info->ordered_bytes, -len, fs_info->delalloc_batch);
 
-	return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len,
-					disk_bytenr, len, 0, flags,
-					ordered->compress_type);
+	return btrfs_alloc_ordered_extent(BTRFS_I(inode), file_offset, len, len,
+					  disk_bytenr, len, 0, flags,
+					  ordered->compress_type);
 }
 
 int __init ordered_data_init(void)
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index dc700aa515b58b..2a20017d9ec6f5 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -206,7 +206,8 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 					struct extent_state **cached_state);
 bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct extent_state **cached_state);
-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len);
+struct btrfs_ordered_extent *
+btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
-- 
2.39.2


* [PATCH 09/14] btrfs: return void from btrfs_finish_ordered_io
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (7 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 08/14] btrfs: return the new ordered_extent from btrfs_split_ordered_extent Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25 11:02   ` Johannes Thumshirn
  2023-05-30 15:44   ` David Sterba
  2023-05-24 15:03 ` [PATCH 10/14] btrfs: split btrfs_alloc_ordered_extent Christoph Hellwig
                   ` (6 subsequent siblings)
  15 siblings, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

The callers don't check the btrfs_finish_ordered_io return value, so
drop it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c        | 4 +---
 fs/btrfs/ordered-data.h | 2 +-
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 31124954ec6ecf..cee71eaec7cff9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3180,7 +3180,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
  * an ordered extent if the range of bytes in the file it covers are
  * fully written.
  */
-int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
+void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 {
 	struct btrfs_inode *inode = BTRFS_I(ordered_extent->inode);
 	struct btrfs_root *root = inode->root;
@@ -3383,8 +3383,6 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	btrfs_put_ordered_extent(ordered_extent);
 	/* once for the tree */
 	btrfs_put_ordered_extent(ordered_extent);
-
-	return ret;
 }
 
 void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 2a20017d9ec6f5..2c6efebd043c04 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -161,7 +161,7 @@ btrfs_ordered_inode_tree_init(struct btrfs_ordered_inode_tree *t)
 	t->last = NULL;
 }
 
-int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
+void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
 
 void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry);
 void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
-- 
2.39.2


* [PATCH 10/14] btrfs: split btrfs_alloc_ordered_extent
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (8 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 09/14] btrfs: return void from btrfs_finish_ordered_io Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25 12:09   ` Johannes Thumshirn
  2023-05-24 15:03 ` [PATCH 11/14] btrfs: atomically insert the new extent in btrfs_split_ordered_extent Christoph Hellwig
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

Split two low-level helpers out of btrfs_alloc_ordered_extent to
allocate and insert the ordered extent.  The pure allocation helper
will be used to improve btrfs_split_ordered_extent.
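
The shape of the refactoring, reduced to a toy (the names and the
counter "tree" are illustrative, not btrfs code): a pure allocation
helper that touches no shared state, an insertion helper that publishes
the entry and takes the tree's reference, and a public constructor
composing the two:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy version of the pattern in this patch.  alloc_entry() never
 * touches shared state, insert_entry() publishes the object, and the
 * public constructor composes them, preserving the old behavior.
 */
struct entry { int refs; unsigned long long file_offset; };

static int tree_entries;

static struct entry *alloc_entry(unsigned long long file_offset)
{
	struct entry *e = malloc(sizeof(*e));

	if (!e)
		return NULL;
	e->file_offset = file_offset;
	e->refs = 1;		/* ref for the caller */
	return e;
}

static void insert_entry(struct entry *e)
{
	e->refs++;		/* one ref for the tree */
	tree_entries++;
}

/* Public constructor: allocate, then publish, as before the split. */
static struct entry *alloc_and_insert(unsigned long long file_offset)
{
	struct entry *e = alloc_entry(file_offset);

	if (e)
		insert_entry(e);
	return e;
}
```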

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ordered-data.c | 114 +++++++++++++++++++++++-----------------
 1 file changed, 65 insertions(+), 49 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 997d79046d9b62..54783f67f479ad 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -146,35 +146,11 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
 	return ret;
 }
 
-/*
- * Add an ordered extent to the per-inode tree.
- *
- * @inode:           Inode that this extent is for.
- * @file_offset:     Logical offset in file where the extent starts.
- * @num_bytes:       Logical length of extent in file.
- * @ram_bytes:       Full length of unencoded data.
- * @disk_bytenr:     Offset of extent on disk.
- * @disk_num_bytes:  Size of extent on disk.
- * @offset:          Offset into unencoded data where file data starts.
- * @flags:           Flags specifying type of extent (1 << BTRFS_ORDERED_*).
- * @compress_type:   Compression algorithm used for data.
- *
- * Most of these parameters correspond to &struct btrfs_file_extent_item. The
- * tree is given a single reference on the ordered extent that was inserted, and
- * the returned pointer is given a second reference.
- *
- * Return: the new ordered extent or error pointer.
- */
-struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
-			struct btrfs_inode *inode, u64 file_offset,
-			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-			u64 disk_num_bytes, u64 offset, unsigned long flags,
-			int compress_type)
+static struct btrfs_ordered_extent *
+alloc_ordered_extent(struct btrfs_inode *inode, u64 file_offset, u64 num_bytes,
+		     u64 ram_bytes, u64 disk_bytenr, u64 disk_num_bytes,
+		     u64 offset, unsigned long flags, int compress_type)
 {
-	struct btrfs_root *root = inode->root;
-	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
-	struct rb_node *node;
 	struct btrfs_ordered_extent *entry;
 	int ret;
 
@@ -184,7 +160,6 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
 		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
 		if (ret < 0)
 			return ERR_PTR(ret);
-		ret = 0;
 	} else {
 		/*
 		 * The ordered extent has reserved qgroup space, release now
@@ -209,14 +184,7 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
 	entry->qgroup_rsv = ret;
-
-	ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
 	entry->flags = flags;
-
-	percpu_counter_add_batch(&fs_info->ordered_bytes, num_bytes,
-				 fs_info->delalloc_batch);
-
-	/* one ref for the tree */
 	refcount_set(&entry->refs, 1);
 	init_waitqueue_head(&entry->wait);
 	INIT_LIST_HEAD(&entry->list);
@@ -225,15 +193,40 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
 	INIT_LIST_HEAD(&entry->work_list);
 	init_completion(&entry->completion);
 
+	/*
+	 * We don't need the count_max_extents here, we can assume that all of
+	 * that work has been done at higher layers, so this is truly the
+	 * smallest the extent is going to get.
+	 */
+	spin_lock(&inode->lock);
+	btrfs_mod_outstanding_extents(inode, 1);
+	spin_unlock(&inode->lock);
+
+	return entry;
+}
+
+static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
+{
+	struct btrfs_inode *inode = BTRFS_I(entry->inode);
+	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct rb_node *node;
+
 	trace_btrfs_ordered_extent_add(inode, entry);
 
+	percpu_counter_add_batch(&fs_info->ordered_bytes, entry->num_bytes,
+				 fs_info->delalloc_batch);
+
+	/* one ref for the tree */
+	refcount_inc(&entry->refs);
+
 	spin_lock_irq(&tree->lock);
-	node = tree_insert(&tree->tree, file_offset,
-			   &entry->rb_node);
+	node = tree_insert(&tree->tree, entry->file_offset, &entry->rb_node);
 	if (node)
 		btrfs_panic(fs_info, -EEXIST,
 				"inconsistency in ordered tree at offset %llu",
-				file_offset);
+				entry->file_offset);
 	spin_unlock_irq(&tree->lock);
 
 	spin_lock(&root->ordered_extent_lock);
@@ -247,19 +240,42 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
 		spin_unlock(&fs_info->ordered_root_lock);
 	}
 	spin_unlock(&root->ordered_extent_lock);
+}
 
-	/*
-	 * We don't need the count_max_extents here, we can assume that all of
-	 * that work has been done at higher layers, so this is truly the
-	 * smallest the extent is going to get.
-	 */
-	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, 1);
-	spin_unlock(&inode->lock);
+/*
+ * Add an ordered extent to the per-inode tree.
+ *
+ * @inode:           Inode that this extent is for.
+ * @file_offset:     Logical offset in file where the extent starts.
+ * @num_bytes:       Logical length of extent in file.
+ * @ram_bytes:       Full length of unencoded data.
+ * @disk_bytenr:     Offset of extent on disk.
+ * @disk_num_bytes:  Size of extent on disk.
+ * @offset:          Offset into unencoded data where file data starts.
+ * @flags:           Flags specifying type of extent (1 << BTRFS_ORDERED_*).
+ * @compress_type:   Compression algorithm used for data.
+ *
+ * Most of these parameters correspond to &struct btrfs_file_extent_item. The
+ * tree is given a single reference on the ordered extent that was inserted, and
+ * the returned pointer is given a second reference.
+ *
+ * Return: the new ordered extent or error pointer.
+ */
+struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
+			struct btrfs_inode *inode, u64 file_offset,
+			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			u64 disk_num_bytes, u64 offset, unsigned long flags,
+			int compress_type)
+{
+	struct btrfs_ordered_extent *entry;
 
-	/* One ref for the returned entry to match semantics of lookup. */
-	refcount_inc(&entry->refs);
+	ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
 
+	entry = alloc_ordered_extent(inode, file_offset, num_bytes, ram_bytes,
+				     disk_bytenr, disk_num_bytes, offset, flags,
+				     compress_type);
+	if (!IS_ERR(entry))
+		insert_ordered_extent(entry);
 	return entry;
 }
 
-- 
2.39.2


* [PATCH 11/14] btrfs: atomically insert the new extent in btrfs_split_ordered_extent
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (9 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 10/14] btrfs: split btrfs_alloc_ordered_extent Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25 12:30   ` Johannes Thumshirn
  2023-05-24 15:03 ` [PATCH 12/14] btrfs: handle completed ordered extents " Christoph Hellwig
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

Currently there is a small race window in btrfs_split_ordered_extent,
where the reduced old extent can be looked up on the per-inode rbtree
or the per-root list while the newly split out one isn't visible yet.

Fix this by open coding btrfs_alloc_ordered_extent in
btrfs_split_ordered_extent, and holding the tree lock and
root->ordered_extent_lock over the entire tree and extent manipulation.

Note that this introduces new lock ordering because previously
ordered_extent_lock was never held over the tree lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ordered-data.c | 43 ++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 54783f67f479ad..bf0a0d67306649 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1135,15 +1135,17 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 struct btrfs_ordered_extent *
 btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 {
-	struct inode *inode = ordered->inode;
-	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_inode *inode = BTRFS_I(ordered->inode);
+	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 file_offset = ordered->file_offset;
 	u64 disk_bytenr = ordered->disk_bytenr;
 	unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS;
+	struct btrfs_ordered_extent *new;
 	struct rb_node *node;
 
-	trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered);
+	trace_btrfs_ordered_extent_split(inode, ordered);
 
 	ASSERT(!(flags & (1U << BTRFS_ORDERED_COMPRESSED)));
 
@@ -1163,7 +1165,16 @@ btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 	if (WARN_ON_ONCE(!list_empty(&ordered->list)))
 		return ERR_PTR(-EINVAL);
 
-	spin_lock_irq(&tree->lock);
+	new = alloc_ordered_extent(inode, file_offset, len, len, disk_bytenr,
+				   len, 0, flags, ordered->compress_type);
+	if (IS_ERR(new))
+		return new;
+
+	/* one ref for the tree */
+	refcount_inc(&new->refs);
+
+	spin_lock_irq(&root->ordered_extent_lock);
+	spin_lock(&tree->lock);
 	/* Remove from tree once */
 	node = &ordered->rb_node;
 	rb_erase(node, &tree->tree);
@@ -1182,19 +1193,19 @@ btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 	if (node)
 		btrfs_panic(fs_info, -EEXIST,
 			"zoned: inconsistency in ordered tree at offset %llu",
-			    ordered->file_offset);
+			ordered->file_offset);
 
-	spin_unlock_irq(&tree->lock);
-
-	/*
-	 * The splitting extent is already counted and will be added again in
-	 * btrfs_alloc_ordered_extent(). Subtract len to avoid double counting.
-	 */
-	percpu_counter_add_batch(&fs_info->ordered_bytes, -len, fs_info->delalloc_batch);
+	node = tree_insert(&tree->tree, new->file_offset, &new->rb_node);
+	if (node)
+		btrfs_panic(fs_info, -EEXIST,
+			"zoned: inconsistency in ordered tree at offset %llu",
+			new->file_offset);
+	spin_unlock(&tree->lock);
 
-	return btrfs_alloc_ordered_extent(BTRFS_I(inode), file_offset, len, len,
-					  disk_bytenr, len, 0, flags,
-					  ordered->compress_type);
+	list_add_tail(&new->root_extent_list, &root->ordered_extents);
+	root->nr_ordered_extents++;
+	spin_unlock_irq(&root->ordered_extent_lock);
+	return new;
 }
 
 int __init ordered_data_init(void)
-- 
2.39.2


* [PATCH 12/14] btrfs: handle completed ordered extents in btrfs_split_ordered_extent
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (10 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 11/14] btrfs: atomically insert the new extent in btrfs_split_ordered_extent Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25 13:06   ` Johannes Thumshirn
  2023-05-24 15:03 ` [PATCH 13/14] btrfs: defer splitting of ordered extents until I/O completion Christoph Hellwig
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

To delay splitting ordered_extents until I/O completion time we need to
be able to handle fully completed ordered extents in
btrfs_split_ordered_extent.  Besides a bit of accounting this primarily
involves moving the csums over to the split ordered extent for the
range that it covers, which is simple enough because we always have one
btrfs_ordered_sum per bio.
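
The csum handover can be modeled in isolation.  A minimal sketch,
assuming the sums exactly tile the extent as they do with one
btrfs_ordered_sum per bio (the hand-rolled intrusive list below stands
in for list_head and list_move_tail):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the checksum handover in the new btrfs_split_ordered_extent:
 * each sum covers exactly one bio, so moving the sums for the first
 * @len bytes is a walk that stops once the accumulated length reaches
 * @len, preserving order (tail insertion).
 */
struct sum {
	unsigned long long len;
	struct sum *next;
};

/* Move sums covering the first @len bytes from @src to @dst. */
static void move_sums(struct sum **src, struct sum **dst,
		      unsigned long long len)
{
	unsigned long long offset = 0;
	struct sum **tail = dst;

	while (*src && offset < len) {
		struct sum *s = *src;

		*src = s->next;
		s->next = NULL;
		*tail = s;
		tail = &s->next;
		offset += s->len;
	}
}
```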

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ordered-data.c | 40 ++++++++++++++++++++++++++++++++--------
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index bf0a0d67306649..45364b75a9053f 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1141,9 +1141,11 @@ btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 file_offset = ordered->file_offset;
 	u64 disk_bytenr = ordered->disk_bytenr;
-	unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS;
+	unsigned long flags = ordered->flags;
+	struct btrfs_ordered_sum *sum, *n;
 	struct btrfs_ordered_extent *new;
 	struct rb_node *node;
+	u64 offset = 0;
 
 	trace_btrfs_ordered_extent_split(inode, ordered);
 
@@ -1155,15 +1157,15 @@ btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 	 */
 	if (WARN_ON_ONCE(len >= ordered->num_bytes))
 		return ERR_PTR(-EINVAL);
-	/* We cannot split once ordered extent is past end_bio. */
-	if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes))
-		return ERR_PTR(-EINVAL);
+	/* We cannot split partially completed ordered extents. */
+	if (ordered->bytes_left) {
+		ASSERT(!(flags & ~BTRFS_ORDERED_TYPE_FLAGS));
+		if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes))
+			return ERR_PTR(-EINVAL);
+	}
 	/* We cannot split a compressed ordered extent. */
 	if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes))
 		return ERR_PTR(-EINVAL);
-	/* Checksum list should be empty. */
-	if (WARN_ON_ONCE(!list_empty(&ordered->list)))
-		return ERR_PTR(-EINVAL);
 
 	new = alloc_ordered_extent(inode, file_offset, len, len, disk_bytenr,
 				   len, 0, flags, ordered->compress_type);
@@ -1186,7 +1188,29 @@ btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 	ordered->disk_bytenr += len;
 	ordered->num_bytes -= len;
 	ordered->disk_num_bytes -= len;
-	ordered->bytes_left -= len;
+
+	if (test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags)) {
+		ASSERT(ordered->bytes_left == 0);
+		new->bytes_left = 0;
+	} else {
+		ordered->bytes_left -= len;
+	}
+
+	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags)) {
+		if (ordered->truncated_len > len) {
+			ordered->truncated_len -= len;
+		} else {
+			new->truncated_len = ordered->truncated_len;
+			ordered->truncated_len = 0;
+		}
+	}
+
+	list_for_each_entry_safe(sum, n, &ordered->list, list) {
+		if (offset == len)
+			break;
+		list_move_tail(&sum->list, &new->list);
+		offset += sum->len;
+	}
 
 	/* Re-insert the node */
 	node = tree_insert(&tree->tree, ordered->file_offset, &ordered->rb_node);
-- 
2.39.2


* [PATCH 13/14] btrfs: defer splitting of ordered extents until I/O completion
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (11 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 12/14] btrfs: handle completed ordered extents " Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25 16:25   ` Johannes Thumshirn
  2023-05-30 18:40   ` David Sterba
  2023-05-24 15:03 ` [PATCH 14/14] btrfs: pass the new logical address to split_extent_map Christoph Hellwig
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

The btrfs zoned completion code currently needs an ordered_extent and
extent_map per bio so that it can account for the non-predictable
write location from Zone Append.  To achieve that it currently splits
the ordered_extent and extent_map at I/O submission time, and then
records the actual physical address in the ->physical field of the
ordered_extent.

This patch instead switches to recording the "original" physical address
that the btrfs allocator assigned in spare space in the btrfs_bio,
and then rewrites the logical address in the btrfs_ordered_sum
structure at I/O completion time.  This allows the ordered extent
completion handler to simply walk the list of ordered csums and
split the ordered extent as needed.  This removes an extra ordered
extent and extent_map lookup and manipulation during the I/O
submission path, and instead batches it in the I/O completion path
where we need to touch these anyway.
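
The completion-time walk reduces to coalescing the per-bio
(logical, len) records into physically contiguous runs and splitting at
each discontinuity.  A userspace sketch of just that walk (array-based
and illustrative only, assuming at least one sum):

```c
#include <assert.h>

/*
 * Sketch of the walk in the new btrfs_finish_ordered_zoned: Zone
 * Append may scatter the bios of one ordered extent, so the recorded
 * per-bio (logical, len) pairs are coalesced into physically
 * contiguous runs; each discontinuity closes a run (which the real
 * code splits off as its own ordered extent).  Returns the run count.
 */
struct sum { unsigned long long logical, len; };

static int count_runs(const struct sum *sums, int nr,
		      unsigned long long *run_logical,
		      unsigned long long *run_len)
{
	unsigned long long logical = sums[0].logical;
	unsigned long long len = sums[0].len;
	int i, runs = 0;

	for (i = 1; i < nr; i++) {
		if (sums[i].logical == logical + len) {
			len += sums[i].len;	/* still contiguous */
			continue;
		}
		run_logical[runs] = logical;	/* discontinuity: close run */
		run_len[runs++] = len;
		logical = sums[i].logical;
		len = sums[i].len;
	}
	run_logical[runs] = logical;		/* final run */
	run_len[runs++] = len;
	return runs;
}
```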

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/bio.c          | 17 ------------
 fs/btrfs/btrfs_inode.h  |  2 --
 fs/btrfs/inode.c        | 18 ++++++++-----
 fs/btrfs/ordered-data.h |  1 +
 fs/btrfs/zoned.c        | 57 ++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/zoned.h        |  6 ++---
 6 files changed, 65 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 8a4d3b707dd1b2..ae6345668d2d01 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -61,20 +61,6 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
 	return bbio;
 }
 
-static blk_status_t btrfs_bio_extract_ordered_extent(struct btrfs_bio *bbio)
-{
-	struct btrfs_ordered_extent *ordered;
-	int ret;
-
-	ordered = btrfs_lookup_ordered_extent(bbio->inode, bbio->file_offset);
-	if (WARN_ON_ONCE(!ordered))
-		return BLK_STS_IOERR;
-	ret = btrfs_extract_ordered_extent(bbio, ordered);
-	btrfs_put_ordered_extent(ordered);
-
-	return errno_to_blk_status(ret);
-}
-
 static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
 					 struct btrfs_bio *orig_bbio,
 					 u64 map_length, bool use_append)
@@ -667,9 +653,6 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
 		if (use_append) {
 			bio->bi_opf &= ~REQ_OP_WRITE;
 			bio->bi_opf |= REQ_OP_ZONE_APPEND;
-			ret = btrfs_bio_extract_ordered_extent(bbio);
-			if (ret)
-				goto fail_put_bio;
 		}
 
 		/*
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 08c99602339408..8abf96cfea8fae 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -410,8 +410,6 @@ static inline bool btrfs_inode_can_compress(const struct btrfs_inode *inode)
 
 int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
 			    u32 pgoff, u8 *csum, const u8 * const csum_expected);
-int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
-				 struct btrfs_ordered_extent *ordered);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cee71eaec7cff9..eee4eefb279780 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2714,8 +2714,8 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 	}
 }
 
-int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
-				 struct btrfs_ordered_extent *ordered)
+static int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
+					struct btrfs_ordered_extent *ordered)
 {
 	u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 len = bbio->bio.bi_iter.bi_size;
@@ -3180,7 +3180,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
  * an ordered extent if the range of bytes in the file it covers are
  * fully written.
  */
-void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
+void btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 {
 	struct btrfs_inode *inode = BTRFS_I(ordered_extent->inode);
 	struct btrfs_root *root = inode->root;
@@ -3215,11 +3215,9 @@ void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
-	if (btrfs_is_zoned(fs_info)) {
-		btrfs_rewrite_logical_zoned(ordered_extent);
+	if (btrfs_is_zoned(fs_info))
 		btrfs_zone_finish_endio(fs_info, ordered_extent->disk_bytenr,
 					ordered_extent->disk_num_bytes);
-	}
 
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
@@ -3385,6 +3383,14 @@ void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	btrfs_put_ordered_extent(ordered_extent);
 }
 
+void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
+{
+	if (btrfs_is_zoned(btrfs_sb(ordered->inode->i_sb)) &&
+	    !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags))
+		btrfs_finish_ordered_zoned(ordered);
+	btrfs_finish_one_ordered(ordered);
+}
+
 void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 					  struct page *page, u64 start,
 					  u64 end, bool uptodate)
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 2c6efebd043c04..6d1de157792741 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -161,6 +161,7 @@ btrfs_ordered_inode_tree_init(struct btrfs_ordered_inode_tree *t)
 	t->last = NULL;
 }
 
+void btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent);
 void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
 
 void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index e838c2037634c2..92363fafc3a648 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -15,6 +15,7 @@
 #include "transaction.h"
 #include "dev-replace.h"
 #include "space-info.h"
+#include "super.h"
 #include "fs.h"
 #include "accessors.h"
 #include "bio.h"
@@ -1665,16 +1666,11 @@ void btrfs_record_physical_zoned(struct btrfs_bio *bbio)
 		sum->logical += physical - bbio->orig_physical;
 }
 
-void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
+static void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered,
+					u64 logical)
 {
 	struct extent_map_tree *em_tree = &BTRFS_I(ordered->inode)->extent_tree;
 	struct extent_map *em;
-	struct btrfs_ordered_sum *sum =
-		list_first_entry(&ordered->list, typeof(*sum), list);
-	u64 logical = sum->logical;
-
-	if (ordered->disk_bytenr == logical)
-		return;
 
 	ordered->disk_bytenr = logical;
 
@@ -1686,6 +1682,53 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
 	write_unlock(&em_tree->lock);
 }
 
+static bool btrfs_zoned_split_ordered(struct btrfs_ordered_extent *ordered,
+				      u64 logical, u64 len)
+{
+	struct btrfs_ordered_extent *new;
+
+	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags) &&
+	    split_extent_map(BTRFS_I(ordered->inode), ordered->file_offset,
+			     ordered->num_bytes, len))
+		return false;
+
+	new = btrfs_split_ordered_extent(ordered, len);
+	if (IS_ERR(new))
+		return false;
+
+	if (new->disk_bytenr != logical)
+		btrfs_rewrite_logical_zoned(new, logical);
+	btrfs_finish_one_ordered(new);
+	return true;
+}
+
+void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(ordered->inode->i_sb);
+	struct btrfs_ordered_sum *sum =
+		list_first_entry(&ordered->list, typeof(*sum), list);
+	u64 logical = sum->logical;
+	u64 len = sum->len;
+
+	while (len < ordered->disk_num_bytes) {
+		sum = list_next_entry(sum, list);
+		if (sum->logical == logical + len) {
+			len += sum->len;
+			continue;
+		}
+		if (!btrfs_zoned_split_ordered(ordered, logical, len)) {
+			set_bit(BTRFS_ORDERED_IOERR, &ordered->flags);
+		btrfs_err(fs_info, "failed to split ordered extent");
+			return;
+		}
+		logical = sum->logical;
+		len = sum->len;
+	}
+
+	if (ordered->disk_bytenr != logical)
+		btrfs_rewrite_logical_zoned(ordered, logical);
+}
+
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct extent_buffer *eb,
 				    struct btrfs_block_group **cache_ret)
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 3058ef559c9813..27322b926038c2 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -30,6 +30,8 @@ struct btrfs_zoned_device_info {
 	struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX];
 };
 
+void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered);
+
 #ifdef CONFIG_BLK_DEV_ZONED
 int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone);
@@ -56,7 +58,6 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 bool btrfs_use_zone_append(struct btrfs_bio *bbio);
 void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
-void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered);
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct extent_buffer *eb,
 				    struct btrfs_block_group **cache_ret);
@@ -188,9 +189,6 @@ static inline void btrfs_record_physical_zoned(struct btrfs_bio *bbio)
 {
 }
 
-static inline void btrfs_rewrite_logical_zoned(
-				struct btrfs_ordered_extent *ordered) { }
-
 static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 			       struct extent_buffer *eb,
 			       struct btrfs_block_group **cache_ret)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 14/14] btrfs: pass the new logical address to split_extent_map
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (12 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 13/14] btrfs: defer splitting of ordered extents until I/O completion Christoph Hellwig
@ 2023-05-24 15:03 ` Christoph Hellwig
  2023-05-25 16:28   ` Johannes Thumshirn
  2023-05-30 13:21 ` don't split ordered_extents for zoned writes at I/O submission time Johannes Thumshirn
  2023-05-30 18:48 ` David Sterba
  15 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-24 15:03 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs

split_extent_map splits off the first chunk of an extent map into a new
one.  One of the two users is the zoned I/O completion code that wants to
rewrite the logical block start address right after this split.  Pass in
the logical address to be set in the split off first extent_map as an
argument to avoid an extra extent tree lookup for this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_map.c | 8 +++++---
 fs/btrfs/extent_map.h | 3 ++-
 fs/btrfs/inode.c      | 3 ++-
 fs/btrfs/zoned.c      | 6 ++----
 4 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 21cc4991360221..b7373fb03898f2 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -961,11 +961,13 @@ int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
 }
 
 /*
- * Split off the first pre bytes from the extent_map at [start, start + len]
+ * Split off the first pre bytes from the extent_map at [start, start + len],
+ * and set the block_addr for it to new_logical.
  *
  * This function is used when an ordered_extent needs to be split.
  */
-int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre)
+int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
+		     u64 new_logical)
 {
 	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
@@ -1008,7 +1010,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre)
 	split_pre->start = em->start;
 	split_pre->len = pre;
 	split_pre->orig_start = split_pre->start;
-	split_pre->block_start = em->block_start;
+	split_pre->block_start = new_logical;
 	split_pre->block_len = split_pre->len;
 	split_pre->orig_block_len = split_pre->block_len;
 	split_pre->ram_bytes = split_pre->len;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 7df39112388dde..35d27c756e0808 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -90,7 +90,8 @@ struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
 int add_extent_mapping(struct extent_map_tree *tree,
 		       struct extent_map *em, int modified);
 void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em);
-int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre);
+int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
+		     u64 new_logical);
 
 struct extent_map *alloc_extent_map(void);
 void free_extent_map(struct extent_map *em);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eee4eefb279780..5e2315d78ea0cc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2736,7 +2736,8 @@ static int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
 	 */
 	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) {
 		ret = split_extent_map(bbio->inode, bbio->file_offset,
-				       ordered->num_bytes, len);
+				       ordered->num_bytes, len,
+				       ordered->disk_bytenr);
 		if (ret)
 			return ret;
 	}
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 92363fafc3a648..bc437259d2a06a 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1689,15 +1689,13 @@ static bool btrfs_zoned_split_ordered(struct btrfs_ordered_extent *ordered,
 
 	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags) &&
 	    split_extent_map(BTRFS_I(ordered->inode), ordered->file_offset,
-			     ordered->num_bytes, len))
+			     ordered->num_bytes, len, logical))
 		return false;
 
 	new = btrfs_split_ordered_extent(ordered, len);
 	if (IS_ERR(new))
 		return false;
-
-	if (new->disk_bytenr != logical)
-		btrfs_rewrite_logical_zoned(new, logical);
+	new->disk_bytenr = logical;
 	btrfs_finish_one_ordered(new);
 	return true;
 }
-- 
2.39.2



* Re: [PATCH 01/14] btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED
  2023-05-24 15:03 ` [PATCH 01/14] btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED Christoph Hellwig
@ 2023-05-25  8:33   ` Johannes Thumshirn
  0 siblings, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25  8:33 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 02/14] btrfs: don't call btrfs_record_physical_zoned for failed append
  2023-05-24 15:03 ` [PATCH 02/14] btrfs: don't call btrfs_record_physical_zoned for failed append Christoph Hellwig
@ 2023-05-25  8:33   ` Johannes Thumshirn
  0 siblings, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25  8:33 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 03/14] btrfs: mark the len field in struct btrfs_ordered_sum as unsigned
  2023-05-24 15:03 ` [PATCH 03/14] btrfs: mark the len field in struct btrfs_ordered_sum as unsigned Christoph Hellwig
@ 2023-05-25  8:33   ` Johannes Thumshirn
  2023-05-30 16:45   ` David Sterba
  1 sibling, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25  8:33 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 04/14] btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical
  2023-05-24 15:03 ` [PATCH 04/14] btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical Christoph Hellwig
@ 2023-05-25  8:33   ` Johannes Thumshirn
  0 siblings, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25  8:33 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 05/14] btrfs: optimize the logical to physical mapping for zoned writes
  2023-05-24 15:03 ` [PATCH 05/14] btrfs: optimize the logical to physical mapping for zoned writes Christoph Hellwig
@ 2023-05-25 10:56   ` Johannes Thumshirn
  2023-08-18 14:03   ` Naohiro Aota
  1 sibling, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25 10:56 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 06/14] btrfs: move split_extent_map to extent_map.c
  2023-05-24 15:03 ` [PATCH 06/14] btrfs: move split_extent_map to extent_map.c Christoph Hellwig
@ 2023-05-25 10:58   ` Johannes Thumshirn
  0 siblings, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25 10:58 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

On 24.05.23 17:03, Christoph Hellwig wrote:
> +out_free_pre:
> +	free_extent_map(split_pre);
> +	return ret;
> +}
> +

Nit: additional newline

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 09/14] btrfs: return void from btrfs_finish_ordered_io
  2023-05-24 15:03 ` [PATCH 09/14] btrfs: return void from btrfs_finish_ordered_io Christoph Hellwig
@ 2023-05-25 11:02   ` Johannes Thumshirn
  2023-05-30 15:44   ` David Sterba
  1 sibling, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25 11:02 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 10/14] btrfs: split btrfs_alloc_ordered_extent
  2023-05-24 15:03 ` [PATCH 10/14] btrfs: split btrfs_alloc_ordered_extent Christoph Hellwig
@ 2023-05-25 12:09   ` Johannes Thumshirn
  0 siblings, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25 12:09 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 11/14] btrfs: atomically insert the new extent in btrfs_split_ordered_extent
  2023-05-24 15:03 ` [PATCH 11/14] btrfs: atomically insert the new extent in btrfs_split_ordered_extent Christoph Hellwig
@ 2023-05-25 12:30   ` Johannes Thumshirn
  2023-05-25 12:34     ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25 12:30 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

On 24.05.23 17:04, Christoph Hellwig wrote:
> @@ -1182,19 +1193,19 @@ btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
>  	if (node)
>  		btrfs_panic(fs_info, -EEXIST,
>  			"zoned: inconsistency in ordered tree at offset %llu",
> -			    ordered->file_offset);
> +			ordered->file_offset);
>  
> -	spin_unlock_irq(&tree->lock);
> -
> -	/*
> -	 * The splitting extent is already counted and will be added again in
> -	 * btrfs_alloc_ordered_extent(). Subtract len to avoid double counting.
> -	 */
> -	percpu_counter_add_batch(&fs_info->ordered_bytes, -len, fs_info->delalloc_batch);
> +	node = tree_insert(&tree->tree, new->file_offset, &new->rb_node);
> +	if (node)
> +		btrfs_panic(fs_info, -EEXIST,
> +			"zoned: inconsistency in ordered tree at offset %llu",
> +			new->file_offset);
> +	spin_unlock(&tree->lock);
>  
> -	return btrfs_alloc_ordered_extent(BTRFS_I(inode), file_offset, len, len,
> -					  disk_bytenr, len, 0, flags,
> -					  ordered->compress_type);
> +	list_add_tail(&new->root_extent_list, &root->ordered_extents);
> +	root->nr_ordered_extents++;
> +	spin_unlock_irq(&root->ordered_extent_lock);
> +	return new;
>  }

I wonder if we couldn't reduce the code duplication between btrfs_split_ordered_extent
and the new insert_ordered_extent function. The different lock ordering currently makes
it impossible, though.



* Re: [PATCH 11/14] btrfs: atomically insert the new extent in btrfs_split_ordered_extent
  2023-05-25 12:30   ` Johannes Thumshirn
@ 2023-05-25 12:34     ` Christoph Hellwig
  2023-05-25 16:23       ` Johannes Thumshirn
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-25 12:34 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Naohiro Aota, linux-btrfs

On Thu, May 25, 2023 at 12:30:41PM +0000, Johannes Thumshirn wrote:
> I wonder if we couldn't reduce the code duplication between btrfs_split_ordered_extent
> and the new insert_ordered_extent function. The different lock ordering currently makes
> it impossible, though.

The interesting thing about the split case is that we really want to
do a removal and two inserts in an atomic fashion.  In the end
there's not really much code in insert_ordered_extent anyway, and
the decision where to split up btrfs_alloc_ordered_extent was at
least partially based on that tradeoff.
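The "removal and two inserts in an atomic fashion" can be modeled in isolation. Below is a hypothetical, self-contained sketch (none of the names match the kernel code, and the tree is reduced to a flat array) of the invariant the single critical section preserves: after a split at `pre` bytes, the head and tail entries together still cover exactly the bytes of the original entry, with no intermediate state visible.

```c
/* Hypothetical model of the split invariant; not kernel code. */
#include <assert.h>
#include <stddef.h>

struct oe { unsigned long long off, len; };

/*
 * Split entries[idx] at 'pre' bytes: append a new entry for the head
 * and shrink the original to the tail.  In the kernel this happens
 * under tree->lock, so no lookup can observe the intermediate state.
 * Returns the new entry count.
 */
static size_t split_entry(struct oe *entries, size_t count, size_t idx,
			  unsigned long long pre)
{
	assert(pre > 0 && pre < entries[idx].len);
	entries[count].off = entries[idx].off;	/* new head entry */
	entries[count].len = pre;
	entries[idx].off += pre;		/* original keeps the tail */
	entries[idx].len -= pre;
	return count + 1;
}

/* Invariant checker: total bytes covered must not change. */
static unsigned long long total_bytes(const struct oe *e, size_t n)
{
	unsigned long long sum = 0;
	for (size_t i = 0; i < n; i++)
		sum += e[i].len;
	return sum;
}
```

The checker makes the tradeoff explicit: both inserts have to land before the lock is dropped, or `total_bytes()` would transiently disagree with the original extent.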


* Re: [PATCH 12/14] btrfs: handle completed ordered extents in btrfs_split_ordered_extent
  2023-05-24 15:03 ` [PATCH 12/14] btrfs: handle completed ordered extents " Christoph Hellwig
@ 2023-05-25 13:06   ` Johannes Thumshirn
  0 siblings, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25 13:06 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 11/14] btrfs: atomically insert the new extent in btrfs_split_ordered_extent
  2023-05-25 12:34     ` Christoph Hellwig
@ 2023-05-25 16:23       ` Johannes Thumshirn
  0 siblings, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25 16:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Naohiro Aota, linux-btrfs

On 25.05.23 14:34, Christoph Hellwig wrote:
> On Thu, May 25, 2023 at 12:30:41PM +0000, Johannes Thumshirn wrote:
>> I wonder if we couldn't reduce the code duplication between btrfs_split_ordered_extent
>> and the new insert_ordered_extent function. The different lock ordering currently makes
>> it impossible, though.
> 
> The interesting thing about the split case is that we really want to
> do a removal and two inserts in an atomic fashion.  In the end
> there's not really much code in insert_ordered_extent anyway, and
> the decision where to split up btrfs_alloc_ordered_extent was at
> least partially based on that tradeoff.
> 

Yes unfortunately :(

Anyways,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 13/14] btrfs: defer splitting of ordered extents until I/O completion
  2023-05-24 15:03 ` [PATCH 13/14] btrfs: defer splitting of ordered extents until I/O completion Christoph Hellwig
@ 2023-05-25 16:25   ` Johannes Thumshirn
  2023-05-30 18:40   ` David Sterba
  1 sibling, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25 16:25 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 14/14] btrfs: pass the new logical address to split_extent_map
  2023-05-24 15:03 ` [PATCH 14/14] btrfs: pass the new logical address to split_extent_map Christoph Hellwig
@ 2023-05-25 16:28   ` Johannes Thumshirn
  0 siblings, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-25 16:28 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: don't split ordered_extents for zoned writes at I/O submission time
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (13 preceding siblings ...)
  2023-05-24 15:03 ` [PATCH 14/14] btrfs: pass the new logical address to split_extent_map Christoph Hellwig
@ 2023-05-30 13:21 ` Johannes Thumshirn
  2023-05-30 14:20   ` Christoph Hellwig
  2023-05-30 18:48 ` David Sterba
  15 siblings, 1 reply; 37+ messages in thread
From: Johannes Thumshirn @ 2023-05-30 13:21 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Naohiro Aota, linux-btrfs, Matias Bjørling, Damien Le Moal

On 24.05.23 17:03, Christoph Hellwig wrote:
> Hi all,
> 
> this series removes the unconditional splitting of ordered extents for
> zoned writes at I/O submission time, and instead splits them only when
> actually needed in the I/O completion handler.
> 
> This something I've been wanting to do for a while as it is a lot more
> efficient, but it is also really needed to make the ordered_extent
> pointer in the btrfs_bio work for zoned file systems, which made it a bit
> more urgent.  This series also borrow two patches from the series that
> adds the ordered_extent pointer to the btrfs_bio.
> 
> Currently due to the submission time splitting, there is one extra lookup
> in both the ordered extent tree and inode extent tree for each bio
> submission, and extra ordered extent lookup in the bio completion
> handler, and two extent tree lookups per bio / split ordered extent in
> the ordered extent completion handler.  With this series this is reduced
> to a single inode extent tree lookup for the best case, with an extra
> inode extent tree and ordered extent lookup when a split actually needs
> to occur due to reordering inside an ordered_extent.

Btw, as this came up multiple times today, should we also add a patch on top
of this series merging consecutive ordered extents if the on-disk extents are
consecutive as well?

This would reduce fragmentation that unfortunately is a byproduct of the
Zone Append writing scheme.

I can take a look at this if you want?

Byte,
	Johannes



* Re: don't split ordered_extents for zoned writes at I/O submission time
  2023-05-30 13:21 ` don't split ordered_extents for zoned writes at I/O submission time Johannes Thumshirn
@ 2023-05-30 14:20   ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-30 14:20 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Naohiro Aota, linux-btrfs, Matias Bjørling, Damien Le Moal

On Tue, May 30, 2023 at 01:21:14PM +0000, Johannes Thumshirn wrote:
> Btw, as this came up multiple times today, should we also add a patch on top
> of this series merging consecutive ordered extents if the on disk extents are
> consecutive as well?
> 
> This would reduce fragmentation that unfortunately is a byproduct of the
> Zone Append writing scheme.

With this series we don't split ordered extents for zoned writes unless
they are non-consecutive.  So I'm not sure what we should merge, as
there isn't any splitting going on unless either a block group didn't
have space, or zone append actually reordered.
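To make the "only split on reordering" point concrete, here is a hypothetical stand-alone sketch (not kernel code) of the check the completion handler performs while walking the ordered csums: a split is only needed where a sum's logical address is not contiguous with the end of the previous one.

```c
/* Hypothetical model of the completion-time contiguity walk. */
#include <assert.h>
#include <stddef.h>

struct sum { unsigned long long logical, len; };

/*
 * Return how many splits the ordered extent would need, i.e. the
 * number of physically contiguous runs minus one.  An in-order
 * sequence of zone-append completions yields zero splits.
 */
static size_t count_zoned_splits(const struct sum *sums, size_t n)
{
	size_t splits = 0;

	for (size_t i = 1; i < n; i++)
		if (sums[i].logical != sums[i - 1].logical + sums[i - 1].len)
			splits++;
	return splits;
}
```

With all bios completing in order the run count is one and nothing is split, matching the "best case" single extent tree lookup described in the cover letter.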


* Re: [PATCH 09/14] btrfs: return void from btrfs_finish_ordered_io
  2023-05-24 15:03 ` [PATCH 09/14] btrfs: return void from btrfs_finish_ordered_io Christoph Hellwig
  2023-05-25 11:02   ` Johannes Thumshirn
@ 2023-05-30 15:44   ` David Sterba
  2023-05-31  4:00     ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: David Sterba @ 2023-05-30 15:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn,
	Naohiro Aota, linux-btrfs

On Wed, May 24, 2023 at 05:03:12PM +0200, Christoph Hellwig wrote:
> The callers don't check the btrfs_finish_ordered_io return value, so
> drop it.

Same general comments like in
https://lore.kernel.org/linux-btrfs/20230530150359.GS575@twin.jikos.cz/

"Function can return void if none of its callees return an error,
 directly or indirectly, there are no BUG_ONs left to be turned to
 proper error handling or there's no missing error handling"

btrfs_finish_ordered_io mixes a few error handling styles: there's
direct return -ERROR, transaction abort and mapping_set_error. Some of
the called functions do not handle every error properly, and at least
btrfs_free_reserved_extent() returns an error that is not handled.

I'm not counting the state bit handlers (clear_extent_bit) as we know
they "should not fail". unpin_extent_cache() does not look clean either.

If 'callers don't check error values' the question is 'Should they?'.


* Re: [PATCH 03/14] btrfs: mark the len field in struct btrfs_ordered_sum as unsigned
  2023-05-24 15:03 ` [PATCH 03/14] btrfs: mark the len field in struct btrfs_ordered_sum as unsigned Christoph Hellwig
  2023-05-25  8:33   ` Johannes Thumshirn
@ 2023-05-30 16:45   ` David Sterba
  1 sibling, 0 replies; 37+ messages in thread
From: David Sterba @ 2023-05-30 16:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn,
	Naohiro Aota, linux-btrfs

On Wed, May 24, 2023 at 05:03:06PM +0200, Christoph Hellwig wrote:
> len can't ever be negative, so mark it as an u32 instead of int.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/ordered-data.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index f0f1138d23c331..2e54820a5e6ff7 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -20,7 +20,7 @@ struct btrfs_ordered_sum {
>  	/*
>  	 * this is the length in bytes covered by the sums array below.
>  	 */
> -	int len;
> +	u32 len;

Due to the int there was one cast to be removed

--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -561,7 +561,7 @@ int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
                        }
 
                        sums->bytenr = start;
-                       sums->len = (int)size;
+                       sums->len = size;


* Re: [PATCH 13/14] btrfs: defer splitting of ordered extents until I/O completion
  2023-05-24 15:03 ` [PATCH 13/14] btrfs: defer splitting of ordered extents until I/O completion Christoph Hellwig
  2023-05-25 16:25   ` Johannes Thumshirn
@ 2023-05-30 18:40   ` David Sterba
  1 sibling, 0 replies; 37+ messages in thread
From: David Sterba @ 2023-05-30 18:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn,
	Naohiro Aota, linux-btrfs

On Wed, May 24, 2023 at 05:03:16PM +0200, Christoph Hellwig wrote:
> The btrfs zoned completion code currently needs an ordered_extent and
> extent_map per bio so that it can account for the non-predictable
> write location from Zone Append.  To archive that it currently splits
> the ordered_extent and extent_map at I/O submission time, and then
> records the actual physical address in the ->physical field of the
> ordered_extent.
> 
> This patch instead switches to record the "original" physical address
> that the btrfs allocator assigned in spare space in the btrfs_bio,
> and then rewrites the logical address in the btrfs_ordered_sum
> structure at I/O completion time.  This allows the ordered extent
> completion handler to simply walk the list of ordered csums and
> split the ordered extent as needed.  This removes an extra ordered
> extent and extent_map lookup and manipulation during the I/O
> submission path, and instead batches it in the I/O completion path
> where we need to touch these anyway.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/bio.c          | 17 ------------
>  fs/btrfs/btrfs_inode.h  |  2 --
>  fs/btrfs/inode.c        | 18 ++++++++-----
>  fs/btrfs/ordered-data.h |  1 +
>  fs/btrfs/zoned.c        | 57 ++++++++++++++++++++++++++++++++++++-----
>  fs/btrfs/zoned.h        |  6 ++---
>  6 files changed, 65 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 8a4d3b707dd1b2..ae6345668d2d01 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -61,20 +61,6 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
>  	return bbio;
>  }
>  
> -static blk_status_t btrfs_bio_extract_ordered_extent(struct btrfs_bio *bbio)
> -{
> -	struct btrfs_ordered_extent *ordered;
> -	int ret;
> -
> -	ordered = btrfs_lookup_ordered_extent(bbio->inode, bbio->file_offset);
> -	if (WARN_ON_ONCE(!ordered))
> -		return BLK_STS_IOERR;
> -	ret = btrfs_extract_ordered_extent(bbio, ordered);
> -	btrfs_put_ordered_extent(ordered);
> -
> -	return errno_to_blk_status(ret);
> -}
> -
>  static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
>  					 struct btrfs_bio *orig_bbio,
>  					 u64 map_length, bool use_append)
> @@ -667,9 +653,6 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
>  		if (use_append) {
>  			bio->bi_opf &= ~REQ_OP_WRITE;
>  			bio->bi_opf |= REQ_OP_ZONE_APPEND;
> -			ret = btrfs_bio_extract_ordered_extent(bbio);
> -			if (ret)
> -				goto fail_put_bio;
>  		}
>  
>  		/*
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 08c99602339408..8abf96cfea8fae 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -410,8 +410,6 @@ static inline bool btrfs_inode_can_compress(const struct btrfs_inode *inode)
>  
>  int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
>  			    u32 pgoff, u8 *csum, const u8 * const csum_expected);
> -int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
> -				 struct btrfs_ordered_extent *ordered);
>  bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
>  			u32 bio_offset, struct bio_vec *bv);
>  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index cee71eaec7cff9..eee4eefb279780 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2714,8 +2714,8 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
>  	}
>  }
>  
> -int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
> -				 struct btrfs_ordered_extent *ordered)
> +static int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
> +					struct btrfs_ordered_extent *ordered)
>  {
>  	u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
>  	u64 len = bbio->bio.bi_iter.bi_size;
> @@ -3180,7 +3180,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
>   * an ordered extent if the range of bytes in the file it covers are
>   * fully written.
>   */
> -void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
> +void btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>  {
>  	struct btrfs_inode *inode = BTRFS_I(ordered_extent->inode);
>  	struct btrfs_root *root = inode->root;
> @@ -3215,11 +3215,9 @@ void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
>  		goto out;
>  	}
>  
> -	if (btrfs_is_zoned(fs_info)) {
> -		btrfs_rewrite_logical_zoned(ordered_extent);
> +	if (btrfs_is_zoned(fs_info))
>  		btrfs_zone_finish_endio(fs_info, ordered_extent->disk_bytenr,
>  					ordered_extent->disk_num_bytes);
> -	}
>  
>  	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
>  		truncated = true;
> @@ -3385,6 +3383,14 @@ void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
>  	btrfs_put_ordered_extent(ordered_extent);
>  }
>  
> +void btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
> +{
> +	if (btrfs_is_zoned(btrfs_sb(ordered->inode->i_sb)) &&
> +	    !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags))
> +		btrfs_finish_ordered_zoned(ordered);
> +	btrfs_finish_one_ordered(ordered);

I've left out the void type change of btrfs_finish_ordered_io in the
previous patch, so to keep the same semantics I've changed this back to
int so that btrfs_finish_ordered_io forwards the return value of
btrfs_finish_one_ordered(). This has no significant effect on the
patchset and I'd like to deal with the error handling separately.


* Re: don't split ordered_extents for zoned writes at I/O submission time
  2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
                   ` (14 preceding siblings ...)
  2023-05-30 13:21 ` don't split ordered_extents for zoned writes at I/O submission time Johannes Thumshirn
@ 2023-05-30 18:48 ` David Sterba
  15 siblings, 0 replies; 37+ messages in thread
From: David Sterba @ 2023-05-30 18:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn,
	Naohiro Aota, linux-btrfs

On Wed, May 24, 2023 at 05:03:03PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series removes the unconditional splitting of ordered extents for
> zoned writes at I/O submission time, and instead splits them only when
> actually needed in the I/O completion handler.
> 
> This something I've been wanting to do for a while as it is a lot more
> efficient, but it is also really needed to make the ordered_extent
> pointer in the btrfs_bio work for zoned file systems, which made it a bit
> more urgent.  This series also borrow two patches from the series that
> adds the ordered_extent pointer to the btrfs_bio.
> 
> Currently, due to the submission-time splitting, there is one extra lookup
> in both the ordered extent tree and the inode extent tree for each bio
> submission, an extra ordered extent lookup in the bio completion
> handler, and two extent tree lookups per bio / split ordered extent in
> the ordered extent completion handler.  With this series this is reduced
> to a single inode extent tree lookup in the best case, with an extra
> inode extent tree and ordered extent lookup when a split actually needs
> to occur due to reordering inside an ordered_extent.
> 
> Diffstat:
>  bio.c          |   20 -----
>  bio.h          |   17 +++-
>  btrfs_inode.h  |    2 
>  extent_map.c   |  101 ++++++++++++++++++++++++++--
>  extent_map.h   |    6 -
>  file-item.c    |   15 +++-
>  fs.h           |    2 
>  inode.c        |  138 +++++++-------------------------------
>  ordered-data.c |  205 +++++++++++++++++++++++++++++++++++----------------------
>  ordered-data.h |   22 ++----
>  relocation.c   |    4 -
>  tree-log.c     |   12 +--
>  zoned.c        |   94 ++++++++++++++------------
>  zoned.h        |    6 -
>  14 files changed, 352 insertions(+), 292 deletions(-)

Added to misc-next, thanks. The left-out patches and the adjustments are
commented on under the patches. I haven't tested this much yet but overall
it's straightforward and is more relevant for zoned.


* Re: [PATCH 09/14] btrfs: return void from btrfs_finish_ordered_io
  2023-05-30 15:44   ` David Sterba
@ 2023-05-31  4:00     ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2023-05-31  4:00 UTC (permalink / raw)
  To: David Sterba
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Johannes Thumshirn, Naohiro Aota, linux-btrfs

On Tue, May 30, 2023 at 05:44:15PM +0200, David Sterba wrote:
> On Wed, May 24, 2023 at 05:03:12PM +0200, Christoph Hellwig wrote:
> > The callers don't check the btrfs_finish_ordered_io return value, so
> > drop it.
> 
> Same general comments like in
> https://lore.kernel.org/linux-btrfs/20230530150359.GS575@twin.jikos.cz/
> 
> "Function can return void if none of its callees return an error,
>  directly or indirectly, there are no BUG_ONs left to be turned to
>  proper error handling or there's no missing error handling"
> 
> btrfs_finish_ordered_io mixes a few error handling styles: there's
> direct return -ERROR, transaction abort, and mapping_set_error.  Some
> called functions do not handle every error properly, and at least
> btrfs_free_reserved_extent() returns an error that is not handled.
> 
> I'm not counting the state bit handlers (clear_extent_bit) as we know
> they "should not fail". unpin_extent_cache() does not look clean either.
> 
> If 'callers don't check error values' the question is 'Should they?'.

The clear answer is no, as we're in an I/O completion handler where
there is no one we could return them to.  The errors are propagated
through the mapping state.


* Re: [PATCH 05/14] btrfs: optimize the logical to physical mapping for zoned writes
  2023-05-24 15:03 ` [PATCH 05/14] btrfs: optimize the logical to physical mapping for zoned writes Christoph Hellwig
  2023-05-25 10:56   ` Johannes Thumshirn
@ 2023-08-18 14:03   ` Naohiro Aota
  1 sibling, 0 replies; 37+ messages in thread
From: Naohiro Aota @ 2023-08-18 14:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn,
	linux-btrfs, Shinichiro Kawasaki

On Wed, May 24, 2023 at 05:03:08PM +0200, Christoph Hellwig wrote:
> The current code to store the final logical to physical mapping for a
> zone append write in the extent tree is rather inefficient.  It first has
> to split the ordered extent so that there is one ordered extent per bio,
> so that it can look up the ordered extent on I/O completion in
> btrfs_record_physical_zoned and store the physical LBA returned by the
> block driver in the ordered extent.
> 
> btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to
> see what physical address the logical address for this bio / ordered
> extent is mapped to, and then rewrite it in the extent tree.
> 
> To optimize this process, we can store the physical address assigned in
> the chunk tree to the original logical address and a pointer to the
> btrfs_ordered_sum structure in the btrfs_bio structure, and then use
> this information to rewrite the logical address in the btrfs_ordered_sum
> structure directly at I/O completion time in btrfs_record_physical_zoned.
> btrfs_rewrite_logical_zoned then simply updates the logical address in
> the extent tree and the ordered_extent itself.
> 
> The code in btrfs_rewrite_logical_zoned now runs for all data I/O
> completions in zoned file systems, which is fine as there is no remapping
> to do for non-append writes to conventional zones or for relocation, and
> the overhead for quickly breaking out of the loop is very low.
> 
> Note that the btrfs_bio doesn't grow, as the new fields are placed into
> a union that is so far not used for data writes and has plenty of space
> left in it.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/bio.c          |  1 +
>  fs/btrfs/bio.h          | 17 +++++++++++---
>  fs/btrfs/file-item.c    |  7 ++++++
>  fs/btrfs/inode.c        |  6 +----
>  fs/btrfs/ordered-data.c |  1 -
>  fs/btrfs/ordered-data.h |  6 -----
>  fs/btrfs/zoned.c        | 51 ++++++++---------------------------------
>  7 files changed, 33 insertions(+), 56 deletions(-)
> 
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 5fad6e032e6c76..8a4d3b707dd1b2 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -431,6 +431,7 @@ static void btrfs_submit_dev_bio(struct btrfs_device *dev, struct bio *bio)
>  		u64 zone_start = round_down(physical, dev->fs_info->zone_size);
>  
>  		ASSERT(btrfs_dev_is_sequential(dev, physical));
> +		btrfs_bio(bio)->orig_physical = physical;
>  		bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
>  	}
>  	btrfs_debug_in_rcu(dev->fs_info,
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index a8eca3a6567320..8a29980159b404 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -39,8 +39,8 @@ struct btrfs_bio {
>  
>  	union {
>  		/*
> -		 * Data checksumming and original I/O information for internal
> -		 * use in the btrfs_submit_bio machinery.
> +		 * For data reads: checksumming and original I/O information.
> +		 * (for internal use in the btrfs_submit_bio machinery only)
>  		 */
>  		struct {
>  			u8 *csum;
> @@ -48,7 +48,18 @@ struct btrfs_bio {
>  			struct bvec_iter saved_iter;
>  		};
>  
> -		/* For metadata parentness verification. */
> +		/*
> +		 * For data writes:
> +		 * - pointer to the checksums for this bio
> +		 * - original physical address from the allocator
> +		 *   (for zone append only)
> +		 */
> +		struct {
> +			struct btrfs_ordered_sum *sums;
> +			u64 orig_physical;
> +		};
> +
> +		/* For metadata reads: parentness verification. */
>  		struct btrfs_tree_parent_check parent_check;
>  	};
>  
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index ca321126816e94..96b54d73ba376d 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -818,6 +818,13 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_bio *bbio)
>  
>  	}
>  	this_sum_bytes = 0;
> +
> +	/*
> +	 * The ->sums assignment is for zoned writes, where a bio never spans
> +	 * ordered extents and only done unconditionally because that's cheaper
> +	 * than a branch.
> +	 */
> +	bbio->sums = sums;
>  	btrfs_add_ordered_sum(ordered, sums);
>  	btrfs_put_ordered_extent(ordered);
>  	return 0;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 253ad6206179ce..2eee57d07d3632 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3302,14 +3302,10 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
>  		goto out;
>  	}
>  
> -	/* A valid ->physical implies a write on a sequential zone. */
> -	if (ordered_extent->physical != (u64)-1) {
> +	if (btrfs_is_zoned(fs_info)) {
>  		btrfs_rewrite_logical_zoned(ordered_extent);
>  		btrfs_zone_finish_endio(fs_info, ordered_extent->disk_bytenr,
>  					ordered_extent->disk_num_bytes);
> -	} else if (btrfs_is_data_reloc_root(inode->root)) {
> -		btrfs_zone_finish_endio(fs_info, ordered_extent->disk_bytenr,
> -					ordered_extent->disk_num_bytes);
>  	}
>  
>  	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index a9778a91511e19..324a5a8c844a72 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -209,7 +209,6 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
>  	entry->compress_type = compress_type;
>  	entry->truncated_len = (u64)-1;
>  	entry->qgroup_rsv = ret;
> -	entry->physical = (u64)-1;
>  
>  	ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
>  	entry->flags = flags;
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index ebc980ac967ad4..dc700aa515b58b 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -151,12 +151,6 @@ struct btrfs_ordered_extent {
>  	struct completion completion;
>  	struct btrfs_work flush_work;
>  	struct list_head work_list;
> -
> -	/*
> -	 * Used to reverse-map physical address returned from ZONE_APPEND write
> -	 * command in a workqueue context
> -	 */
> -	u64 physical;
>  };
>  
>  static inline void
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 76411d96fa7afc..e838c2037634c2 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1657,64 +1657,33 @@ bool btrfs_use_zone_append(struct btrfs_bio *bbio)
>  void btrfs_record_physical_zoned(struct btrfs_bio *bbio)
>  {
>  	const u64 physical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
> -	struct btrfs_ordered_extent *ordered;
> +	struct btrfs_ordered_sum *sum = bbio->sums;
>  
> -	ordered = btrfs_lookup_ordered_extent(bbio->inode, bbio->file_offset);
> -	if (WARN_ON(!ordered))
> -		return;
> -
> -	ordered->physical = physical;
> -	btrfs_put_ordered_extent(ordered);
> +	if (physical < bbio->orig_physical)
> +		sum->logical -= bbio->orig_physical - physical;
> +	else
> +		sum->logical += physical - bbio->orig_physical;
>  }
>  
>  void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
>  {
> -	struct btrfs_inode *inode = BTRFS_I(ordered->inode);
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct extent_map_tree *em_tree;
> +	struct extent_map_tree *em_tree = &BTRFS_I(ordered->inode)->extent_tree;
>  	struct extent_map *em;
> -	struct btrfs_ordered_sum *sum;
> -	u64 orig_logical = ordered->disk_bytenr;
> -	struct map_lookup *map;
> -	u64 physical = ordered->physical;
> -	u64 chunk_start_phys;
> -	u64 logical;
> -
> -	em = btrfs_get_chunk_map(fs_info, orig_logical, 1);
> -	if (IS_ERR(em))
> -		return;
> -	map = em->map_lookup;
> -	chunk_start_phys = map->stripes[0].physical;
> +	struct btrfs_ordered_sum *sum =
> +		list_first_entry(&ordered->list, typeof(*sum), list);
> +	u64 logical = sum->logical;

Here, this function does not consider the case of
list_empty(&ordered->list). As a result, "logical" will be some random
value.

Then, it sets the garbled value to ordered->disk_bytenr, and
btrfs_lookup_block_group() in btrfs_zone_finish_endio() fails to find a
block group and panics.

This behavior is reproduced by running fstests btrfs/028 several times
with a null_blk setup. The zone size and capacity are set to 64MB, and
the storage size is 5GB.

I still don't understand why this ordered_extent comes without any csums.

>  
> -	if (WARN_ON_ONCE(map->num_stripes > 1) ||
> -	    WARN_ON_ONCE((map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) != 0) ||
> -	    WARN_ON_ONCE(physical < chunk_start_phys) ||
> -	    WARN_ON_ONCE(physical > chunk_start_phys + em->orig_block_len)) {
> -		free_extent_map(em);
> -		return;
> -	}
> -	logical = em->start + (physical - map->stripes[0].physical);
> -	free_extent_map(em);
> -
> -	if (orig_logical == logical)
> +	if (ordered->disk_bytenr == logical)
>  		return;
>  
>  	ordered->disk_bytenr = logical;
>  
> -	em_tree = &inode->extent_tree;
>  	write_lock(&em_tree->lock);
>  	em = search_extent_mapping(em_tree, ordered->file_offset,
>  				   ordered->num_bytes);
>  	em->block_start = logical;
>  	free_extent_map(em);
>  	write_unlock(&em_tree->lock);
> -
> -	list_for_each_entry(sum, &ordered->list, list) {
> -		if (logical < orig_logical)
> -			sum->logical -= orig_logical - logical;
> -		else
> -			sum->logical += logical - orig_logical;
> -	}
>  }
>  
>  bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
> -- 
> 2.39.2
> 


end of thread, other threads:[~2023-08-18 14:04 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-24 15:03 don't split ordered_extents for zoned writes at I/O submission time Christoph Hellwig
2023-05-24 15:03 ` [PATCH 01/14] btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED Christoph Hellwig
2023-05-25  8:33   ` Johannes Thumshirn
2023-05-24 15:03 ` [PATCH 02/14] btrfs: don't call btrfs_record_physical_zoned for failed append Christoph Hellwig
2023-05-25  8:33   ` Johannes Thumshirn
2023-05-24 15:03 ` [PATCH 03/14] btrfs: mark the len field in struct btrfs_ordered_sum as unsigned Christoph Hellwig
2023-05-25  8:33   ` Johannes Thumshirn
2023-05-30 16:45   ` David Sterba
2023-05-24 15:03 ` [PATCH 04/14] btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical Christoph Hellwig
2023-05-25  8:33   ` Johannes Thumshirn
2023-05-24 15:03 ` [PATCH 05/14] btrfs: optimize the logical to physical mapping for zoned writes Christoph Hellwig
2023-05-25 10:56   ` Johannes Thumshirn
2023-08-18 14:03   ` Naohiro Aota
2023-05-24 15:03 ` [PATCH 06/14] btrfs: move split_extent_map to extent_map.c Christoph Hellwig
2023-05-25 10:58   ` Johannes Thumshirn
2023-05-24 15:03 ` [PATCH 07/14] btrfs: reorder btrfs_extract_ordered_extent Christoph Hellwig
2023-05-24 15:03 ` [PATCH 08/14] btrfs: return the new ordered_extent from btrfs_split_ordered_extent Christoph Hellwig
2023-05-24 15:03 ` [PATCH 09/14] btrfs: return void from btrfs_finish_ordered_io Christoph Hellwig
2023-05-25 11:02   ` Johannes Thumshirn
2023-05-30 15:44   ` David Sterba
2023-05-31  4:00     ` Christoph Hellwig
2023-05-24 15:03 ` [PATCH 10/14] btrfs: split btrfs_alloc_ordered_extent Christoph Hellwig
2023-05-25 12:09   ` Johannes Thumshirn
2023-05-24 15:03 ` [PATCH 11/14] btrfs: atomically insert the new extent in btrfs_split_ordered_extent Christoph Hellwig
2023-05-25 12:30   ` Johannes Thumshirn
2023-05-25 12:34     ` Christoph Hellwig
2023-05-25 16:23       ` Johannes Thumshirn
2023-05-24 15:03 ` [PATCH 12/14] btrfs: handle completed ordered extents " Christoph Hellwig
2023-05-25 13:06   ` Johannes Thumshirn
2023-05-24 15:03 ` [PATCH 13/14] btrfs: defer splitting of ordered extents until I/O completion Christoph Hellwig
2023-05-25 16:25   ` Johannes Thumshirn
2023-05-30 18:40   ` David Sterba
2023-05-24 15:03 ` [PATCH 14/14] btrfs: pass the new logical address to split_extent_map Christoph Hellwig
2023-05-25 16:28   ` Johannes Thumshirn
2023-05-30 13:21 ` don't split ordered_extents for zoned writes at I/O submission time Johannes Thumshirn
2023-05-30 14:20   ` Christoph Hellwig
2023-05-30 18:48 ` David Sterba
