linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* btrfs: fix corruption caused by partial dio writes v6
@ 2023-03-24  2:31 Christoph Hellwig
  2023-03-24  2:31 ` [PATCH 01/11] btrfs: add function to create and return an ordered extent Christoph Hellwig
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:31 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

[this is a resend of the series from Boris, with my changes to avoid
 the three-way split in btrfs_extract_ordered_extent inserted in the
 middle]

The final patch in this series ensures that bios submitted by btrfs dio
match the corresponding ordered_extent and extent_map exactly. As a
result, there is no hole or deadlock as a result of partial writes, even
if the write buffer is a partly overlapping mapping of the range being
written to.

This required a bit of refactoring and setup. Specifically, the zoned
device code for "extracting" an ordered extent matching a bio could be
reused with some refactoring to return the new ordered extents after the
split.

Changes since v5:
 - avoid three-way splits in btrfs_extract_ordered_extent

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 01/11] btrfs: add function to create and return an ordered extent
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
@ 2023-03-24  2:31 ` Christoph Hellwig
  2023-03-24  5:47   ` Naohiro Aota
  2023-03-24  2:31 ` [PATCH 02/11] btrfs: pass flags as unsigned long to btrfs_add_ordered_extent Christoph Hellwig
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:31 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs,
	Filipe Manana

From: Boris Burkov <boris@bur.io>

Currently, btrfs_add_ordered_extent allocates a new ordered extent, adds
it to the rb_tree, but doesn't return a referenced pointer to the
caller. There are cases where it is useful for the creator of a new
ordered_extent to hang on to such a pointer, so add a new function
btrfs_alloc_ordered_extent which is the same as
btrfs_add_ordered_extent, except it takes an additional reference count
and returns a pointer to the ordered_extent. Implement
btrfs_add_ordered_extent as btrfs_alloc_ordered_extent followed by
dropping the new reference and handling the IS_ERR case.

The type of flags in btrfs_alloc_ordered_extent and
btrfs_add_ordered_extent is changed from unsigned int to unsigned long
so it's unified with the other ordered extent functions.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ordered-data.c | 46 +++++++++++++++++++++++++++++++++--------
 fs/btrfs/ordered-data.h |  5 +++++
 2 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 6c24b69e2d0a37..83a51c692406ab 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -160,14 +160,16 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
  * @compress_type:   Compression algorithm used for data.
  *
  * Most of these parameters correspond to &struct btrfs_file_extent_item. The
- * tree is given a single reference on the ordered extent that was inserted.
+ * tree is given a single reference on the ordered extent that was inserted, and
+ * the returned pointer is given a second reference.
  *
- * Return: 0 or -ENOMEM.
+ * Return: the new ordered extent or ERR_PTR(-ENOMEM).
  */
-int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-			     u64 disk_num_bytes, u64 offset, unsigned flags,
-			     int compress_type)
+struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
+			struct btrfs_inode *inode, u64 file_offset,
+			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			u64 disk_num_bytes, u64 offset, unsigned long flags,
+			int compress_type)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -181,7 +183,7 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 		/* For nocow write, we can release the qgroup rsv right now */
 		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
 		if (ret < 0)
-			return ret;
+			return ERR_PTR(ret);
 		ret = 0;
 	} else {
 		/*
@@ -190,11 +192,11 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 		 */
 		ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes);
 		if (ret < 0)
-			return ret;
+			return ERR_PTR(ret);
 	}
 	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
 	if (!entry)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	entry->file_offset = file_offset;
 	entry->num_bytes = num_bytes;
@@ -256,6 +258,32 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 	btrfs_mod_outstanding_extents(inode, 1);
 	spin_unlock(&inode->lock);
 
+	/* One ref for the returned entry to match semantics of lookup. */
+	refcount_inc(&entry->refs);
+
+	return entry;
+}
+
+/*
+ * Add a new btrfs_ordered_extent for the range, but drop the reference instead
+ * of returning it to the caller.
+ */
+int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
+			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			     u64 disk_num_bytes, u64 offset, unsigned flags,
+			     int compress_type)
+{
+	struct btrfs_ordered_extent *ordered;
+
+	ordered = btrfs_alloc_ordered_extent(inode, file_offset, num_bytes,
+					     ram_bytes, disk_bytenr,
+					     disk_num_bytes, offset, flags,
+					     compress_type);
+
+	if (IS_ERR(ordered))
+		return PTR_ERR(ordered);
+	btrfs_put_ordered_extent(ordered);
+
 	return 0;
 }
 
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index eb40cb39f842e6..c00a5a3f060fa2 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -178,6 +178,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 				    struct btrfs_ordered_extent **cached,
 				    u64 file_offset, u64 io_size);
+struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
+			struct btrfs_inode *inode, u64 file_offset,
+			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			u64 disk_num_bytes, u64 offset, unsigned long flags,
+			int compress_type);
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
 			     u64 disk_num_bytes, u64 offset, unsigned flags,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 02/11] btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
  2023-03-24  2:31 ` [PATCH 01/11] btrfs: add function to create and return an ordered extent Christoph Hellwig
@ 2023-03-24  2:31 ` Christoph Hellwig
  2023-03-24  4:53   ` Naohiro Aota
  2023-03-24  2:31 ` [PATCH 03/11] btrfs: stash ordered extent in dio_data during iomap dio Christoph Hellwig
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:31 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

From: Boris Burkov <boris@bur.io>

The ordered_extent flags are declared as unsigned long, so pass them as
such to btrfs_add_ordered_extent.

Signed-off-by: Boris Burkov <boris@bur.io>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ordered-data.c | 2 +-
 fs/btrfs/ordered-data.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 83a51c692406ab..1848d0d1a9c41e 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -270,7 +270,7 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
  */
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-			     u64 disk_num_bytes, u64 offset, unsigned flags,
+			     u64 disk_num_bytes, u64 offset, unsigned long flags,
 			     int compress_type)
 {
 	struct btrfs_ordered_extent *ordered;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index c00a5a3f060fa2..18007f9c00add8 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -185,7 +185,7 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
 			int compress_type);
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-			     u64 disk_num_bytes, u64 offset, unsigned flags,
+			     u64 disk_num_bytes, u64 offset, unsigned long flags,
 			     int compress_type);
 void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 03/11] btrfs: stash ordered extent in dio_data during iomap dio
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
  2023-03-24  2:31 ` [PATCH 01/11] btrfs: add function to create and return an ordered extent Christoph Hellwig
  2023-03-24  2:31 ` [PATCH 02/11] btrfs: pass flags as unsigned long to btrfs_add_ordered_extent Christoph Hellwig
@ 2023-03-24  2:31 ` Christoph Hellwig
  2023-03-24  7:41   ` Naohiro Aota
  2023-03-24  2:32 ` [PATCH 04/11] btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent Christoph Hellwig
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:31 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

From: Boris Burkov <boris@bur.io>

While it is not feasible for an ordered extent to survive across the
calls btrfs_direct_write makes into __iomap_dio_rw, it is still helpful
to stash it on the dio_data in between creating it in iomap_begin and
finishing it in either end_io or iomap_end.

The specific use I have in mind is that we can check if a partcular bio
is partial in submit_io without unconditionally looking up the ordered
extent. This is a preparatory patch for a later patch which does just
that.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 865d56ff2ce150..a92f482e898950 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -81,6 +81,7 @@ struct btrfs_dio_data {
 	struct extent_changeset *data_reserved;
 	bool data_space_reserved;
 	bool nocow_done;
+	struct btrfs_ordered_extent *ordered;
 };
 
 struct btrfs_dio_private {
@@ -6965,6 +6966,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 }
 
 static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
+						  struct btrfs_dio_data *dio_data,
 						  const u64 start,
 						  const u64 len,
 						  const u64 orig_start,
@@ -6975,7 +6977,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 						  const int type)
 {
 	struct extent_map *em = NULL;
-	int ret;
+	struct btrfs_ordered_extent *ordered;
 
 	if (type != BTRFS_ORDERED_NOCOW) {
 		em = create_io_em(inode, start, len, orig_start, block_start,
@@ -6985,18 +6987,21 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 		if (IS_ERR(em))
 			goto out;
 	}
-	ret = btrfs_add_ordered_extent(inode, start, len, len, block_start,
-				       block_len, 0,
-				       (1 << type) |
-				       (1 << BTRFS_ORDERED_DIRECT),
-				       BTRFS_COMPRESS_NONE);
-	if (ret) {
+	ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
+					     block_start, block_len, 0,
+					     (1 << type) |
+					     (1 << BTRFS_ORDERED_DIRECT),
+					     BTRFS_COMPRESS_NONE);
+	if (IS_ERR(ordered)) {
 		if (em) {
 			free_extent_map(em);
 			btrfs_drop_extent_map_range(inode, start,
 						    start + len - 1, false);
 		}
-		em = ERR_PTR(ret);
+		em = ERR_PTR(PTR_ERR(ordered));
+	} else {
+		ASSERT(!dio_data->ordered);
+		dio_data->ordered = ordered;
 	}
  out:
 
@@ -7004,6 +7009,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 }
 
 static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
+						  struct btrfs_dio_data *dio_data,
 						  u64 start, u64 len)
 {
 	struct btrfs_root *root = inode->root;
@@ -7019,7 +7025,8 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
 	if (ret)
 		return ERR_PTR(ret);
 
-	em = btrfs_create_dio_extent(inode, start, ins.offset, start,
+	em = btrfs_create_dio_extent(inode, dio_data,
+				     start, ins.offset, start,
 				     ins.objectid, ins.offset, ins.offset,
 				     ins.offset, BTRFS_ORDERED_REGULAR);
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
@@ -7364,7 +7371,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		}
 		space_reserved = true;
 
-		em2 = btrfs_create_dio_extent(BTRFS_I(inode), start, len,
+		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
 					      orig_start, block_start,
 					      len, orig_block_len,
 					      ram_bytes, type);
@@ -7406,7 +7413,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 			goto out;
 		space_reserved = true;
 
-		em = btrfs_new_extent_direct(BTRFS_I(inode), start, len);
+		em = btrfs_new_extent_direct(BTRFS_I(inode), dio_data, start, len);
 		if (IS_ERR(em)) {
 			ret = PTR_ERR(em);
 			goto out;
@@ -7712,6 +7719,10 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 				      pos + length - 1, NULL);
 		ret = -ENOTBLK;
 	}
+	if (write) {
+		btrfs_put_ordered_extent(dio_data->ordered);
+		dio_data->ordered = NULL;
+	}
 
 	if (write)
 		extent_changeset_free(dio_data->data_reserved);
@@ -7773,7 +7784,7 @@ static const struct iomap_dio_ops btrfs_dio_ops = {
 
 ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_before)
 {
-	struct btrfs_dio_data data;
+	struct btrfs_dio_data data = { 0 };
 
 	return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
 			    IOMAP_DIO_PARTIAL, &data, done_before);
@@ -7782,7 +7793,7 @@ ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_be
 struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 				  size_t done_before)
 {
-	struct btrfs_dio_data data;
+	struct btrfs_dio_data data = { 0 };
 
 	return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
 			    IOMAP_DIO_PARTIAL, &data, done_before);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 04/11] btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2023-03-24  2:31 ` [PATCH 03/11] btrfs: stash ordered extent in dio_data during iomap dio Christoph Hellwig
@ 2023-03-24  2:32 ` Christoph Hellwig
  2023-03-24  2:32 ` [PATCH 05/11] btrfs: simplify btrfs_extract_ordered_extent Christoph Hellwig
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:32 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

Move the three checks that are about ordered extent internal sanity checking
into btrfs_split_ordered_extent instead of doing them in the higher level
btrfs_extract_ordered_extent routine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c        | 18 ------------------
 fs/btrfs/ordered-data.c | 10 ++++++++++
 2 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a92f482e898950..4f2f1aafd1720e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2646,18 +2646,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 	if (ordered->disk_num_bytes == len)
 		goto out;
 
-	/* We cannot split once end_bio'd ordered extent */
-	if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes)) {
-		ret = -EINVAL;
-		goto out;
-	}
-
-	/* We cannot split a compressed ordered extent */
-	if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes)) {
-		ret = -EINVAL;
-		goto out;
-	}
-
 	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
 	/* bio must be in one ordered extent */
 	if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
@@ -2665,12 +2653,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 		goto out;
 	}
 
-	/* Checksum list should be empty */
-	if (WARN_ON_ONCE(!list_empty(&ordered->list))) {
-		ret = -EINVAL;
-		goto out;
-	}
-
 	file_len = ordered->num_bytes;
 	pre = start - ordered->disk_bytenr;
 	post = ordered_end - end;
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 1848d0d1a9c41e..4b46406c0c8af5 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1149,6 +1149,16 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
 
 	trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered);
 
+	/* We cannot split once end_bio'd ordered extent */
+	if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes))
+		return -EINVAL;
+	/* We cannot split a compressed ordered extent */
+	if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes))
+		return -EINVAL;
+	/* Checksum list should be empty */
+	if (WARN_ON_ONCE(!list_empty(&ordered->list)))
+		return -EINVAL;
+
 	spin_lock_irq(&tree->lock);
 	/* Remove from tree once */
 	node = &ordered->rb_node;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 05/11] btrfs: simplify btrfs_extract_ordered_extent
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2023-03-24  2:32 ` [PATCH 04/11] btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent Christoph Hellwig
@ 2023-03-24  2:32 ` Christoph Hellwig
  2023-03-24  6:07   ` Naohiro Aota
  2023-03-24  2:32 ` [PATCH 06/11] btrfs: simplify btrfs_split_ordered_extent Christoph Hellwig
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:32 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

btrfs_extract_ordered_extent must always be called for the beginning of
an ordered_extent.  Add an assert to check for that and simplify the
calculation of the split ranges for btrfs_split_ordered_extent and
split_zoned_em.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4f2f1aafd1720e..2cbc6c316effc1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2632,39 +2632,36 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 	u64 len = bbio->bio.bi_iter.bi_size;
 	struct btrfs_inode *inode = bbio->inode;
 	struct btrfs_ordered_extent *ordered;
-	u64 file_len;
-	u64 end = start + len;
-	u64 ordered_end;
-	u64 pre, post;
+	u64 ordered_len;
 	int ret = 0;
 
 	ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset);
 	if (WARN_ON_ONCE(!ordered))
 		return BLK_STS_IOERR;
+	ordered_len = ordered->num_bytes;
 
-	/* No need to split */
-	if (ordered->disk_num_bytes == len)
+	/* Must always be called for the beginning of an ordered extent. */
+	if (WARN_ON_ONCE(start != ordered->disk_bytenr)) {
+		ret = -EINVAL;
 		goto out;
+	}
 
-	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
-	/* bio must be in one ordered extent */
-	if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
+	/* The bio must be entirely covered by the ordered extent */
+	if (WARN_ON_ONCE(len > ordered_len)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	file_len = ordered->num_bytes;
-	pre = start - ordered->disk_bytenr;
-	post = ordered_end - end;
+	/* No need to split if the ordered extent covers the entire bio */
+	if (ordered->disk_num_bytes == len)
+		goto out;
 
-	ret = btrfs_split_ordered_extent(ordered, pre, post);
+	ret = btrfs_split_ordered_extent(ordered, len, 0);
 	if (ret)
 		goto out;
-	ret = split_zoned_em(inode, bbio->file_offset, file_len, pre, post);
-
+	ret = split_zoned_em(inode, bbio->file_offset, ordered_len, len, 0);
 out:
 	btrfs_put_ordered_extent(ordered);
-
 	return errno_to_blk_status(ret);
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 06/11] btrfs: simplify btrfs_split_ordered_extent
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2023-03-24  2:32 ` [PATCH 05/11] btrfs: simplify btrfs_extract_ordered_extent Christoph Hellwig
@ 2023-03-24  2:32 ` Christoph Hellwig
  2023-03-24  7:52   ` Naohiro Aota
  2023-03-24  2:32 ` [PATCH 07/11] btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent Christoph Hellwig
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:32 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

btrfs_split_ordered_extent is only ever asked to split out the beginning
of an ordered_extent.  Change it to only take a len to split out, and
switch it to allocate the new extent for the beginning, as that helps
with callers that want to keep a pointer to the ordered_extent that
it is stealing from.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c        |  8 +-------
 fs/btrfs/ordered-data.c | 28 ++++++++++++----------------
 fs/btrfs/ordered-data.h |  3 +--
 3 files changed, 14 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2cbc6c316effc1..bff23bac85f2ef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2646,17 +2646,11 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 		goto out;
 	}
 
-	/* The bio must be entirely covered by the ordered extent */
-	if (WARN_ON_ONCE(len > ordered_len)) {
-		ret = -EINVAL;
-		goto out;
-	}
-
 	/* No need to split if the ordered extent covers the entire bio */
 	if (ordered->disk_num_bytes == len)
 		goto out;
 
-	ret = btrfs_split_ordered_extent(ordered, len, 0);
+	ret = btrfs_split_ordered_extent(ordered, len);
 	if (ret)
 		goto out;
 	ret = split_zoned_em(inode, bbio->file_offset, ordered_len, len, 0);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4b46406c0c8af5..1d5971b6e68c66 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1138,17 +1138,19 @@ static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
 					ordered->compress_type);
 }
 
-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
-				u64 post)
+/* split out a new ordered extent for this first @len bytes of @ordered */
+int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 {
 	struct inode *inode = ordered->inode;
 	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
-	struct rb_node *node;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	int ret = 0;
+	struct rb_node *node;
 
 	trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered);
 
+	/* The bio must be entirely covered by the ordered extent */
+	if (WARN_ON_ONCE(len > ordered->num_bytes))
+		return -EINVAL;
 	/* We cannot split once end_bio'd ordered extent */
 	if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes))
 		return -EINVAL;
@@ -1167,11 +1169,11 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
 	if (tree->last == node)
 		tree->last = NULL;
 
-	ordered->file_offset += pre;
-	ordered->disk_bytenr += pre;
-	ordered->num_bytes -= (pre + post);
-	ordered->disk_num_bytes -= (pre + post);
-	ordered->bytes_left -= (pre + post);
+	ordered->file_offset += len;
+	ordered->disk_bytenr += len;
+	ordered->num_bytes -= len;
+	ordered->disk_num_bytes -= len;
+	ordered->bytes_left -= len;
 
 	/* Re-insert the node */
 	node = tree_insert(&tree->tree, ordered->file_offset, &ordered->rb_node);
@@ -1182,13 +1184,7 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
 
 	spin_unlock_irq(&tree->lock);
 
-	if (pre)
-		ret = clone_ordered_extent(ordered, 0, pre);
-	if (ret == 0 && post)
-		ret = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes,
-					   post);
-
-	return ret;
+	return clone_ordered_extent(ordered, 0, len);
 }
 
 int __init ordered_data_init(void)
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 18007f9c00add8..f0f1138d23c331 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -212,8 +212,7 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 					struct extent_state **cached_state);
 bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct extent_state **cached_state);
-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
-			       u64 post);
+int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 07/11] btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2023-03-24  2:32 ` [PATCH 06/11] btrfs: simplify btrfs_split_ordered_extent Christoph Hellwig
@ 2023-03-24  2:32 ` Christoph Hellwig
  2023-03-24  2:32 ` [PATCH 08/11] btrfs: simplify split_zoned_em Christoph Hellwig
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:32 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

btrfs_clone_ordered_extent is very specific to the usage in
btrfs_split_ordered_extent.  Now that only a single call to
btrfs_clone_ordered_extent is left, just fold it into
btrfs_split_ordered_extent to make the operation more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ordered-data.c | 37 ++++++++++++++-----------------------
 1 file changed, 14 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 1d5971b6e68c66..b5a86bb13115db 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1116,38 +1116,21 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 	return false;
 }
 
-
-static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
-				u64 len)
-{
-	struct inode *inode = ordered->inode;
-	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
-	u64 file_offset = ordered->file_offset + pos;
-	u64 disk_bytenr = ordered->disk_bytenr + pos;
-	unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS;
-
-	/*
-	 * The splitting extent is already counted and will be added again in
-	 * btrfs_add_ordered_extent_*(). Subtract len to avoid double counting.
-	 */
-	percpu_counter_add_batch(&fs_info->ordered_bytes, -len,
-				 fs_info->delalloc_batch);
-	WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED));
-	return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len,
-					disk_bytenr, len, 0, flags,
-					ordered->compress_type);
-}
-
 /* split out a new ordered extent for this first @len bytes of @ordered */
 int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 {
 	struct inode *inode = ordered->inode;
 	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u64 file_offset = ordered->file_offset;
+	u64 disk_bytenr = ordered->disk_bytenr;
+	unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS;
 	struct rb_node *node;
 
 	trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered);
 
+	ASSERT(!(flags & (1 << BTRFS_ORDERED_COMPRESSED)));
+
 	/* The bio must be entirely covered by the ordered extent */
 	if (WARN_ON_ONCE(len > ordered->num_bytes))
 		return -EINVAL;
@@ -1184,7 +1167,15 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
 
 	spin_unlock_irq(&tree->lock);
 
-	return clone_ordered_extent(ordered, 0, len);
+	/*
+	 * The splitting extent is already counted and will be added again in
+	 * btrfs_add_ordered_extent(). Subtract len to avoid double counting.
+	 */
+	percpu_counter_add_batch(&fs_info->ordered_bytes, -len,
+				 fs_info->delalloc_batch);
+	return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len,
+					disk_bytenr, len, 0, flags,
+					ordered->compress_type);
 }
 
 int __init ordered_data_init(void)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 08/11] btrfs: simplify split_zoned_em
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2023-03-24  2:32 ` [PATCH 07/11] btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent Christoph Hellwig
@ 2023-03-24  2:32 ` Christoph Hellwig
  2023-03-24  2:32 ` [PATCH 09/11] btrfs: pass an ordered_extent to btrfs_extract_ordered_extent Christoph Hellwig
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:32 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

split_zoned_em is only ever asked to split out the beginning of an extent
map.  Change it to only take a len to split out instead of a pre and post
region.

Also rename the function to split_extent_map as there is nothing zoned
device specific about it.

Note: this function should probably move to extent_map.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 78 +++++++++++++++++-------------------------------
 1 file changed, 27 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bff23bac85f2ef..5b3d6403363814 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2512,37 +2512,32 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 }
 
 /*
- * Split an extent_map at [start, start + len]
+ * Split off the first pre bytes from the extent_map at [start, start + len]
  *
  * This function is intended to be used only for extract_ordered_extent().
  */
-static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
-			  u64 pre, u64 post)
+static int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len,
+			    u64 pre)
 {
 	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	struct extent_map *split_pre = NULL;
 	struct extent_map *split_mid = NULL;
-	struct extent_map *split_post = NULL;
 	int ret = 0;
 	unsigned long flags;
 
-	/* Sanity check */
-	if (pre == 0 && post == 0)
-		return 0;
+	ASSERT(pre != 0);
+	ASSERT(pre < len);
 
 	split_pre = alloc_extent_map();
-	if (pre)
-		split_mid = alloc_extent_map();
-	if (post)
-		split_post = alloc_extent_map();
-	if (!split_pre || (pre && !split_mid) || (post && !split_post)) {
+	if (!split_pre)
+		return -ENOMEM;
+	split_mid = alloc_extent_map();
+	if (!split_mid) {
 		ret = -ENOMEM;
-		goto out;
+		goto out_free_pre;
 	}
 
-	ASSERT(pre + post < len);
-
 	lock_extent(&inode->io_tree, start, start + len - 1, NULL);
 	write_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, start, len);
@@ -2563,7 +2558,7 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
 
 	/* First, replace the em with a new extent_map starting from * em->start */
 	split_pre->start = em->start;
-	split_pre->len = (pre ? pre : em->len - post);
+	split_pre->len = pre;
 	split_pre->orig_start = split_pre->start;
 	split_pre->block_start = em->block_start;
 	split_pre->block_len = split_pre->len;
@@ -2577,38 +2572,21 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
 
 	/*
 	 * Now we only have an extent_map at:
-	 *     [em->start, em->start + pre] if pre != 0
-	 *     [em->start, em->start + em->len - post] if pre == 0
+	 *     [em->start, em->start + pre]
 	 */
 
-	if (pre) {
-		/* Insert the middle extent_map */
-		split_mid->start = em->start + pre;
-		split_mid->len = em->len - pre - post;
-		split_mid->orig_start = split_mid->start;
-		split_mid->block_start = em->block_start + pre;
-		split_mid->block_len = split_mid->len;
-		split_mid->orig_block_len = split_mid->block_len;
-		split_mid->ram_bytes = split_mid->len;
-		split_mid->flags = flags;
-		split_mid->compress_type = em->compress_type;
-		split_mid->generation = em->generation;
-		add_extent_mapping(em_tree, split_mid, 1);
-	}
-
-	if (post) {
-		split_post->start = em->start + em->len - post;
-		split_post->len = post;
-		split_post->orig_start = split_post->start;
-		split_post->block_start = em->block_start + em->len - post;
-		split_post->block_len = split_post->len;
-		split_post->orig_block_len = split_post->block_len;
-		split_post->ram_bytes = split_post->len;
-		split_post->flags = flags;
-		split_post->compress_type = em->compress_type;
-		split_post->generation = em->generation;
-		add_extent_mapping(em_tree, split_post, 1);
-	}
+	/* Insert the middle extent_map */
+	split_mid->start = em->start + pre;
+	split_mid->len = em->len - pre;
+	split_mid->orig_start = split_mid->start;
+	split_mid->block_start = em->block_start + pre;
+	split_mid->block_len = split_mid->len;
+	split_mid->orig_block_len = split_mid->block_len;
+	split_mid->ram_bytes = split_mid->len;
+	split_mid->flags = flags;
+	split_mid->compress_type = em->compress_type;
+	split_mid->generation = em->generation;
+	add_extent_mapping(em_tree, split_mid, 1);
 
 	/* Once for us */
 	free_extent_map(em);
@@ -2618,11 +2596,9 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
 out_unlock:
 	write_unlock(&em_tree->lock);
 	unlock_extent(&inode->io_tree, start, start + len - 1, NULL);
-out:
-	free_extent_map(split_pre);
 	free_extent_map(split_mid);
-	free_extent_map(split_post);
-
+out_free_pre:
+	free_extent_map(split_pre);
 	return ret;
 }
 
@@ -2653,7 +2629,7 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 	ret = btrfs_split_ordered_extent(ordered, len);
 	if (ret)
 		goto out;
-	ret = split_zoned_em(inode, bbio->file_offset, ordered_len, len, 0);
+	ret = split_extent_map(inode, bbio->file_offset, ordered_len, len);
 out:
 	btrfs_put_ordered_extent(ordered);
 	return errno_to_blk_status(ret);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 09/11] btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2023-03-24  2:32 ` [PATCH 08/11] btrfs: simplify split_zoned_em Christoph Hellwig
@ 2023-03-24  2:32 ` Christoph Hellwig
  2023-03-24  2:32 ` [PATCH 10/11] btrfs: don't split nocow extent_maps in btrfs_extract_ordered_extent Christoph Hellwig
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:32 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

To prepare for a new caller that already has the ordered_extent
available, change btrfs_extract_ordered_extent to take an argument
for it.  Add a wrapper for the bio case that still has to do the
lookup (for now).

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/bio.c         | 16 +++++++++++++++-
 fs/btrfs/btrfs_inode.h |  3 ++-
 fs/btrfs/inode.c       | 26 ++++++++------------------
 3 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index cf09c6271edbee..1bb6a45edc2354 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -61,6 +61,20 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
 	return bbio;
 }
 
+static blk_status_t btrfs_bio_extract_ordered_extent(struct btrfs_bio *bbio)
+{
+	struct btrfs_ordered_extent *ordered;
+	int ret;
+
+	ordered = btrfs_lookup_ordered_extent(bbio->inode, bbio->file_offset);
+	if (WARN_ON_ONCE(!ordered))
+		return BLK_STS_IOERR;
+	ret = btrfs_extract_ordered_extent(bbio, ordered);
+	btrfs_put_ordered_extent(ordered);
+
+	return errno_to_blk_status(ret);
+}
+
 static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
 					 struct btrfs_bio *orig_bbio,
 					 u64 map_length, bool use_append)
@@ -653,7 +667,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
 		if (use_append) {
 			bio->bi_opf &= ~REQ_OP_WRITE;
 			bio->bi_opf |= REQ_OP_ZONE_APPEND;
-			ret = btrfs_extract_ordered_extent(bbio);
+			ret = btrfs_bio_extract_ordered_extent(bbio);
 			if (ret)
 				goto fail_put_bio;
 		}
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9dc21622806ef4..bb498448066981 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -407,7 +407,8 @@ static inline void btrfs_inode_split_flags(u64 inode_item_flags,
 
 int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
 			    u32 pgoff, u8 *csum, const u8 * const csum_expected);
-blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio);
+int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
+				 struct btrfs_ordered_extent *ordered);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5b3d6403363814..2abbedfd31c2c2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2602,37 +2602,27 @@ static int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len,
 	return ret;
 }
 
-blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
+int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
+				 struct btrfs_ordered_extent *ordered)
 {
 	u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 len = bbio->bio.bi_iter.bi_size;
 	struct btrfs_inode *inode = bbio->inode;
-	struct btrfs_ordered_extent *ordered;
-	u64 ordered_len;
+	u64 ordered_len = ordered->num_bytes;
 	int ret = 0;
 
-	ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset);
-	if (WARN_ON_ONCE(!ordered))
-		return BLK_STS_IOERR;
-	ordered_len = ordered->num_bytes;
-
 	/* Must always be called for the beginning of an ordered extent. */
-	if (WARN_ON_ONCE(start != ordered->disk_bytenr)) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (WARN_ON_ONCE(start != ordered->disk_bytenr))
+		return -EINVAL;
 
 	/* No need to split if the ordered extent covers the entire bio */
 	if (ordered->disk_num_bytes == len)
-		goto out;
+		return 0;
 
 	ret = btrfs_split_ordered_extent(ordered, len);
 	if (ret)
-		goto out;
-	ret = split_extent_map(inode, bbio->file_offset, ordered_len, len);
-out:
-	btrfs_put_ordered_extent(ordered);
-	return errno_to_blk_status(ret);
+		return ret;
+	return split_extent_map(inode, bbio->file_offset, ordered_len, len);
 }
 
 /*
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 10/11] btrfs: don't split nocow extent_maps in btrfs_extract_ordered_extent
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2023-03-24  2:32 ` [PATCH 09/11] btrfs: pass an ordered_extent to btrfs_extract_ordered_extent Christoph Hellwig
@ 2023-03-24  2:32 ` Christoph Hellwig
  2023-03-24  2:32 ` [PATCH 11/11] btrfs: split partial dio bios before submit Christoph Hellwig
  2023-03-24 13:10 ` btrfs: fix corruption caused by partial dio writes v6 Johannes Thumshirn
  11 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:32 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

From: Boris Burkov <boris@bur.io>

Nocow writes just overwrite an existing extent map, which thus should
not be split in btrfs_extract_ordered_extent.  The nocow case can't
currently happen as btrfs_extract_ordered_extent is only used on zoned
devices that do not support nocow writes, but this will change soon.

Signed-off-by: Boris Burkov <boris@bur.io>
[hch: split from a larger patch, wrote a commit log]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2abbedfd31c2c2..f72ba14dcda55e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2622,6 +2622,14 @@ int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
 	ret = btrfs_split_ordered_extent(ordered, len);
 	if (ret)
 		return ret;
+
+	/*
+	 * Don't split the extent_map for nocow extents, as we're writing
+	 * into a pre-existing one.
+	 */
+	if (test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
+		return 0;
+
 	return split_extent_map(inode, bbio->file_offset, ordered_len, len);
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 11/11] btrfs: split partial dio bios before submit
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2023-03-24  2:32 ` [PATCH 10/11] btrfs: don't split nocow extent_maps in btrfs_extract_ordered_extent Christoph Hellwig
@ 2023-03-24  2:32 ` Christoph Hellwig
  2023-03-24 13:10 ` btrfs: fix corruption caused by partial dio writes v6 Johannes Thumshirn
  11 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-24  2:32 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Johannes Thumshirn, Naohiro Aota, linux-btrfs

From: Boris Burkov <boris@bur.io>

If an application is doing direct io to a btrfs file and experiences a
page fault reading from the write buffer, iomap will issue a partial
bio, and allow the fs to keep going. However, there was a subtle bug in
this codepath in the btrfs dio iomap implementation that led to the
partial write ending up as a gap in the file's extents and to be read
back as zeros.

The sequence of events in a partial write, lightly summarized and
trimmed down for brevity is as follows:

====WRITING TASK====
btrfs_direct_write
__iomap_dio_write
iomap_iter
btrfs_dio_iomap_begin # create full ordered extent
iomap_dio_bio_iter
bio_iov_iter_get_pages # page fault; partial read
submit_bio # partial bio
iomap_iter
btrfs_dio_iomap_end
btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR;
			       # submit to finish_ordered_fn wq
fault_in_iov_iter_readable # btrfs_direct_write detects partial write
__iomap_dio_write
iomap_iter
btrfs_dio_iomap_begin # create second partial ordered extent
iomap_dio_bio_iter
bio_iov_iter_get_pages # read all of remainder
submit_bio # partial bio with all of remainder
iomap_iter
btrfs_dio_iomap_end # nothing exciting to do with ordered io

====DIO ENDIO====
==FIRST PARTIAL BIO==
btrfs_dio_end_io
btrfs_mark_ordered_io_finished # bytes_left > 0
			     # don't submit to finish_ordered_fn wq
==SECOND PARTIAL BIO==
btrfs_dio_end_io
btrfs_mark_ordered_io_finished # bytes_left == 0
			     # submit to finish_ordered_fn wq

====BTRFS FINISH ORDERED WQ====
==FIRST PARTIAL BIO==
btrfs_finish_ordered_io # called by dio_iomap_end_io, sees
		    # BTRFS_ORDERED_IOERR, just drops the
		    # ordered_extent
==SECOND PARTIAL BIO==
btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file
		    # extents, csums, etc...

The essence of the problem is that while btrfs_direct_write and iomap
properly interact to submit all the correct bios, there is insufficient
logic in the btrfs dio functions (btrfs_dio_iomap_begin,
btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to
ensure that every bio is at least a part of a completed ordered_extent.
And it is completing an ordered_extent that results in crucial
functionality like writing out a file extent for the range.

More specifically, btrfs_dio_end_io treats the ordered extent as
unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it.
Thus, the finish io work doesn't result in file extents, csums, etc...
In the aftermath, such a file behaves as though it has a hole in it,
instead of the purportedly written data.

We considered a few options for fixing the bug (apologies for any
incorrect summary of a proposal which I didn't implement and fully
understand):
1. treat the partial bio as if we had truncated the file, which would
result in properly finishing it.
2. split the ordered extent when submitting a partial bio.
3. cache the ordered extent across calls to __iomap_dio_rw in
iter->private, so that we could reuse it and correctly apply several
bios to it.

I had trouble with 1, and it felt the most like a hack, so I tried 2
and 3. Since 3 has the benefit of also not creating an extra file
extent, and avoids an ordered extent lookup during bio submission, it
felt like the best option. However, that turned out to re-introduce a
deadlock which this code discarding the ordered_extent between faults
was meant to fix in the first place. (Link to an explanation of the
deadlock below)

Therefore, go with fix #2, which requires a bit more setup work but
fixes the corruption without introducing the deadlock, which is
fundamentally caused by the ordered extent existing when we attempt to
fault in a range that overlaps with it.

Put succinctly, what this patch does is: when we submit a dio bio, check
if it is partial against the ordered extent stored in dio_data, and if it
is, extract the ordered_extent that matches the bio exactly out of the
larger ordered_extent. Keep the remaining ordered_extent around in dio_data
for cancellation in iomap_end.

Thanks to Josef, Christoph, and Filipe with their help figuring out the
bug and the fix.

Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947
Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/
Link: https://pastebin.com/3SDaH8C6
Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t
Signed-off-by: Boris Burkov <boris@bur.io>
[hch: refactored the ordered_extent extraction]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f72ba14dcda55e..017696ebe3fb65 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7716,6 +7716,24 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	dip->bytes = bio->bi_iter.bi_size;
 
 	dio_data->submitted += bio->bi_iter.bi_size;
+
+	/*
+	 * Check if we are doing a partial write.  If we are, we need to split
+	 * the ordered extent to match the submitted bio.  Hang on to the
+	 * remaining unfinishable ordered_extent in dio_data so that it can be
+	 * cancelled in iomap_end to avoid a deadlock wherein faulting the
+	 * remaining pages is blocked on the outstanding ordered extent.
+	 */
+	if (iter->flags & IOMAP_WRITE) {
+		int err;
+
+		err = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
+		if (err) {
+			btrfs_bio_end_io(bbio, errno_to_blk_status(err));
+			return;
+		}
+	}
+
 	btrfs_submit_bio(bbio, 0);
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 02/11] btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
  2023-03-24  2:31 ` [PATCH 02/11] btrfs: pass flags as unsigned long to btrfs_add_ordered_extent Christoph Hellwig
@ 2023-03-24  4:53   ` Naohiro Aota
  0 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2023-03-24  4:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Boris Burkov,
	Johannes Thumshirn, linux-btrfs

On Fri, Mar 24, 2023 at 10:31:58AM +0800, Christoph Hellwig wrote:
> From: Boris Burkov <boris@bur.io>
> 
> The ordered_extent flags are declared as unsigned long, so pass them as
> such to btrfs_add_ordered_extent.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> [hch: split from a larger patch]
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good,
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 01/11] btrfs: add function to create and return an ordered extent
  2023-03-24  2:31 ` [PATCH 01/11] btrfs: add function to create and return an ordered extent Christoph Hellwig
@ 2023-03-24  5:47   ` Naohiro Aota
  2023-03-25  8:22     ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Naohiro Aota @ 2023-03-24  5:47 UTC (permalink / raw)
  To: Christoph Hellwig, Boris Burkov
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn,
	linux-btrfs, Filipe Manana

On Fri, Mar 24, 2023 at 10:31:57AM +0800, Christoph Hellwig wrote:
> From: Boris Burkov <boris@bur.io>
> 
> Currently, btrfs_add_ordered_extent allocates a new ordered extent, adds
> it to the rb_tree, but doesn't return a referenced pointer to the
> caller. There are cases where it is useful for the creator of a new
> ordered_extent to hang on to such a pointer, so add a new function
> btrfs_alloc_ordered_extent which is the same as
> btrfs_add_ordered_extent, except it takes an additional reference count
> and returns a pointer to the ordered_extent. Implement
> btrfs_add_ordered_extent as btrfs_alloc_ordered_extent followed by
> dropping the new reference and handling the IS_ERR case.
> 
> The type of flags in btrfs_alloc_ordered_extent and
> btrfs_add_ordered_extent is changed from unsigned int to unsigned long
> so it's unified with the other ordered extent functions.
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Boris Burkov <boris@bur.io>
> Signed-off-by: David Sterba <dsterba@suse.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/ordered-data.c | 46 +++++++++++++++++++++++++++++++++--------
>  fs/btrfs/ordered-data.h |  5 +++++
>  2 files changed, 42 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 6c24b69e2d0a37..83a51c692406ab 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -160,14 +160,16 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
>   * @compress_type:   Compression algorithm used for data.
>   *
>   * Most of these parameters correspond to &struct btrfs_file_extent_item. The
> - * tree is given a single reference on the ordered extent that was inserted.
> + * tree is given a single reference on the ordered extent that was inserted, and
> + * the returned pointer is given a second reference.
>   *
> - * Return: 0 or -ENOMEM.
> + * Return: the new ordered extent or ERR_PTR(-ENOMEM).
>   */
> -int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
> -			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
> -			     u64 disk_num_bytes, u64 offset, unsigned flags,
> -			     int compress_type)
> +struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
> +			struct btrfs_inode *inode, u64 file_offset,
> +			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
> +			u64 disk_num_bytes, u64 offset, unsigned long flags,
> +			int compress_type)

I'm sorry to write a comment on this late, but isn't the function name
confusing?  As I suggested a function only to allocate and initialize a
btrfs_ordered_extent in the previous mail [1], I first thought this
function is something like that.

[1] https://lore.kernel.org/linux-btrfs/20230323083608.m2ut2whbk2smjjpu@naota-xeon/

But, both btrfs_alloc_ordered_extent() and btrfs_add_ordered_extent() "add
an ordered extent to the per-inode tree." The difference is that
btrfs_alloc_ordered_extent() returns the created ordered extent to the
caller taking a reference for them...

However, I can't think of a different name, so it might be OK...

Other than that,

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

>  {
>  	struct btrfs_root *root = inode->root;
>  	struct btrfs_fs_info *fs_info = root->fs_info;
> @@ -181,7 +183,7 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
>  		/* For nocow write, we can release the qgroup rsv right now */
>  		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
>  		if (ret < 0)
> -			return ret;
> +			return ERR_PTR(ret);
>  		ret = 0;
>  	} else {
>  		/*
> @@ -190,11 +192,11 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
>  		 */
>  		ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes);
>  		if (ret < 0)
> -			return ret;
> +			return ERR_PTR(ret);
>  	}
>  	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
>  	if (!entry)
> -		return -ENOMEM;
> +		return ERR_PTR(-ENOMEM);
>  
>  	entry->file_offset = file_offset;
>  	entry->num_bytes = num_bytes;
> @@ -256,6 +258,32 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
>  	btrfs_mod_outstanding_extents(inode, 1);
>  	spin_unlock(&inode->lock);
>  
> +	/* One ref for the returned entry to match semantics of lookup. */
> +	refcount_inc(&entry->refs);
> +
> +	return entry;
> +}
> +
> +/*
> + * Add a new btrfs_ordered_extent for the range, but drop the reference instead
> + * of returning it to the caller.
> + */
> +int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
> +			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
> +			     u64 disk_num_bytes, u64 offset, unsigned flags,
> +			     int compress_type)
> +{
> +	struct btrfs_ordered_extent *ordered;
> +
> +	ordered = btrfs_alloc_ordered_extent(inode, file_offset, num_bytes,
> +					     ram_bytes, disk_bytenr,
> +					     disk_num_bytes, offset, flags,
> +					     compress_type);
> +
> +	if (IS_ERR(ordered))
> +		return PTR_ERR(ordered);
> +	btrfs_put_ordered_extent(ordered);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index eb40cb39f842e6..c00a5a3f060fa2 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -178,6 +178,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>  bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
>  				    struct btrfs_ordered_extent **cached,
>  				    u64 file_offset, u64 io_size);
> +struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
> +			struct btrfs_inode *inode, u64 file_offset,
> +			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
> +			u64 disk_num_bytes, u64 offset, unsigned long flags,
> +			int compress_type);
>  int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
>  			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
>  			     u64 disk_num_bytes, u64 offset, unsigned flags,
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/11] btrfs: simplify btrfs_extract_ordered_extent
  2023-03-24  2:32 ` [PATCH 05/11] btrfs: simplify btrfs_extract_ordered_extent Christoph Hellwig
@ 2023-03-24  6:07   ` Naohiro Aota
  2023-03-25  8:34     ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Naohiro Aota @ 2023-03-24  6:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Boris Burkov,
	Johannes Thumshirn, linux-btrfs

On Fri, Mar 24, 2023 at 10:32:01AM +0800, Christoph Hellwig wrote:
> btrfs_extract_ordered_extent must always be called for the beginning of
> an ordered_extent.  Add an assert to check for that and simplify the
> calculation of the split ranges for btrfs_split_ordered_extent and
> split_zoned_em.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/inode.c | 29 +++++++++++++----------------
>  1 file changed, 13 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 4f2f1aafd1720e..2cbc6c316effc1 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2632,39 +2632,36 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
>  	u64 len = bbio->bio.bi_iter.bi_size;
>  	struct btrfs_inode *inode = bbio->inode;
>  	struct btrfs_ordered_extent *ordered;
> -	u64 file_len;
> -	u64 end = start + len;
> -	u64 ordered_end;
> -	u64 pre, post;
> +	u64 ordered_len;
>  	int ret = 0;
>  
>  	ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset);
>  	if (WARN_ON_ONCE(!ordered))
>  		return BLK_STS_IOERR;
> +	ordered_len = ordered->num_bytes;
>  
> -	/* No need to split */
> -	if (ordered->disk_num_bytes == len)
> +	/* Must always be called for the beginning of an ordered extent. */
> +	if (WARN_ON_ONCE(start != ordered->disk_bytenr)) {
> +		ret = -EINVAL;
>  		goto out;
> +	}
>  
> -	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
> -	/* bio must be in one ordered extent */
> -	if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
> +	/* The bio must be entirely covered by the ordered extent */
> +	if (WARN_ON_ONCE(len > ordered_len)) {
>  		ret = -EINVAL;
>  		goto out;
>  	}
>  
> -	file_len = ordered->num_bytes;
> -	pre = start - ordered->disk_bytenr;
> -	post = ordered_end - end;
> +	/* No need to split if the ordered extent covers the entire bio */
> +	if (ordered->disk_num_bytes == len)
> +		goto out;

I thought this can go up to catch the non-split case early, but the above
check will be dropped in the following patch, so it's OK.

> -	ret = btrfs_split_ordered_extent(ordered, pre, post);
> +	ret = btrfs_split_ordered_extent(ordered, len, 0);

The next patch will overwrite this anyway, but the pre and post are the
length of new ordered extents which are not covered by the bio. So, giving
them (len, 0) breaks the semantics so that a new ordered extent is
allocated for the bio range. They should be (0, ordered_len - len).

>  	if (ret)
>  		goto out;
> -	ret = split_zoned_em(inode, bbio->file_offset, file_len, pre, post);
> -
> +	ret = split_zoned_em(inode, bbio->file_offset, ordered_len, len, 0);

Same here.

But, it is harmless unless a bisecting will fall into these commits.

>  out:
>  	btrfs_put_ordered_extent(ordered);
> -
>  	return errno_to_blk_status(ret);
>  }
>  
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 03/11] btrfs: stash ordered extent in dio_data during iomap dio
  2023-03-24  2:31 ` [PATCH 03/11] btrfs: stash ordered extent in dio_data during iomap dio Christoph Hellwig
@ 2023-03-24  7:41   ` Naohiro Aota
  0 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2023-03-24  7:41 UTC (permalink / raw)
  To: Boris Burkov, Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Johannes Thumshirn, linux-btrfs

On Fri, Mar 24, 2023 at 10:31:59AM +0800, Christoph Hellwig wrote:
> From: Boris Burkov <boris@bur.io>
> 
> While it is not feasible for an ordered extent to survive across the
> calls btrfs_direct_write makes into __iomap_dio_rw, it is still helpful
> to stash it on the dio_data in between creating it in iomap_begin and
> finishing it in either end_io or iomap_end.
> 
> The specific use I have in mind is that we can check if a partcular bio
> is partial in submit_io without unconditionally looking up the ordered
> extent. This is a preparatory patch for a later patch which does just
> that.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/inode.c | 37 ++++++++++++++++++++++++-------------
>  1 file changed, 24 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 865d56ff2ce150..a92f482e898950 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -81,6 +81,7 @@ struct btrfs_dio_data {
>  	struct extent_changeset *data_reserved;
>  	bool data_space_reserved;
>  	bool nocow_done;
> +	struct btrfs_ordered_extent *ordered;
>  };
>  
>  struct btrfs_dio_private {
> @@ -6965,6 +6966,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>  }
>  
>  static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
> +						  struct btrfs_dio_data *dio_data,
>  						  const u64 start,
>  						  const u64 len,
>  						  const u64 orig_start,
> @@ -6975,7 +6977,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>  						  const int type)
>  {
>  	struct extent_map *em = NULL;
> -	int ret;
> +	struct btrfs_ordered_extent *ordered;
>  
>  	if (type != BTRFS_ORDERED_NOCOW) {
>  		em = create_io_em(inode, start, len, orig_start, block_start,
> @@ -6985,18 +6987,21 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>  		if (IS_ERR(em))
>  			goto out;
>  	}
> -	ret = btrfs_add_ordered_extent(inode, start, len, len, block_start,
> -				       block_len, 0,
> -				       (1 << type) |
> -				       (1 << BTRFS_ORDERED_DIRECT),
> -				       BTRFS_COMPRESS_NONE);
> -	if (ret) {
> +	ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
> +					     block_start, block_len, 0,
> +					     (1 << type) |
> +					     (1 << BTRFS_ORDERED_DIRECT),
> +					     BTRFS_COMPRESS_NONE);
> +	if (IS_ERR(ordered)) {
>  		if (em) {
>  			free_extent_map(em);
>  			btrfs_drop_extent_map_range(inode, start,
>  						    start + len - 1, false);
>  		}
> -		em = ERR_PTR(ret);
> +		em = ERR_PTR(PTR_ERR(ordered));

We can use "em = ERR_CAST(ordered);" here ?

> +	} else {
> +		ASSERT(!dio_data->ordered);
> +		dio_data->ordered = ordered;
>  	}
>   out:
>  

Other than that,

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 06/11] btrfs: simplify btrfs_split_ordered_extent
  2023-03-24  2:32 ` [PATCH 06/11] btrfs: simplify btrfs_split_ordered_extent Christoph Hellwig
@ 2023-03-24  7:52   ` Naohiro Aota
  2023-03-25  8:37     ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Naohiro Aota @ 2023-03-24  7:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Josef Bacik, David Sterba, Boris Burkov,
	Johannes Thumshirn, linux-btrfs

On Fri, Mar 24, 2023 at 10:32:02AM +0800, Christoph Hellwig wrote:
> btrfs_split_ordered_extent is only ever asked to split out the beginning
> of an ordered_extent.  Change it to only take a len to split out, and
> switch it to allocate the new extent for the beginning, as that helps
> with callers that want to keep a pointer to the ordered_extent that
> it is stealing from.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/inode.c        |  8 +-------
>  fs/btrfs/ordered-data.c | 28 ++++++++++++----------------
>  fs/btrfs/ordered-data.h |  3 +--
>  3 files changed, 14 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2cbc6c316effc1..bff23bac85f2ef 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2646,17 +2646,11 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
>  		goto out;
>  	}
>  
> -	/* The bio must be entirely covered by the ordered extent */
> -	if (WARN_ON_ONCE(len > ordered_len)) {
> -		ret = -EINVAL;
> -		goto out;
> -	}
> -
>  	/* No need to split if the ordered extent covers the entire bio */
>  	if (ordered->disk_num_bytes == len)
>  		goto out;
>  
> -	ret = btrfs_split_ordered_extent(ordered, len, 0);
> +	ret = btrfs_split_ordered_extent(ordered, len);
>  	if (ret)
>  		goto out;
>  	ret = split_zoned_em(inode, bbio->file_offset, ordered_len, len, 0);
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 4b46406c0c8af5..1d5971b6e68c66 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -1138,17 +1138,19 @@ static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
>  					ordered->compress_type);
>  }
>  
> -int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
> -				u64 post)
> +/* split out a new ordered extent for this first @len bytes of @ordered */
> +int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
>  {
>  	struct inode *inode = ordered->inode;
>  	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
> -	struct rb_node *node;
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	int ret = 0;
> +	struct rb_node *node;
>  
>  	trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered);
>  
> +	/* The bio must be entirely covered by the ordered extent */
> +	if (WARN_ON_ONCE(len > ordered->num_bytes))
> +		return -EINVAL;

Can we also reject "len == ordered->num_bytes" case here? So, we will never
modify the ordered extent to have
ordered->{num_bytes,disk_num_bytes,bytes_left} == 0.

Either way,

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

>  	/* We cannot split once end_bio'd ordered extent */
>  	if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes))
>  		return -EINVAL;
> @@ -1167,11 +1169,11 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
>  	if (tree->last == node)
>  		tree->last = NULL;
>  
> -	ordered->file_offset += pre;
> -	ordered->disk_bytenr += pre;
> -	ordered->num_bytes -= (pre + post);
> -	ordered->disk_num_bytes -= (pre + post);
> -	ordered->bytes_left -= (pre + post);
> +	ordered->file_offset += len;
> +	ordered->disk_bytenr += len;
> +	ordered->num_bytes -= len;
> +	ordered->disk_num_bytes -= len;
> +	ordered->bytes_left -= len;
>  
>  	/* Re-insert the node */
>  	node = tree_insert(&tree->tree, ordered->file_offset, &ordered->rb_node);
> @@ -1182,13 +1184,7 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
>  
>  	spin_unlock_irq(&tree->lock);
>  
> -	if (pre)
> -		ret = clone_ordered_extent(ordered, 0, pre);
> -	if (ret == 0 && post)
> -		ret = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes,
> -					   post);
> -
> -	return ret;
> +	return clone_ordered_extent(ordered, 0, len);
>  }
>  
>  int __init ordered_data_init(void)
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 18007f9c00add8..f0f1138d23c331 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -212,8 +212,7 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
>  					struct extent_state **cached_state);
>  bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
>  				  struct extent_state **cached_state);
> -int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
> -			       u64 post);
> +int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len);
>  int __init ordered_data_init(void);
>  void __cold ordered_data_exit(void);
>  
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: btrfs: fix corruption caused by partial dio writes v6
  2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2023-03-24  2:32 ` [PATCH 11/11] btrfs: split partial dio bios before submit Christoph Hellwig
@ 2023-03-24 13:10 ` Johannes Thumshirn
  2023-03-24 16:36   ` Johannes Thumshirn
  11 siblings, 1 reply; 22+ messages in thread
From: Johannes Thumshirn @ 2023-03-24 13:10 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Naohiro Aota, linux-btrfs

On 24.03.23 03:32, Christoph Hellwig wrote:
> [this is a resend of the series from Boris, with my changes to avoid
>  the three-way split in btrfs_extract_ordered_extent inserted in the
>  middle]
> 
> The final patch in this series ensures that bios submitted by btrfs dio
> match the corresponding ordered_extent and extent_map exactly. As a
> result, there is no hole or deadlock as a result of partial writes, even
> if the write buffer is a partly overlapping mapping of the range being
> written to.
> 
> This required a bit of refactoring and setup. Specifically, the zoned
> device code for "extracting" an ordered extent matching a bio could be
> reused with some refactoring to return the new ordered extents after the
> split.
> 
> Changes since v5:
>  - avoid three-way splits in btrfs_extract_ordered_extent
> 

I've thrown it into my zoned test setup. I'll report back once it's
finished.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: btrfs: fix corruption caused by partial dio writes v6
  2023-03-24 13:10 ` btrfs: fix corruption caused by partial dio writes v6 Johannes Thumshirn
@ 2023-03-24 16:36   ` Johannes Thumshirn
  0 siblings, 0 replies; 22+ messages in thread
From: Johannes Thumshirn @ 2023-03-24 16:36 UTC (permalink / raw)
  To: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba
  Cc: Boris Burkov, Naohiro Aota, linux-btrfs

On 24.03.23 14:10, Johannes Thumshirn wrote:
> On 24.03.23 03:32, Christoph Hellwig wrote:
>> [this is a resend of the series from Boris, with my changes to avoid
>>  the three-way split in btrfs_extract_ordered_extent inserted in the
>>  middle]
>>
>> The final patch in this series ensures that bios submitted by btrfs dio
>> match the corresponding ordered_extent and extent_map exactly. As a
>> result, there is no hole or deadlock as a result of partial writes, even
>> if the write buffer is a partly overlapping mapping of the range being
>> written to.
>>
>> This required a bit of refactoring and setup. Specifically, the zoned
>> device code for "extracting" an ordered extent matching a bio could be
>> reused with some refactoring to return the new ordered extents after the
>> split.
>>
>> Changes since v5:
>>  - avoid three-way splits in btrfs_extract_ordered_extent
>>
> 
> I've thrown it into my zoned test setup. I'll report back once it's
> finished.
> 

The test results look reasonable.
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 01/11] btrfs: add function to create and return an ordered extent
  2023-03-24  5:47   ` Naohiro Aota
@ 2023-03-25  8:22     ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-25  8:22 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Christoph Hellwig, Boris Burkov, Chris Mason, Josef Bacik,
	David Sterba, Johannes Thumshirn, linux-btrfs, Filipe Manana

On Fri, Mar 24, 2023 at 05:47:18AM +0000, Naohiro Aota wrote:
> I'm sorry to write a comment on this late, but isn't the function name
> confusing?  As I suggested a function only to allocate and initialize a
> btrfs_ordered_extent in the previous mail [1], I first thought this
> function is something like that.
> 
> [1] https://lore.kernel.org/linux-btrfs/20230323083608.m2ut2whbk2smjjpu@naota-xeon/
> 
> But, both btrfs_alloc_ordered_extent() and btrfs_add_ordered_extent() "add
> an ordered extent to the per-inode tree." The difference is that
> btrfs_alloc_ordered_extent() returns the created ordered extent to the
> caller taking a reference for them...
> 
> However, I can't think of a different name, so it might be OK...

I can't really think of a better name either. 

That being said, splitting out the accounting and having a version
that doesn't do it would be really useful for the split case.
And in the very long run I hope I can kill off the ordered_extent
tree entirely, but that's just futurism for now :)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/11] btrfs: simplify btrfs_extract_ordered_extent
  2023-03-24  6:07   ` Naohiro Aota
@ 2023-03-25  8:34     ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-25  8:34 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Boris Burkov, Johannes Thumshirn, linux-btrfs

On Fri, Mar 24, 2023 at 06:07:26AM +0000, Naohiro Aota wrote:
> > -	ret = btrfs_split_ordered_extent(ordered, pre, post);
> > +	ret = btrfs_split_ordered_extent(ordered, len, 0);
> 
> The next patch will overwrite this anyway, but the pre and post are the
> length of new ordered extents which are not covered by the bio. So, giving
> them (len, 0) breaks the semantics so that a new ordered extent is
> allocated for the bio range. They should be (0, ordered_len - len).

Hmm, nothing trips up in my ZNS tests even at this bisection point,
and I don't think it should matter at this stage what part is split
off.  Later we rely on the front being split off and a new
ordered_extent being allocated for the front range covering the
bio.  So maybe this just need to be documented in the commit log?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 06/11] btrfs: simplify btrfs_split_ordered_extent
  2023-03-24  7:52   ` Naohiro Aota
@ 2023-03-25  8:37     ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2023-03-25  8:37 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	Boris Burkov, Johannes Thumshirn, linux-btrfs

On Fri, Mar 24, 2023 at 07:52:49AM +0000, Naohiro Aota wrote:
> > +	/* The bio must be entirely covered by the ordered extent */
> > +	if (WARN_ON_ONCE(len > ordered->num_bytes))
> > +		return -EINVAL;
> 
> Can we also reject "len == ordered->num_bytes" case here? So, we will never
> modify the ordered extent to have
> ordered->{num_bytes,disk_num_bytes,bytes_left} == 0.

Sure, I've replaced the > with a >= and expanded the comment.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-03-25  8:38 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-24  2:31 btrfs: fix corruption caused by partial dio writes v6 Christoph Hellwig
2023-03-24  2:31 ` [PATCH 01/11] btrfs: add function to create and return an ordered extent Christoph Hellwig
2023-03-24  5:47   ` Naohiro Aota
2023-03-25  8:22     ` Christoph Hellwig
2023-03-24  2:31 ` [PATCH 02/11] btrfs: pass flags as unsigned long to btrfs_add_ordered_extent Christoph Hellwig
2023-03-24  4:53   ` Naohiro Aota
2023-03-24  2:31 ` [PATCH 03/11] btrfs: stash ordered extent in dio_data during iomap dio Christoph Hellwig
2023-03-24  7:41   ` Naohiro Aota
2023-03-24  2:32 ` [PATCH 04/11] btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent Christoph Hellwig
2023-03-24  2:32 ` [PATCH 05/11] btrfs: simplify btrfs_extract_ordered_extent Christoph Hellwig
2023-03-24  6:07   ` Naohiro Aota
2023-03-25  8:34     ` Christoph Hellwig
2023-03-24  2:32 ` [PATCH 06/11] btrfs: simplify btrfs_split_ordered_extent Christoph Hellwig
2023-03-24  7:52   ` Naohiro Aota
2023-03-25  8:37     ` Christoph Hellwig
2023-03-24  2:32 ` [PATCH 07/11] btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent Christoph Hellwig
2023-03-24  2:32 ` [PATCH 08/11] btrfs: simplify split_zoned_em Christoph Hellwig
2023-03-24  2:32 ` [PATCH 09/11] btrfs: pass an ordered_extent to btrfs_extract_ordered_extent Christoph Hellwig
2023-03-24  2:32 ` [PATCH 10/11] btrfs: don't split nocow extent_maps in btrfs_extract_ordered_extent Christoph Hellwig
2023-03-24  2:32 ` [PATCH 11/11] btrfs: split partial dio bios before submit Christoph Hellwig
2023-03-24 13:10 ` btrfs: fix corruption caused by partial dio writes v6 Johannes Thumshirn
2023-03-24 16:36   ` Johannes Thumshirn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).