linux-btrfs.vger.kernel.org archive mirror
* [PATCH v4 0/6] btrfs: fix corruption caused by partial dio writes
@ 2023-03-21 16:45 Boris Burkov
  2023-03-21 16:45 ` [PATCH v4 1/6] btrfs: add function to create and return an ordered extent Boris Burkov
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Boris Burkov @ 2023-03-21 16:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The final patch in this series ensures that bios submitted by btrfs dio
match the corresponding ordered_extent and extent_map exactly. As a
result, partial writes no longer leave a hole or trigger a deadlock, even
when the write buffer is a partially overlapping mapping of the range
being written to.

This required a bit of refactoring and setup. Specifically, the zoned
device code for "extracting" an ordered extent matching a bio could be
adapted, with some refactoring, to return the new ordered extents after
the split.

Patch 1: Generic patch for returning an ordered extent while creating it
Patch 2: Cache the dio ordered extent so that we can efficiently detect
         partial writes during bio submission, without an extra lookup.
Patch 3: Light refactoring of split_zoned_em -> split_em
Patch 4: Use Patch 1 to track the new ordered_extent(s) that result from
         splitting an ordered_extent.
Patch 5: Fix a bug in ordered extent splitting
Patch 6: Use the new more general split logic of Patch 4 and the stashed
         ordered extent from Patch 2 to split partial dio bios to fix
         the corruption while avoiding the deadlock.

---
Changelog:
v4:
- significant changes; redesign the fix to use bio splitting instead of
  extending the ordered_extent lifetime across calls into iomap.
- all the oe/em splitting refactoring and fixes
v3:
- handle BTRFS_ORDERED_IOERR set on the ordered_extent in btrfs_dio_iomap_end.
  If a bio fails and we exit the submission loop early, we never submit a
  second bio covering the rest of the extent range, leaking the
  ordered_extent, which hangs umount. We can distinguish this from a short
  write in btrfs_dio_iomap_end by checking the ordered_extent.
v2:
- rename new ordered extent function
- pull the new function into a prep patch
- reorganize how the ordered_extent is stored/passed around to avoid so
  many annoying memsets and exposing it to fs/btrfs/file.c
- lots of small code style improvements
- remove unintentional whitespace changes
- commit message improvements
- various ASSERTs for clarity/debugging


Boris Burkov (6):
  btrfs: add function to create and return an ordered extent
  btrfs: stash ordered extent in dio_data during iomap dio
  btrfs: repurpose split_zoned_em for partial dio splitting
  btrfs: return ordered_extent splits from bio extraction
  btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent
  btrfs: split partial dio bios before submit

 fs/btrfs/bio.c          |   2 +-
 fs/btrfs/btrfs_inode.h  |   5 +-
 fs/btrfs/inode.c        | 185 +++++++++++++++++++++++++++++-----------
 fs/btrfs/ordered-data.c |  88 ++++++++++++++-----
 fs/btrfs/ordered-data.h |  13 ++-
 5 files changed, 214 insertions(+), 79 deletions(-)

-- 
2.38.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v4 1/6] btrfs: add function to create and return an ordered extent
  2023-03-21 16:45 [PATCH v4 0/6] btrfs: fix corruption caused by partial dio writes Boris Burkov
@ 2023-03-21 16:45 ` Boris Burkov
  2023-03-21 16:45 ` [PATCH v4 2/6] btrfs: stash ordered extent in dio_data during iomap dio Boris Burkov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Boris Burkov @ 2023-03-21 16:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Currently, btrfs_add_ordered_extent allocates a new ordered extent and adds
it to the rb_tree, but doesn't return a referenced pointer to the caller.
There are cases where it is useful for the creator of a new ordered_extent
to hang on to such a pointer, so add a new function,
btrfs_alloc_ordered_extent, which is the same as btrfs_add_ordered_extent
except that it takes an additional reference on the new ordered_extent and
returns a pointer to it. Implement btrfs_add_ordered_extent as
btrfs_alloc_ordered_extent followed by dropping the new reference and
handling the IS_ERR case.

The type of flags in btrfs_alloc_ordered_extent and
btrfs_add_ordered_extent is changed from unsigned int to unsigned long
so it's unified with the other ordered extent functions.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ordered-data.c | 46 +++++++++++++++++++++++++++++++++--------
 fs/btrfs/ordered-data.h |  7 ++++++-
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 6c24b69e2d0a..1848d0d1a9c4 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -160,14 +160,16 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
  * @compress_type:   Compression algorithm used for data.
  *
  * Most of these parameters correspond to &struct btrfs_file_extent_item. The
- * tree is given a single reference on the ordered extent that was inserted.
+ * tree is given a single reference on the ordered extent that was inserted, and
+ * the returned pointer is given a second reference.
  *
- * Return: 0 or -ENOMEM.
+ * Return: the new ordered extent or ERR_PTR(-ENOMEM).
  */
-int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-			     u64 disk_num_bytes, u64 offset, unsigned flags,
-			     int compress_type)
+struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
+			struct btrfs_inode *inode, u64 file_offset,
+			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			u64 disk_num_bytes, u64 offset, unsigned long flags,
+			int compress_type)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -181,7 +183,7 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 		/* For nocow write, we can release the qgroup rsv right now */
 		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
 		if (ret < 0)
-			return ret;
+			return ERR_PTR(ret);
 		ret = 0;
 	} else {
 		/*
@@ -190,11 +192,11 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 		 */
 		ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes);
 		if (ret < 0)
-			return ret;
+			return ERR_PTR(ret);
 	}
 	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
 	if (!entry)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	entry->file_offset = file_offset;
 	entry->num_bytes = num_bytes;
@@ -256,6 +258,32 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 	btrfs_mod_outstanding_extents(inode, 1);
 	spin_unlock(&inode->lock);
 
+	/* One ref for the returned entry to match semantics of lookup. */
+	refcount_inc(&entry->refs);
+
+	return entry;
+}
+
+/*
+ * Add a new btrfs_ordered_extent for the range, but drop the reference instead
+ * of returning it to the caller.
+ */
+int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
+			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			     u64 disk_num_bytes, u64 offset, unsigned long flags,
+			     int compress_type)
+{
+	struct btrfs_ordered_extent *ordered;
+
+	ordered = btrfs_alloc_ordered_extent(inode, file_offset, num_bytes,
+					     ram_bytes, disk_bytenr,
+					     disk_num_bytes, offset, flags,
+					     compress_type);
+
+	if (IS_ERR(ordered))
+		return PTR_ERR(ordered);
+	btrfs_put_ordered_extent(ordered);
+
 	return 0;
 }
 
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index eb40cb39f842..18007f9c00ad 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -178,9 +178,14 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 				    struct btrfs_ordered_extent **cached,
 				    u64 file_offset, u64 io_size);
+struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
+			struct btrfs_inode *inode, u64 file_offset,
+			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			u64 disk_num_bytes, u64 offset, unsigned long flags,
+			int compress_type);
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-			     u64 disk_num_bytes, u64 offset, unsigned flags,
+			     u64 disk_num_bytes, u64 offset, unsigned long flags,
 			     int compress_type);
 void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
-- 
2.38.1



* [PATCH v4 2/6] btrfs: stash ordered extent in dio_data during iomap dio
  2023-03-21 16:45 [PATCH v4 0/6] btrfs: fix corruption caused by partial dio writes Boris Burkov
  2023-03-21 16:45 ` [PATCH v4 1/6] btrfs: add function to create and return an ordered extent Boris Burkov
@ 2023-03-21 16:45 ` Boris Burkov
  2023-03-21 16:45 ` [PATCH v4 3/6] btrfs: repurpose split_zoned_em for partial dio splitting Boris Burkov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Boris Burkov @ 2023-03-21 16:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

While it is not feasible for an ordered extent to survive across the
calls btrfs_direct_write makes into __iomap_dio_rw, it is still helpful
to stash it on the dio_data in between creating it in iomap_begin and
finishing it in either end_io or iomap_end.

The specific use I have in mind is that we can check whether a particular
bio is partial in submit_io without unconditionally looking up the ordered
extent. This is a preparatory patch for a later patch which does just
that.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/inode.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 76d93b9e94a9..5ab486f448eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -81,6 +81,7 @@ struct btrfs_dio_data {
 	struct extent_changeset *data_reserved;
 	bool data_space_reserved;
 	bool nocow_done;
+	struct btrfs_ordered_extent *ordered;
 };
 
 struct btrfs_dio_private {
@@ -6968,6 +6969,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 }
 
 static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
+						  struct btrfs_dio_data *dio_data,
 						  const u64 start,
 						  const u64 len,
 						  const u64 orig_start,
@@ -6978,7 +6980,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 						  const int type)
 {
 	struct extent_map *em = NULL;
-	int ret;
+	struct btrfs_ordered_extent *ordered;
 
 	if (type != BTRFS_ORDERED_NOCOW) {
 		em = create_io_em(inode, start, len, orig_start, block_start,
@@ -6988,18 +6990,21 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 		if (IS_ERR(em))
 			goto out;
 	}
-	ret = btrfs_add_ordered_extent(inode, start, len, len, block_start,
-				       block_len, 0,
-				       (1 << type) |
-				       (1 << BTRFS_ORDERED_DIRECT),
-				       BTRFS_COMPRESS_NONE);
-	if (ret) {
+	ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
+					     block_start, block_len, 0,
+					     (1 << type) |
+					     (1 << BTRFS_ORDERED_DIRECT),
+					     BTRFS_COMPRESS_NONE);
+	if (IS_ERR(ordered)) {
 		if (em) {
 			free_extent_map(em);
 			btrfs_drop_extent_map_range(inode, start,
 						    start + len - 1, false);
 		}
-		em = ERR_PTR(ret);
+		em = ERR_PTR(PTR_ERR(ordered));
+	} else {
+		ASSERT(!dio_data->ordered);
+		dio_data->ordered = ordered;
 	}
  out:
 
@@ -7007,6 +7012,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 }
 
 static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
+						  struct btrfs_dio_data *dio_data,
 						  u64 start, u64 len)
 {
 	struct btrfs_root *root = inode->root;
@@ -7022,7 +7028,8 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
 	if (ret)
 		return ERR_PTR(ret);
 
-	em = btrfs_create_dio_extent(inode, start, ins.offset, start,
+	em = btrfs_create_dio_extent(inode, dio_data,
+				     start, ins.offset, start,
 				     ins.objectid, ins.offset, ins.offset,
 				     ins.offset, BTRFS_ORDERED_REGULAR);
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
@@ -7367,7 +7374,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		}
 		space_reserved = true;
 
-		em2 = btrfs_create_dio_extent(BTRFS_I(inode), start, len,
+		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
 					      orig_start, block_start,
 					      len, orig_block_len,
 					      ram_bytes, type);
@@ -7409,7 +7416,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 			goto out;
 		space_reserved = true;
 
-		em = btrfs_new_extent_direct(BTRFS_I(inode), start, len);
+		em = btrfs_new_extent_direct(BTRFS_I(inode), dio_data, start, len);
 		if (IS_ERR(em)) {
 			ret = PTR_ERR(em);
 			goto out;
@@ -7715,6 +7722,10 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 				      pos + length - 1, NULL);
 		ret = -ENOTBLK;
 	}
+	if (write) {
+		btrfs_put_ordered_extent(dio_data->ordered);
+		dio_data->ordered = NULL;
+	}
 
 	if (write)
 		extent_changeset_free(dio_data->data_reserved);
@@ -7776,7 +7787,7 @@ static const struct iomap_dio_ops btrfs_dio_ops = {
 
 ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_before)
 {
-	struct btrfs_dio_data data;
+	struct btrfs_dio_data data = { 0 };
 
 	return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
 			    IOMAP_DIO_PARTIAL, &data, done_before);
@@ -7785,7 +7796,7 @@ ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_be
 struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 				  size_t done_before)
 {
-	struct btrfs_dio_data data;
+	struct btrfs_dio_data data = { 0 };
 
 	return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
 			    IOMAP_DIO_PARTIAL, &data, done_before);
-- 
2.38.1



* [PATCH v4 3/6] btrfs: repurpose split_zoned_em for partial dio splitting
  2023-03-21 16:45 [PATCH v4 0/6] btrfs: fix corruption caused by partial dio writes Boris Burkov
  2023-03-21 16:45 ` [PATCH v4 1/6] btrfs: add function to create and return an ordered extent Boris Burkov
  2023-03-21 16:45 ` [PATCH v4 2/6] btrfs: stash ordered extent in dio_data during iomap dio Boris Burkov
@ 2023-03-21 16:45 ` Boris Burkov
  2023-03-21 20:33   ` Boris Burkov
  2023-03-21 16:45 ` [PATCH v4 4/6] btrfs: return ordered_extent splits from bio extraction Boris Burkov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 8+ messages in thread
From: Boris Burkov @ 2023-03-21 16:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

In a subsequent patch I will be "extracting" a partial dio write bio
from its ordered extent, creating a 1:1 bio<->ordered_extent relation.
This is necessary to avoid triggering an assertion in unpin_extent_cache
called from btrfs_finish_ordered_io that checks that the em matches the
finished ordered extent.

Since there is already a function which splits an uncompressed
extent_map for a zoned bio use case, adapt it to this new, similar
use case. Mostly, modify it to handle the case where the extent_map is
bigger than the ordered_extent, and we cannot assume the em "post" split
can be computed from the ordered_extent and bio. This comes up in
btrfs/250, for example.

I felt that these relaxations were not so damaging to the legibility of the
zoned case as to merit a fully separate codepath, but I admit that is not
my area of expertise.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/inode.c | 104 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 71 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5ab486f448eb..2f8baf4797ea 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2512,37 +2512,59 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 }
 
 /*
- * Split an extent_map at [start, start + len]
+ * Split out a middle extent_map [start, start+len] from within an extent_map.
  *
- * This function is intended to be used only for extract_ordered_extent().
+ * @inode: the inode to which the extent map belongs.
+ * @start: the start index of the middle split
+ * @len: the length of the middle split
+ *
+ * The result is two or three extent_maps inserted in the tree, depending on
+ * whether start and len imply an uncovered area at the beginning or end of the
+ * extent map. If the implied split lines up with the end or beginning, there
+ * will only be two extent maps in the resulting split, otherwise there will be
+ * three. (If they both match, the split operation is a no-op)
+ *
+ * extent map splitting assumptions:
+ * end = start + len
+ * em-end = em-start + em-len
+ * start >= em-start
+ * len < em-len
+ * end <= em-end
+ *
+ * Diagrams explaining the splitting cases:
+ * original em:
+ *   [em-start---start---end---em-end)
+ * resulting ems:
+ * start != em-start && end != em-end (full tri split):
+ *   [em-start---start) [start---end) [end---em-end)
+ * start == em-start (no pre split):
+ *   [em-start---end) [end---em-end)
+ * end == em-end (no post split):
+ *   [em-start---start) [start--em-end)
+ *
+ * Returns: 0 on success, -errno on failure.
  */
-static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
-			  u64 pre, u64 post)
+static int split_em(struct btrfs_inode *inode, u64 start, u64 len)
 {
 	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
+	u64 pre_start, pre_len, pre_end;
+	u64 mid_start, mid_len, mid_end;
+	u64 post_start, post_len, post_end;
 	struct extent_map *split_pre = NULL;
 	struct extent_map *split_mid = NULL;
 	struct extent_map *split_post = NULL;
 	int ret = 0;
 	unsigned long flags;
 
-	/* Sanity check */
-	if (pre == 0 && post == 0)
-		return 0;
-
 	split_pre = alloc_extent_map();
-	if (pre)
-		split_mid = alloc_extent_map();
-	if (post)
-		split_post = alloc_extent_map();
-	if (!split_pre || (pre && !split_mid) || (post && !split_post)) {
+	split_mid = alloc_extent_map();
+	split_post = alloc_extent_map();
+	if (!split_pre || !split_mid || !split_post) {
 		ret = -ENOMEM;
 		goto out;
 	}
 
-	ASSERT(pre + post < len);
-
 	lock_extent(&inode->io_tree, start, start + len - 1, NULL);
 	write_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, start, len);
@@ -2551,19 +2573,38 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
 		goto out_unlock;
 	}
 
-	ASSERT(em->len == len);
+	pre_start = em->start;
+	pre_end = start;
+	pre_len = pre_end - pre_start;
+	mid_start = start;
+	mid_end = start + len;
+	mid_len = len;
+	post_start = mid_end;
+	post_end = em->start + em->len;
+	post_len = post_end - post_start;
+	ASSERT(pre_start == em->start);
+	ASSERT(pre_start + pre_len == mid_start);
+	ASSERT(mid_start + mid_len == post_start);
+	ASSERT(post_start + post_len == em->start + em->len);
+
+	/* Sanity check */
+	if (pre_len == 0 && post_len == 0) {
+		ret = 0;
+		goto out_unlock;
+	}
+
 	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
 	ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
-	ASSERT(test_bit(EXTENT_FLAG_PINNED, &em->flags));
 	ASSERT(!test_bit(EXTENT_FLAG_LOGGING, &em->flags));
 	ASSERT(!list_empty(&em->list));
 
 	flags = em->flags;
-	clear_bit(EXTENT_FLAG_PINNED, &em->flags);
+	if (test_bit(EXTENT_FLAG_PINNED, &em->flags))
+		clear_bit(EXTENT_FLAG_PINNED, &em->flags);
 
 	/* First, replace the em with a new extent_map starting from em->start */
 	split_pre->start = em->start;
-	split_pre->len = (pre ? pre : em->len - post);
+	split_pre->len = (pre_len ? pre_len : mid_len);
 	split_pre->orig_start = split_pre->start;
 	split_pre->block_start = em->block_start;
 	split_pre->block_len = split_pre->len;
@@ -2577,16 +2618,15 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
 
 	/*
 	 * Now we only have an extent_map at:
-	 *     [em->start, em->start + pre] if pre != 0
-	 *     [em->start, em->start + em->len - post] if pre == 0
+	 *     [em->start, em->start + pre_len] if pre_len != 0
+	 *     [em->start, em->start + mid_len] if pre == 0
 	 */
-
-	if (pre) {
+	if (pre_len) {
 		/* Insert the middle extent_map */
-		split_mid->start = em->start + pre;
-		split_mid->len = em->len - pre - post;
+		split_mid->start = mid_start;
+		split_mid->len = mid_len;
 		split_mid->orig_start = split_mid->start;
-		split_mid->block_start = em->block_start + pre;
+		split_mid->block_start = em->block_start + pre_len;
 		split_mid->block_len = split_mid->len;
 		split_mid->orig_block_len = split_mid->block_len;
 		split_mid->ram_bytes = split_mid->len;
@@ -2596,11 +2636,11 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
 		add_extent_mapping(em_tree, split_mid, 1);
 	}
 
-	if (post) {
-		split_post->start = em->start + em->len - post;
-		split_post->len = post;
+	if (post_len) {
+		split_post->start = post_start;
+		split_post->len = post_len;
 		split_post->orig_start = split_post->start;
-		split_post->block_start = em->block_start + em->len - post;
+		split_post->block_start = em->block_start + pre_len + mid_len;
 		split_post->block_len = split_post->len;
 		split_post->orig_block_len = split_post->block_len;
 		split_post->ram_bytes = split_post->len;
@@ -2632,7 +2672,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 	u64 len = bbio->bio.bi_iter.bi_size;
 	struct btrfs_inode *inode = bbio->inode;
 	struct btrfs_ordered_extent *ordered;
-	u64 file_len;
 	u64 end = start + len;
 	u64 ordered_end;
 	u64 pre, post;
@@ -2671,14 +2710,13 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 		goto out;
 	}
 
-	file_len = ordered->num_bytes;
 	pre = start - ordered->disk_bytenr;
 	post = ordered_end - end;
 
 	ret = btrfs_split_ordered_extent(ordered, pre, post);
 	if (ret)
 		goto out;
-	ret = split_zoned_em(inode, bbio->file_offset, file_len, pre, post);
+	ret = split_em(inode, bbio->file_offset, len);
 
 out:
 	btrfs_put_ordered_extent(ordered);
-- 
2.38.1



* [PATCH v4 4/6] btrfs: return ordered_extent splits from bio extraction
  2023-03-21 16:45 [PATCH v4 0/6] btrfs: fix corruption caused by partial dio writes Boris Burkov
                   ` (2 preceding siblings ...)
  2023-03-21 16:45 ` [PATCH v4 3/6] btrfs: repurpose split_zoned_em for partial dio splitting Boris Burkov
@ 2023-03-21 16:45 ` Boris Burkov
  2023-03-21 16:45 ` [PATCH v4 5/6] btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent Boris Burkov
  2023-03-21 16:45 ` [PATCH v4 6/6] btrfs: split partial dio bios before submit Boris Burkov
  5 siblings, 0 replies; 8+ messages in thread
From: Boris Burkov @ 2023-03-21 16:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

When extracting a bio from its ordered extent for dio partial writes, we
need the "remainder" ordered extent. It would be possible to look it up in
that case, but since we can grab the ordered_extent from the new allocation
function, we might as well wire it up to be returned to the caller via an
out parameter and save that lookup.

Refactor the clone ordered extent function to return the new ordered
extent, then refactor the split and extract functions to pass back the new
pre and post split ordered extents via out parameters.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/bio.c          |  2 +-
 fs/btrfs/btrfs_inode.h  |  5 ++++-
 fs/btrfs/inode.c        | 23 ++++++++++++++++++-----
 fs/btrfs/ordered-data.c | 36 +++++++++++++++++++++++-------------
 fs/btrfs/ordered-data.h |  6 ++++--
 5 files changed, 50 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index cf09c6271edb..b849ced40d37 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -653,7 +653,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
 		if (use_append) {
 			bio->bi_opf &= ~REQ_OP_WRITE;
 			bio->bi_opf |= REQ_OP_ZONE_APPEND;
-			ret = btrfs_extract_ordered_extent(bbio);
+			ret = btrfs_extract_ordered_extent_bio(bbio, NULL, NULL, NULL);
 			if (ret)
 				goto fail_put_bio;
 		}
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9dc21622806e..e92a09559058 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -407,7 +407,10 @@ static inline void btrfs_inode_split_flags(u64 inode_item_flags,
 
 int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
 			    u32 pgoff, u8 *csum, const u8 * const csum_expected);
-blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio);
+blk_status_t btrfs_extract_ordered_extent_bio(struct btrfs_bio *bbio,
+					      struct btrfs_ordered_extent *ordered,
+					      struct btrfs_ordered_extent **ret_pre,
+					      struct btrfs_ordered_extent **ret_post);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2f8baf4797ea..dbea124c9fa3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2666,21 +2666,35 @@ static int split_em(struct btrfs_inode *inode, u64 start, u64 len)
 	return ret;
 }
 
-blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
+/*
+ * Extract a bio from an ordered extent to enforce an invariant where the bio
+ * fully matches a single ordered extent.
+ *
+ * @bbio: the bio to extract.
+ * @ordered: the ordered extent the bio is in, will be shrunk to fit. If NULL we
+ *	     will look it up.
+ * @ret_pre: out parameter to return the new oe in front of the bio, if needed.
+ * @ret_post: out parameter to return the new oe past the bio, if needed.
+ */
+blk_status_t btrfs_extract_ordered_extent_bio(struct btrfs_bio *bbio,
+					      struct btrfs_ordered_extent *ordered,
+					      struct btrfs_ordered_extent **ret_pre,
+					      struct btrfs_ordered_extent **ret_post)
 {
 	u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 len = bbio->bio.bi_iter.bi_size;
 	struct btrfs_inode *inode = bbio->inode;
-	struct btrfs_ordered_extent *ordered;
 	u64 end = start + len;
 	u64 ordered_end;
 	u64 pre, post;
 	int ret = 0;
 
-	ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset);
+	if (!ordered)
+		ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset);
 	if (WARN_ON_ONCE(!ordered))
 		return BLK_STS_IOERR;
 
+	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
 	/* No need to split */
 	if (ordered->disk_num_bytes == len)
 		goto out;
@@ -2697,7 +2711,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 		goto out;
 	}
 
-	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
 	/* bio must be in one ordered extent */
 	if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
 		ret = -EINVAL;
@@ -2713,7 +2726,7 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 	pre = start - ordered->disk_bytenr;
 	post = ordered_end - end;
 
-	ret = btrfs_split_ordered_extent(ordered, pre, post);
+	ret = btrfs_split_ordered_extent(ordered, pre, post, ret_pre, ret_post);
 	if (ret)
 		goto out;
 	ret = split_em(inode, bbio->file_offset, len);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 1848d0d1a9c4..4bebebb9b434 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1117,8 +1117,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 }
 
 
-static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
-				u64 len)
+static struct btrfs_ordered_extent *clone_ordered_extent(struct btrfs_ordered_extent *ordered,
+						  u64 pos, u64 len)
 {
 	struct inode *inode = ordered->inode;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
@@ -1133,18 +1133,22 @@ static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
 	percpu_counter_add_batch(&fs_info->ordered_bytes, -len,
 				 fs_info->delalloc_batch);
 	WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED));
-	return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len,
-					disk_bytenr, len, 0, flags,
-					ordered->compress_type);
+	return btrfs_alloc_ordered_extent(BTRFS_I(inode), file_offset, len, len,
+					  disk_bytenr, len, 0, flags,
+					  ordered->compress_type);
 }
 
-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
-				u64 post)
+int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
+			       u64 pre, u64 post,
+			       struct btrfs_ordered_extent **ret_pre,
+			       struct btrfs_ordered_extent **ret_post)
+
 {
 	struct inode *inode = ordered->inode;
 	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
 	struct rb_node *node;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_ordered_extent *oe;
 	int ret = 0;
 
 	trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered);
@@ -1172,12 +1176,18 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
 
 	spin_unlock_irq(&tree->lock);
 
-	if (pre)
-		ret = clone_ordered_extent(ordered, 0, pre);
-	if (ret == 0 && post)
-		ret = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes,
-					   post);
-
+	if (pre) {
+		oe = clone_ordered_extent(ordered, 0, pre);
+		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
+		if (!ret && ret_pre)
+			*ret_pre = oe;
+	}
+	if (!ret && post) {
+		oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
+		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
+		if (!ret && ret_post)
+			*ret_post = oe;
+	}
 	return ret;
 }
 
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 18007f9c00ad..933f6f0d8c10 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -212,8 +212,10 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 					struct extent_state **cached_state);
 bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct extent_state **cached_state);
-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
-			       u64 post);
+int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
+			       u64 pre, u64 post,
+			       struct btrfs_ordered_extent **ret_pre,
+			       struct btrfs_ordered_extent **ret_post);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
-- 
2.38.1



* [PATCH v4 5/6] btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent
  2023-03-21 16:45 [PATCH v4 0/6] btrfs: fix corruption caused by partial dio writes Boris Burkov
                   ` (3 preceding siblings ...)
  2023-03-21 16:45 ` [PATCH v4 4/6] btrfs: return ordered_extent splits from bio extraction Boris Burkov
@ 2023-03-21 16:45 ` Boris Burkov
  2023-03-21 16:45 ` [PATCH v4 6/6] btrfs: split partial dio bios before submit Boris Burkov
  5 siblings, 0 replies; 8+ messages in thread
From: Boris Burkov @ 2023-03-21 16:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

If pre != 0 in btrfs_split_ordered_extent, then we do the following:
1. remove ordered (at file_offset) from the rb tree
2. modify file_offset += pre
3. re-insert ordered
4. clone an ordered extent at offset 0, length pre, from ordered.
5. clone an ordered extent for the post range, if necessary.

Step 4 is not correct: at that point, the start of ordered is already the
end of the desired new pre extent. Further, this causes a panic when
btrfs_alloc_ordered_extent sees that the node (from the modified and
re-inserted ordered) is already present at file_offset + 0 = file_offset.

We can fix this either by using a negative offset, or by moving the
clone of the pre extent to after we remove the original one, but before
we modify and re-insert it. The former feels quite kludgy, as we would
be "cloning" from outside the range of the ordered extent, so opt for
the latter, which does have some locking annoyances.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/ordered-data.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4bebebb9b434..d14a3fe1a113 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1161,6 +1161,17 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
 	if (tree->last == node)
 		tree->last = NULL;
 
+	if (pre) {
+		spin_unlock_irq(&tree->lock);
+		oe = clone_ordered_extent(ordered, 0, pre);
+		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
+		if (!ret && ret_pre)
+			*ret_pre = oe;
+		if (ret)
+			goto out;
+		spin_lock_irq(&tree->lock);
+	}
+
 	ordered->file_offset += pre;
 	ordered->disk_bytenr += pre;
 	ordered->num_bytes -= (pre + post);
@@ -1176,18 +1187,13 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
 
 	spin_unlock_irq(&tree->lock);
 
-	if (pre) {
-		oe = clone_ordered_extent(ordered, 0, pre);
-		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
-		if (!ret && ret_pre)
-			*ret_pre = oe;
-	}
-	if (!ret && post) {
+	if (post) {
 		oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
 		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
 		if (!ret && ret_post)
 			*ret_post = oe;
 	}
+out:
 	return ret;
 }
 
-- 
2.38.1



* [PATCH v4 6/6] btrfs: split partial dio bios before submit
  2023-03-21 16:45 [PATCH v4 0/6] btrfs: fix corruption caused by partial dio writes Boris Burkov
                   ` (4 preceding siblings ...)
  2023-03-21 16:45 ` [PATCH v4 5/6] btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent Boris Burkov
@ 2023-03-21 16:45 ` Boris Burkov
  5 siblings, 0 replies; 8+ messages in thread
From: Boris Burkov @ 2023-03-21 16:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

If an application doing direct io to a btrfs file experiences a page
fault while reading from the write buffer, iomap will issue a partial
bio and allow the fs to keep going. However, a subtle bug in this
codepath in the btrfs dio iomap implementation led to the partial write
ending up as a gap in the file's extents, which is read back as zeros.

The sequence of events in a partial write, lightly summarized and
trimmed down for brevity, is as follows:

====WRITING TASK====
btrfs_direct_write
__iomap_dio_rw
iomap_iter
btrfs_dio_iomap_begin # create full ordered extent
iomap_dio_bio_iter
bio_iov_iter_get_pages # page fault; partial read
submit_bio # partial bio
iomap_iter
btrfs_dio_iomap_end
btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR;
			       # submit to finish_ordered_fn wq
fault_in_iov_iter_readable # btrfs_direct_write detects partial write
__iomap_dio_rw
iomap_iter
btrfs_dio_iomap_begin # create second partial ordered extent
iomap_dio_bio_iter
bio_iov_iter_get_pages # read all of remainder
submit_bio # partial bio with all of remainder
iomap_iter
btrfs_dio_iomap_end # nothing exciting to do with ordered io

====DIO ENDIO====
==FIRST PARTIAL BIO==
btrfs_dio_end_io
btrfs_mark_ordered_io_finished # bytes_left > 0
			     # don't submit to finish_ordered_fn wq
==SECOND PARTIAL BIO==
btrfs_dio_end_io
btrfs_mark_ordered_io_finished # bytes_left == 0
			     # submit to finish_ordered_fn wq

====BTRFS FINISH ORDERED WQ====
==FIRST PARTIAL BIO==
btrfs_finish_ordered_io # called by dio_iomap_end_io, sees
		    # BTRFS_ORDERED_IOERR, just drops the
		    # ordered_extent
==SECOND PARTIAL BIO==
btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file
		    # extents, csums, etc...

The essence of the problem is that while btrfs_direct_write and iomap
properly interact to submit all the correct bios, there is insufficient
logic in the btrfs dio functions (btrfs_dio_iomap_begin,
btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to
ensure that every bio is at least a part of a completed ordered_extent.
And it is the completion of an ordered_extent that triggers crucial
functionality like writing out a file extent for the range.

More specifically, btrfs_dio_end_io treats the ordered extent as
unfinished, but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it.
Thus, the finish-io work doesn't result in file extents, csums, etc.
In the aftermath, such a file behaves as though it has a hole in it,
instead of the purportedly written data.

We considered a few options for fixing the bug (apologies for any
incorrect summary of a proposal that I didn't implement and may not
fully understand):
1. treat the partial bio as if we had truncated the file, which would
result in properly finishing it.
2. split the ordered extent when submitting a partial bio.
3. cache the ordered extent across calls to __iomap_dio_rw in
iter->private, so that we could reuse it and correctly apply several
bios to it.

I had trouble with 1, and it felt the most like a hack, so I tried 2
and 3. Since 3 has the added benefit of not creating an extra file
extent, and avoids an ordered extent lookup during bio submission, it
felt like the best option. However, it turned out to re-introduce the
deadlock that discarding the ordered_extent between faults was meant
to fix in the first place. (See the link to an explanation of the
deadlock below.)

Therefore, go with fix #2, which requires a bit more setup work but
fixes the corruption without introducing the deadlock, which is
fundamentally caused by the ordered extent existing when we attempt to
fault in a range that overlaps with it.

Put succinctly, what this patch does is: when we submit a dio bio, check
if it is partial against the ordered extent stored in dio_data, and if it
is, extract the ordered_extent that matches the bio exactly out of the
larger ordered_extent. Keep the remaining ordered_extent around in dio_data
for cancellation in iomap_end.

Thanks to Josef, Christoph, and Filipe for their help figuring out the
bug and the fix.

Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947
Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/
Link: https://pastebin.com/3SDaH8C6
Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/inode.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index dbea124c9fa3..69fdcbb89522 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7815,6 +7815,7 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	struct btrfs_dio_private *dip =
 		container_of(bbio, struct btrfs_dio_private, bbio);
 	struct btrfs_dio_data *dio_data = iter->private;
+	int err = 0;
 
 	btrfs_bio_init(bbio, BTRFS_I(iter->inode), btrfs_dio_end_io, bio->bi_private);
 	bbio->file_offset = file_offset;
@@ -7823,7 +7824,25 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	dip->bytes = bio->bi_iter.bi_size;
 
 	dio_data->submitted += bio->bi_iter.bi_size;
-	btrfs_submit_bio(bbio, 0);
+	/*
+	 * Check if we are doing a partial write. If we are, we need to split
+	 * the ordered extent to match the submitted bio. Hang on to the
+	 * remaining unfinishable ordered_extent in dio_data so that it can be
+	 * cancelled in iomap_end to avoid a deadlock wherein faulting the
+	 * remaining pages is blocked on the outstanding ordered extent.
+	 */
+	if (iter->flags & IOMAP_WRITE) {
+		struct btrfs_ordered_extent *ordered = dio_data->ordered;
+
+		ASSERT(ordered);
+		if (bio->bi_iter.bi_size < ordered->num_bytes)
+			err = btrfs_extract_ordered_extent_bio(bbio, ordered, NULL,
+							       &dio_data->ordered);
+	}
+	if (err)
+		btrfs_bio_end_io(bbio, err);
+	else
+		btrfs_submit_bio(bbio, 0);
 }
 
 static const struct iomap_ops btrfs_dio_iomap_ops = {
-- 
2.38.1



* Re: [PATCH v4 3/6] btrfs: repurpose split_zoned_em for partial dio splitting
  2023-03-21 16:45 ` [PATCH v4 3/6] btrfs: repurpose split_zoned_em for partial dio splitting Boris Burkov
@ 2023-03-21 20:33   ` Boris Burkov
  0 siblings, 0 replies; 8+ messages in thread
From: Boris Burkov @ 2023-03-21 20:33 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

On Tue, Mar 21, 2023 at 09:45:30AM -0700, Boris Burkov wrote:
> In a subsequent patch I will be "extracting" a partial dio write bio
> from its ordered extent, creating a 1:1 bio<->ordered_extent relation.
> This is necessary to avoid triggering an assertion in unpin_extent_cache
> called from btrfs_finish_ordered_io that checks that the em matches the
> finished ordered extent.
> 
> Since there is already a function which splits an uncompressed
> extent_map for a zoned bio use case, adapt it to this new, similar
> use case. Mostly, modify it to handle the case where the extent_map is
> bigger than the ordered_extent, and we cannot assume the em "post" split
> can be computed from the ordered_extent and bio. This comes up in
> btrfs/250, for example.
> 
> I felt that these relaxations were not so damaging to the legibility of
> the zoned case as to merit a fully separate codepath, but I admit that is
> not my area of expertise.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/inode.c | 104 ++++++++++++++++++++++++++++++++---------------
>  1 file changed, 71 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5ab486f448eb..2f8baf4797ea 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2512,37 +2512,59 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
>  }
>  
>  /*
> - * Split an extent_map at [start, start + len]
> + * Split out a middle extent_map [start, start+len] from within an extent_map.
>   *
> - * This function is intended to be used only for extract_ordered_extent().
> + * @inode: the inode to which the extent map belongs.
> + * @start: the start index of the middle split
> + * @len: the length of the middle split
> + *
> + * The result is two or three extent_maps inserted in the tree, depending on
> + * whether start and len imply an uncovered area at the beginning or end of the
> + * extent map. If the implied split lines up with the end or beginning, there
> + * will only be two extent maps in the resulting split, otherwise there will be
> + * three. (If they both match, the split operation is a no-op)
> + *
> + * extent map splitting assumptions:
> + * end = start + len
> + * em-end = em-start + em-len
> + * start >= em-start
> + * len < em-len
> + * end <= em-end
> + *
> + * Diagrams explaining the splitting cases:
> + * original em:
> + *   [em-start---start---end---em-end)
> + * resulting ems:
> + * start != em-start && end != em-end (full tri split):
> + *   [em-start---start) [start---end) [end---em-end)
> + * start == em-start (no pre split):
> + *   [em-start---end) [end---em-end)
> + * end == em-end (no post split):
> + *   [em-start---start) [start--em-end)
> + *
> + * Returns: 0 on success, -errno on failure.
>   */
> -static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
> -			  u64 pre, u64 post)
> +static int split_em(struct btrfs_inode *inode, u64 start, u64 len)
>  {
>  	struct extent_map_tree *em_tree = &inode->extent_tree;
>  	struct extent_map *em;
> +	u64 pre_start, pre_len, pre_end;
> +	u64 mid_start, mid_len, mid_end;
> +	u64 post_start, post_len, post_end;
>  	struct extent_map *split_pre = NULL;
>  	struct extent_map *split_mid = NULL;
>  	struct extent_map *split_post = NULL;
>  	int ret = 0;
>  	unsigned long flags;
>  
> -	/* Sanity check */
> -	if (pre == 0 && post == 0)
> -		return 0;
> -
>  	split_pre = alloc_extent_map();
> -	if (pre)
> -		split_mid = alloc_extent_map();
> -	if (post)
> -		split_post = alloc_extent_map();
> -	if (!split_pre || (pre && !split_mid) || (post && !split_post)) {
> +	split_mid = alloc_extent_map();
> +	split_post = alloc_extent_map();
> +	if (!split_pre || !split_mid || !split_post) {
>  		ret = -ENOMEM;
>  		goto out;
>  	}
>  
> -	ASSERT(pre + post < len);
> -
>  	lock_extent(&inode->io_tree, start, start + len - 1, NULL);
>  	write_lock(&em_tree->lock);
>  	em = lookup_extent_mapping(em_tree, start, len);
> @@ -2551,19 +2573,38 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
>  		goto out_unlock;
>  	}
>  
> -	ASSERT(em->len == len);
> +	pre_start = em->start;
> +	pre_end = start;
> +	pre_len = pre_end - pre_start;
> +	mid_start = start;
> +	mid_end = start + len;
> +	mid_len = len;
> +	post_start = mid_end;
> +	post_end = em->start + em->len;
> +	post_len = post_end - post_start;
> +	ASSERT(pre_start == em->start);
> +	ASSERT(pre_start + pre_len == mid_start);
> +	ASSERT(mid_start + mid_len == post_start);
> +	ASSERT(post_start + post_len == em->start + em->len);
> +
> +	/* Sanity check */
> +	if (pre_len == 0 && post_len == 0) {
> +		ret = 0;
> +		goto out_unlock;
> +	}
> +
>  	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
>  	ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
> -	ASSERT(test_bit(EXTENT_FLAG_PINNED, &em->flags));

I am currently digging into this more, as I think we want the em to be
pinned. The length assert is likely the same issue, so maybe I can figure
them both out. I was seeing them violated on btrfs/250.

>  	ASSERT(!test_bit(EXTENT_FLAG_LOGGING, &em->flags));
>  	ASSERT(!list_empty(&em->list));
>  
>  	flags = em->flags;
> -	clear_bit(EXTENT_FLAG_PINNED, &em->flags);
> +	if (test_bit(EXTENT_FLAG_PINNED, &em->flags))
> +		clear_bit(EXTENT_FLAG_PINNED, &em->flags);
>  
>  	/* First, replace the em with a new extent_map starting from * em->start */
>  	split_pre->start = em->start;
> -	split_pre->len = (pre ? pre : em->len - post);
> +	split_pre->len = (pre_len ? pre_len : mid_len);
>  	split_pre->orig_start = split_pre->start;
>  	split_pre->block_start = em->block_start;
>  	split_pre->block_len = split_pre->len;
> @@ -2577,16 +2618,15 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
>  
>  	/*
>  	 * Now we only have an extent_map at:
> -	 *     [em->start, em->start + pre] if pre != 0
> -	 *     [em->start, em->start + em->len - post] if pre == 0
> +	 *     [em->start, em->start + pre_len] if pre_len != 0
> +	 *     [em->start, em->start + mid_len] if pre == 0
>  	 */
> -
> -	if (pre) {
> +	if (pre_len) {
>  		/* Insert the middle extent_map */
> -		split_mid->start = em->start + pre;
> -		split_mid->len = em->len - pre - post;
> +		split_mid->start = mid_start;
> +		split_mid->len = mid_len;
>  		split_mid->orig_start = split_mid->start;
> -		split_mid->block_start = em->block_start + pre;
> +		split_mid->block_start = em->block_start + pre_len;
>  		split_mid->block_len = split_mid->len;
>  		split_mid->orig_block_len = split_mid->block_len;
>  		split_mid->ram_bytes = split_mid->len;
> @@ -2596,11 +2636,11 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
>  		add_extent_mapping(em_tree, split_mid, 1);
>  	}
>  
> -	if (post) {
> -		split_post->start = em->start + em->len - post;
> -		split_post->len = post;
> +	if (post_len) {
> +		split_post->start = post_start;
> +		split_post->len = post_len;
>  		split_post->orig_start = split_post->start;
> -		split_post->block_start = em->block_start + em->len - post;
> +		split_post->block_start = em->block_start + pre_len + mid_len;
>  		split_post->block_len = split_post->len;
>  		split_post->orig_block_len = split_post->block_len;
>  		split_post->ram_bytes = split_post->len;
> @@ -2632,7 +2672,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
>  	u64 len = bbio->bio.bi_iter.bi_size;
>  	struct btrfs_inode *inode = bbio->inode;
>  	struct btrfs_ordered_extent *ordered;
> -	u64 file_len;
>  	u64 end = start + len;
>  	u64 ordered_end;
>  	u64 pre, post;
> @@ -2671,14 +2710,13 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
>  		goto out;
>  	}
>  
> -	file_len = ordered->num_bytes;
>  	pre = start - ordered->disk_bytenr;
>  	post = ordered_end - end;
>  
>  	ret = btrfs_split_ordered_extent(ordered, pre, post);
>  	if (ret)
>  		goto out;
> -	ret = split_zoned_em(inode, bbio->file_offset, file_len, pre, post);
> +	ret = split_em(inode, bbio->file_offset, len);
>  
>  out:
>  	btrfs_put_ordered_extent(ordered);
> -- 
> 2.38.1
> 


2023-03-21 16:45 [PATCH v4 0/6] btrfs: fix corruption caused by partial dio writes Boris Burkov
2023-03-21 16:45 ` [PATCH v4 1/6] btrfs: add function to create and return an ordered extent Boris Burkov
2023-03-21 16:45 ` [PATCH v4 2/6] btrfs: stash ordered extent in dio_data during iomap dio Boris Burkov
2023-03-21 16:45 ` [PATCH v4 3/6] btrfs: repurpose split_zoned_em for partial dio splitting Boris Burkov
2023-03-21 20:33   ` Boris Burkov
2023-03-21 16:45 ` [PATCH v4 4/6] btrfs: return ordered_extent splits from bio extraction Boris Burkov
2023-03-21 16:45 ` [PATCH v4 5/6] btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent Boris Burkov
2023-03-21 16:45 ` [PATCH v4 6/6] btrfs: split partial dio bios before submit Boris Burkov
