[PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate
@ 2022-06-06  9:41 fdmanana
  2022-06-06  9:41 ` [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion fdmanana
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: fdmanana @ 2022-06-06  9:41 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

This patchset fixes a couple of bugs, the first one exclusive for reflink
operations while the second one applies to reflink, fallocate and hole
punching operations. Finally the third patch just removes some BUG_ON()s
that are really not needed.
All the patches are independent of each other, they could have been sent
out as individual patches.

Filipe Manana (3):
  btrfs: fix race between reflinking and ordered extent completion
  btrfs: add missing inode updates on each iteration when replacing
    extents
  btrfs: do not BUG_ON() on failure to migrate space when replacing
    extents

 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/file.c    | 25 +++++++++++++++++++++++--
 fs/btrfs/inode.c   |  1 +
 fs/btrfs/reflink.c | 16 ++++++++++++----
 4 files changed, 38 insertions(+), 6 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion
  2022-06-06  9:41 [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate fdmanana
@ 2022-06-06  9:41 ` fdmanana
  2022-06-06 21:36   ` Boris Burkov
  2022-06-06  9:41 ` [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents fdmanana
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: fdmanana @ 2022-06-06  9:41 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

While doing a reflink operation, if an ordered extent for a file range
that does not overlap with the source and destination ranges of the
reflink operation happens, we can end up having a failure in the reflink
operation and return -EINVAL to user space.

The following sequence of steps explains how this can happen:

1) We have the page at file offset 315392 dirty (under delalloc);

2) A reflink operation for this file starts, using the same file as both
   source and destination, the source range is [372736, 409600[ (length of
   36864 bytes) and the destination range is [208896, 245760[;

3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
   and destination ranges, and wait for any ordered extents in those range
   to complete;

4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
   the inode, but we neither wait for it to complete nor any ordered
   extents to complete. This results in starting delalloc for the page at
   file offset 315392 and creating an ordered extent for that single page
   range;

5) We then move to btrfs_clone() and enter the loop to find file extent
   items to copy from the source range to destination range;

6) In the first iteration we end up at last file extent item stored in
   leaf A:

   (...)
   item 131 key (143616 108 315392) itemoff 5101 itemsize 53
            extent data disk bytenr 1903988736 nr 73728
            extent data offset 12288 nr 61440 ram 73728

   This represents the file range [315392, 376832[, which overlaps with
   the source range to clone.

   @datal is set to 61440, key.offset is 315392 and @next_key_min_offset
   is therefore set to 376832 (315392 + 61440).

   @off (372736) is > key.offset (315392), so @new_key.offset is set to
   the value of @destoff (208896).

   @new_key.offset == @last_dest_end (208896) so @drop_start is set to
   208896 (@new_key.offset).

   @datal is adjusted to 4096, as @off is > @key.offset.

   So in this iteration we call btrfs_replace_file_extents() for the range
   [208896, 212991] (a single page, which is
   [@drop_start, @new_key.offset + @datal - 1]).

   @last_dest_end is set to 212992 (@new_key.offset + @datal =
   208896 + 4096 = 212992).

   Before the next iteration of the loop, @key.offset is set to the value
   376832, which is @next_key_min_offset;

7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
   but this time pointing beyond the last slot of leaf A, as that's where
   a key with offset 376832 should be at if it existed. So end up calling
   btrfs_next_leaf();

8) btrfs_next_leaf() releases the path, but before it searches again the
   tree for the next key/leaf, the ordered extent for the single page
   range at file offset 315392 completes. That results in trimming the
   file extent item we processed before, adjusting its key offset from
   315392 to 319488, reducing its length from 61440 to 57344 and inserting
   a new file extent item for that single page range, with a key offset of
   315392 and a length of 4096.

   Leaf A now looks like:

     (...)
     item 132 key (143616 108 315392) itemoff 4995 itemsize 53
              extent data disk bytenr 1801666560 nr 4096
              extent data offset 0 nr 4096 ram 4096
     item 133 key (143616 108 319488) itemoff 4942 itemsize 53
              extent data disk bytenr 1903988736 nr 73728
              extent data offset 16384 nr 57344 ram 73728

9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
   at slot 133, since it's the first key that follows what was the last
   key we saw (143616 108 315392). In fact it's the same item we processed
   before, but its key offset was changed, so it counts as a new key;

10) So now we have:

    @key.offset == 319488
    @datal == 57344

    @off (372736) is > key.offset (319488), so @new_key.offset is set to
    208896 (@destoff value).

    @new_key.offset (208896) != @last_dest_end (212992), so @drop_start
    is set to 212992 (@last_dest_end value).

    @datal is adjusted to 4096 because @off > @key.offset.

    So in this iteration we call btrfs_replace_file_extents() for the
    invalid range of [212992, 212991] (which is
    [@drop_start, @new_key.offset + @datal - 1]).

    This range is empty, the end offset is smaller than the start offset
    so btrfs_replace_file_extents() returns -EINVAL, which we end up
    returning to user space and fail the reflink operation.

    This all happens because the range of this file extent item was
    already processed in the previous iteration.

This scenario can be triggered very sporadically by fsx from fstests, for
example with test case generic/522.

So fix this by having btrfs_clone() skip file extent items that cover a
file range that we have already processed.

CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/reflink.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index c0f1456be998..7e3b0aa318c1 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -345,6 +345,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 	int ret;
 	const u64 len = olen_aligned;
 	u64 last_dest_end = destoff;
+	u64 prev_extent_end = off;

 	ret = -ENOMEM;
 	buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
@@ -364,7 +365,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 	key.offset = off;

 	while (1) {
-		u64 next_key_min_offset = key.offset + 1;
 		struct btrfs_file_extent_item *extent;
 		u64 extent_gen;
 		int type;
@@ -432,14 +432,21 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 		 * The first search might have left us at an extent item that
 		 * ends before our target range's start, can happen if we have
 		 * holes and NO_HOLES feature enabled.
+		 *
+		 * Subsequent searches may leave us on a file range we have
+		 * processed before - this happens due to a race with ordered
+		 * extent completion for a file range that is outside our source
+		 * range, but that range was part of a file extent item that
+		 * also covered a leading part of our source range.
 		 */
-		if (key.offset + datal <= off) {
+		if (key.offset + datal <= prev_extent_end) {
 			path->slots[0]++;
 			goto process_slot;
 		} else if (key.offset >= off + len) {
 			break;
 		}
-		next_key_min_offset = key.offset + datal;
+
+		prev_extent_end = key.offset + datal;
 		size = btrfs_item_size(leaf, slot);
 		read_extent_buffer(leaf, buf, btrfs_item_ptr_offset(leaf, slot),
 				   size);
@@ -551,7 +558,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 			break;

 		btrfs_release_path(path);
-		key.offset = next_key_min_offset;
+		key.offset = prev_extent_end;

 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents
  2022-06-06  9:41 [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate fdmanana
  2022-06-06  9:41 ` [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion fdmanana
@ 2022-06-06  9:41 ` fdmanana
  2022-06-06 22:11   ` Boris Burkov
  2022-06-06  9:41 ` [PATCH 3/3] btrfs: do not BUG_ON() on failure to migrate space " fdmanana
  2022-06-06 20:45 ` [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate David Sterba
  3 siblings, 1 reply; 10+ messages in thread
From: fdmanana @ 2022-06-06  9:41 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When replacing file extents, called during fallocate, hole punching,
clone and deduplication, we may not be able to replace/drop all the
target file extent items with a single transaction handle. We may get
-ENOSPC while doing it, in which case we release the transaction handle,
balance the dirty pages of the btree inode, flush delayed items and get
a new transaction handle to operate on what's left of the target range.

By dropping and replacing file extent items we have effectively modified
the inode, so we should bump its iversion and update its mtime/ctime
before we update the inode item. This is because if the transaction
we used for partially modifying the inode gets committed by someone after
we release it and before we finish the rest of the range, a power failure
happens, then after mounting the filesystem our inode has an outdated
iversion and mtime/ctime, corresponding to the values it had before we
changed it.

So add the missing iversion and mtime/ctime updates.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/file.c    | 19 +++++++++++++++++++
 fs/btrfs/inode.c   |  1 +
 fs/btrfs/reflink.c |  1 +
 4 files changed, 23 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 55dee1564e90..737cd59d16b6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1330,6 +1330,8 @@ struct btrfs_replace_extent_info {
 	 * existing extent into a file range.
 	 */
 	bool is_new_extent;
+	/* Indicate if we should update the inode's mtime and ctime. */
+	bool update_times;
 	/* Meaningful only if is_new_extent is true. */
 	int qgroup_reserved;
 	/*
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1fd827b99c1b..29de433b7804 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2803,6 +2803,25 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
 			extent_info->file_offset += replace_len;
 		}
 
+		/*
+		 * We are releasing our handle on the transaction, balance the
+		 * dirty pages of the btree inode and flush delayed items, and
+		 * then get a new transaction handle, which may now point to a
+		 * new transaction in case someone else may have committed the
+		 * transaction we used to replace/drop file extent items. So
+		 * bump the inode's iversion and update mtime and ctime except
+		 * if we are called from a dedupe context. This is because a
+		 * power failure/crash may happen after the transaction is
+		 * committed and before we finish replacing/dropping all the
+		 * file extent items we need.
+		 */
+		inode_inc_iversion(&inode->vfs_inode);
+
+		if (!extent_info || extent_info->update_times) {
+			inode->vfs_inode.i_mtime = current_time(&inode->vfs_inode);
+			inode->vfs_inode.i_ctime = inode->vfs_inode.i_mtime;
+		}
+
 		ret = btrfs_update_inode(trans, root, inode);
 		if (ret)
 			break;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ede3e873c2a..ab4ebcb7878c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9907,6 +9907,7 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
 	extent_info.file_offset = file_offset;
 	extent_info.extent_buf = (char *)&stack_fi;
 	extent_info.is_new_extent = true;
+	extent_info.update_times = true;
 	extent_info.qgroup_reserved = qgroup_released;
 	extent_info.insertions = 0;
 
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 7e3b0aa318c1..977e0d218d79 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -497,6 +497,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 			clone_info.file_offset = new_key.offset;
 			clone_info.extent_buf = buf;
 			clone_info.is_new_extent = false;
+			clone_info.update_times = !no_time_update;
 			ret = btrfs_replace_file_extents(BTRFS_I(inode), path,
 					drop_start, new_key.offset + datal - 1,
 					&clone_info, &trans);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/3] btrfs: do not BUG_ON() on failure to migrate space when replacing extents
  2022-06-06  9:41 [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate fdmanana
  2022-06-06  9:41 ` [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion fdmanana
  2022-06-06  9:41 ` [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents fdmanana
@ 2022-06-06  9:41 ` fdmanana
  2022-06-07 16:44   ` Boris Burkov
  2022-06-06 20:45 ` [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate David Sterba
  3 siblings, 1 reply; 10+ messages in thread
From: fdmanana @ 2022-06-06  9:41 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

At btrfs_replace_file_extents(), if we fail to migrate reserved metadata
space from the transaction block reserve into the local block reserve,
we trigger a BUG_ON(). This is because it should not be possible to have
a failure here, as we reserved more space when we started the transaction
than the space we want to migrate. However having a BUG_ON() is way too
drastic, we can perfectly handle the failure and return the error to the
caller. So just do that instead, and add a WARN_ON() to make it easier
to notice the failure if it evers happens (which is particularly useful
for fstests, and the warning will trigger a failure of a test case).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 29de433b7804..da41a0c371bc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2719,7 +2719,8 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
 
 	ret = btrfs_block_rsv_migrate(&fs_info->trans_block_rsv, rsv,
 				      min_size, false);
-	BUG_ON(ret);
+	if (WARN_ON(ret))
+		goto out_trans;
 	trans->block_rsv = rsv;
 
 	cur_offset = start;
@@ -2838,7 +2839,8 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
 
 		ret = btrfs_block_rsv_migrate(&fs_info->trans_block_rsv,
 					      rsv, min_size, false);
-		BUG_ON(ret);	/* shouldn't happen */
+		if (WARN_ON(ret))
+			break;
 		trans->block_rsv = rsv;
 
 		cur_offset = drop_args.drop_end;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate
  2022-06-06  9:41 [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate fdmanana
                   ` (2 preceding siblings ...)
  2022-06-06  9:41 ` [PATCH 3/3] btrfs: do not BUG_ON() on failure to migrate space " fdmanana
@ 2022-06-06 20:45 ` David Sterba
  3 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2022-06-06 20:45 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, Jun 06, 2022 at 10:41:16AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> This patchset fixes a couple of bugs, the first one exclusive for reflink
> operations while the second one applies to reflink, fallocate and hole
> punching operations. Finally the third patch just removes some BUG_ON()s
> that are really not needed.
> All the patches are independent of each other, they could have been sent
> out as individual patches.
> 
> Filipe Manana (3):
>   btrfs: fix race between reflinking and ordered extent completion
>   btrfs: add missing inode updates on each iteration when replacing
>     extents
>   btrfs: do not BUG_ON() on failure to migrate space when replacing
>     extents

Added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion
  2022-06-06  9:41 ` [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion fdmanana
@ 2022-06-06 21:36   ` Boris Burkov
  0 siblings, 0 replies; 10+ messages in thread
From: Boris Burkov @ 2022-06-06 21:36 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, Jun 06, 2022 at 10:41:17AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> While doing a reflink operation, if an ordered extent for a file range
> that does not overlap with the source and destination ranges of the
> reflink operation happens, we can end up having a failure in the reflink
> operation and return -EINVAL to user space.
> 
> The following sequence of steps explains how this can happen:
> 
> 1) We have the page at file offset 315392 dirty (under delalloc);
> 
> 2) A reflink operation for this file starts, using the same file as both
>    source and destination, the source range is [372736, 409600[ (length of
>    36864 bytes) and the destination range is [208896, 245760[;
> 
> 3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
>    and destination ranges, and wait for any ordered extents in those range
>    to complete;
> 
> 4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
>    the inode, but we neither wait for it to complete nor any ordered
>    extents to complete. This results in starting delalloc for the page at
>    file offset 315392 and creating an ordered extent for that single page
>    range;
> 
> 5) We then move to btrfs_clone() and enter the loop to find file extent
>    items to copy from the source range to destination range;
> 
> 6) In the first iteration we end up at last file extent item stored in
>    leaf A:
> 
>    (...)
>    item 131 key (143616 108 315392) itemoff 5101 itemsize 53
>             extent data disk bytenr 1903988736 nr 73728
>             extent data offset 12288 nr 61440 ram 73728
> 
>    This represents the file range [315392, 376832[, which overlaps with
>    the source range to clone.
> 
>    @datal is set to 61440, key.offset is 315392 and @next_key_min_offset
>    is therefore set to 376832 (315392 + 61440).
> 
>    @off (372736) is > key.offset (315392), so @new_key.offset is set to
>    the value of @destoff (208896).
> 
>    @new_key.offset == @last_dest_end (208896) so @drop_start is set to
>    208896 (@new_key.offset).
> 
>    @datal is adjusted to 4096, as @off is > @key.offset.
> 
>    So in this iteration we call btrfs_replace_file_extents() for the range
>    [208896, 212991] (a single page, which is
>    [@drop_start, @new_key.offset + @datal - 1]).
> 
>    @last_dest_end is set to 212992 (@new_key.offset + @datal =
>    208896 + 4096 = 212992).
> 
>    Before the next iteration of the loop, @key.offset is set to the value
>    376832, which is @next_key_min_offset;
> 
> 7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
>    but this time pointing beyond the last slot of leaf A, as that's where
>    a key with offset 376832 should be at if it existed. So end up calling
>    btrfs_next_leaf();
> 
> 8) btrfs_next_leaf() releases the path, but before it searches again the
>    tree for the next key/leaf, the ordered extent for the single page
>    range at file offset 315392 completes. That results in trimming the
>    file extent item we processed before, adjusting its key offset from
>    315392 to 319488, reducing its length from 61440 to 57344 and inserting
>    a new file extent item for that single page range, with a key offset of
>    315392 and a length of 4096.
> 
>    Leaf A now looks like:
> 
>      (...)
>      item 132 key (143616 108 315392) itemoff 4995 itemsize 53
>               extent data disk bytenr 1801666560 nr 4096
>               extent data offset 0 nr 4096 ram 4096
>      item 133 key (143616 108 319488) itemoff 4942 itemsize 53
>               extent data disk bytenr 1903988736 nr 73728
>               extent data offset 16384 nr 57344 ram 73728
> 
> 9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
>    at slot 133, since it's the first key that follows what was the last
>    key we saw (143616 108 315392). In fact it's the same item we processed
>    before, but its key offset was changed, so it counts as a new key;
> 
> 10) So now we have:
> 
>     @key.offset == 319488
>     @datal == 57344
> 
>     @off (372736) is > key.offset (319488), so @new_key.offset is set to
>     208896 (@destoff value).
> 
>     @new_key.offset (208896) != @last_dest_end (212992), so @drop_start
>     is set to 212992 (@last_dest_end value).
> 
>     @datal is adjusted to 4096 because @off > @key.offset.
> 
>     So in this iteration we call btrfs_replace_file_extents() for the
>     invalid range of [212992, 212991] (which is
>     [@drop_start, @new_key.offset + @datal - 1]).
> 
>     This range is empty, the end offset is smaller than the start offset
>     so btrfs_replace_file_extents() returns -EINVAL, which we end up
>     returning to user space and fail the reflink operation.
> 
>     This all happens because the range of this file extent item was
>     already processed in the previous iteration.
> 
> This scenario can be triggered very sporadically by fsx from fstests, for
> example with test case generic/522.
> 
> So fix this by having btrfs_clone() skip file extent items that cover a
> file range that we have already processed.
> 
> CC: stable@vger.kernel.org # 5.10+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Error description and fix make sense, and work on my setup.

I sort of wish it were possible to wait for any io touching extents that
overlap the range in prep, not just for the live ordered_extents that 
overlap the range, but I can't think of how to do it, and skipping
redundant extents is a reasonable fix anyway.

Reviewed-by: Boris Burkov <boris@bur.io>

> ---
>  fs/btrfs/reflink.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index c0f1456be998..7e3b0aa318c1 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -345,6 +345,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  	int ret;
>  	const u64 len = olen_aligned;
>  	u64 last_dest_end = destoff;
> +	u64 prev_extent_end = off;
>  
>  	ret = -ENOMEM;
>  	buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
> @@ -364,7 +365,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  	key.offset = off;
>  
>  	while (1) {
> -		u64 next_key_min_offset = key.offset + 1;
>  		struct btrfs_file_extent_item *extent;
>  		u64 extent_gen;
>  		int type;
> @@ -432,14 +432,21 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  		 * The first search might have left us at an extent item that
>  		 * ends before our target range's start, can happen if we have
>  		 * holes and NO_HOLES feature enabled.
> +		 *
> +		 * Subsequent searches may leave us on a file range we have
> +		 * processed before - this happens due to a race with ordered
> +		 * extent completion for a file range that is outside our source
> +		 * range, but that range was part of a file extent item that
> +		 * also covered a leading part of our source range.
>  		 */
> -		if (key.offset + datal <= off) {
> +		if (key.offset + datal <= prev_extent_end) {
>  			path->slots[0]++;
>  			goto process_slot;
>  		} else if (key.offset >= off + len) {
>  			break;
>  		}
> -		next_key_min_offset = key.offset + datal;
> +
> +		prev_extent_end = key.offset + datal;
>  		size = btrfs_item_size(leaf, slot);
>  		read_extent_buffer(leaf, buf, btrfs_item_ptr_offset(leaf, slot),
>  				   size);
> @@ -551,7 +558,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  			break;
>  
>  		btrfs_release_path(path);
> -		key.offset = next_key_min_offset;
> +		key.offset = prev_extent_end;
>  
>  		if (fatal_signal_pending(current)) {
>  			ret = -EINTR;
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents
  2022-06-06  9:41 ` [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents fdmanana
@ 2022-06-06 22:11   ` Boris Burkov
  2022-06-07  9:31     ` Filipe Manana
  0 siblings, 1 reply; 10+ messages in thread
From: Boris Burkov @ 2022-06-06 22:11 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, Jun 06, 2022 at 10:41:18AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When replacing file extents, called during fallocate, hole punching,
> clone and deduplication, we may not be able to replace/drop all the
> target file extent items with a single transaction handle. We may get
> -ENOSPC while doing it, in which case we release the transaction handle,
> balance the dirty pages of the btree inode, flush delayed items and get
> a new transaction handle to operate on what's left of the target range.
> 
> By dropping and replacing file extent items we have effectively modified

How can you be sure that you definitely modified it? Is it possible for
btrfs_drop_extents to return ENOSPC without dropping extents?

> the inode, so we should bump its iversion and update its mtime/ctime
> before we update the inode item. This is because if the transaction
> we used for partially modifying the inode gets committed by someone after
> we release it and before we finish the rest of the range, a power failure
> happens, then after mounting the filesystem our inode has an outdated
> iversion and mtime/ctime, corresponding to the values it had before we
> changed it.
> 
> So add the missing iversion and mtime/ctime updates.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/ctree.h   |  2 ++
>  fs/btrfs/file.c    | 19 +++++++++++++++++++
>  fs/btrfs/inode.c   |  1 +
>  fs/btrfs/reflink.c |  1 +
>  4 files changed, 23 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 55dee1564e90..737cd59d16b6 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1330,6 +1330,8 @@ struct btrfs_replace_extent_info {
>  	 * existing extent into a file range.
>  	 */
>  	bool is_new_extent;
> +	/* Indicate if we should update the inode's mtime and ctime. */
> +	bool update_times;
>  	/* Meaningful only if is_new_extent is true. */
>  	int qgroup_reserved;
>  	/*
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 1fd827b99c1b..29de433b7804 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2803,6 +2803,25 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
>  			extent_info->file_offset += replace_len;
>  		}
>  
> +		/*
> +		 * We are releasing our handle on the transaction, balance the
> +		 * dirty pages of the btree inode and flush delayed items, and
> +		 * then get a new transaction handle, which may now point to a
> +		 * new transaction in case someone else may have committed the
> +		 * transaction we used to replace/drop file extent items. So
> +		 * bump the inode's iversion and update mtime and ctime except
> +		 * if we are called from a dedupe context. This is because a
> +		 * power failure/crash may happen after the transaction is
> +		 * committed and before we finish replacing/dropping all the
> +		 * file extent items we need.
> +		 */
> +		inode_inc_iversion(&inode->vfs_inode);
> +
> +		if (!extent_info || extent_info->update_times) {
> +			inode->vfs_inode.i_mtime = current_time(&inode->vfs_inode);
> +			inode->vfs_inode.i_ctime = inode->vfs_inode.i_mtime;
> +		}
> +
>  		ret = btrfs_update_inode(trans, root, inode);
>  		if (ret)
>  			break;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 3ede3e873c2a..ab4ebcb7878c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9907,6 +9907,7 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>  	extent_info.file_offset = file_offset;
>  	extent_info.extent_buf = (char *)&stack_fi;
>  	extent_info.is_new_extent = true;
> +	extent_info.update_times = true;
>  	extent_info.qgroup_reserved = qgroup_released;
>  	extent_info.insertions = 0;
>  
> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index 7e3b0aa318c1..977e0d218d79 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -497,6 +497,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  			clone_info.file_offset = new_key.offset;
>  			clone_info.extent_buf = buf;
>  			clone_info.is_new_extent = false;
> +			clone_info.update_times = !no_time_update;
>  			ret = btrfs_replace_file_extents(BTRFS_I(inode), path,
>  					drop_start, new_key.offset + datal - 1,
>  					&clone_info, &trans);
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents
  2022-06-06 22:11   ` Boris Burkov
@ 2022-06-07  9:31     ` Filipe Manana
  2022-06-07 16:41       ` Boris Burkov
  0 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2022-06-07  9:31 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs

On Mon, Jun 06, 2022 at 03:11:52PM -0700, Boris Burkov wrote:
> On Mon, Jun 06, 2022 at 10:41:18AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > When replacing file extents, called during fallocate, hole punching,
> > clone and deduplication, we may not be able to replace/drop all the
> > target file extent items with a single transaction handle. We may get
> > -ENOSPC while doing it, in which case we release the transaction handle,
> > balance the dirty pages of the btree inode, flush delayed items and get
> > a new transaction handle to operate on what's left of the target range.
> > 
> > By dropping and replacing file extent items we have effectively modified
> 
> How can you be sure that you definitely modified it? Is it possible for
> btrfs_drop_extents to return ENOSPC without dropping extents?

If btrfs_drop_extents() fails with -ENOSPC, it means it tried to modify
more than one leaf. Since we reserved enough space for modifying one leaf,
it can only fail if it tries to modify more leaves, and if it tries to do
so, it means it dropped or trimmed file extent items from a leaf already.

> 
> > the inode, so we should bump its iversion and update its mtime/ctime
> > before we update the inode item. This is because if the transaction
> > we used for partially modifying the inode gets committed by someone after
> > we release it and before we finish the rest of the range, a power failure
> > happens, then after mounting the filesystem our inode has an outdated
> > iversion and mtime/ctime, corresponding to the values it had before we
> > changed it.
> > 
> > So add the missing iversion and mtime/ctime updates.
> > 
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/ctree.h   |  2 ++
> >  fs/btrfs/file.c    | 19 +++++++++++++++++++
> >  fs/btrfs/inode.c   |  1 +
> >  fs/btrfs/reflink.c |  1 +
> >  4 files changed, 23 insertions(+)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 55dee1564e90..737cd59d16b6 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1330,6 +1330,8 @@ struct btrfs_replace_extent_info {
> >  	 * existing extent into a file range.
> >  	 */
> >  	bool is_new_extent;
> > +	/* Indicate if we should update the inode's mtime and ctime. */
> > +	bool update_times;
> >  	/* Meaningful only if is_new_extent is true. */
> >  	int qgroup_reserved;
> >  	/*
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 1fd827b99c1b..29de433b7804 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -2803,6 +2803,25 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
> >  			extent_info->file_offset += replace_len;
> >  		}
> >  
> > +		/*
> > +		 * We are releasing our handle on the transaction, balance the
> > +		 * dirty pages of the btree inode and flush delayed items, and
> > +		 * then get a new transaction handle, which may now point to a
> > +		 * new transaction in case someone else may have committed the
> > +		 * transaction we used to replace/drop file extent items. So
> > +		 * bump the inode's iversion and update mtime and ctime except
> > +		 * if we are called from a dedupe context. This is because a
> > +		 * power failure/crash may happen after the transaction is
> > +		 * committed and before we finish replacing/dropping all the
> > +		 * file extent items we need.
> > +		 */
> > +		inode_inc_iversion(&inode->vfs_inode);
> > +
> > +		if (!extent_info || extent_info->update_times) {
> > +			inode->vfs_inode.i_mtime = current_time(&inode->vfs_inode);
> > +			inode->vfs_inode.i_ctime = inode->vfs_inode.i_mtime;
> > +		}
> > +
> >  		ret = btrfs_update_inode(trans, root, inode);
> >  		if (ret)
> >  			break;
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 3ede3e873c2a..ab4ebcb7878c 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -9907,6 +9907,7 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
> >  	extent_info.file_offset = file_offset;
> >  	extent_info.extent_buf = (char *)&stack_fi;
> >  	extent_info.is_new_extent = true;
> > +	extent_info.update_times = true;
> >  	extent_info.qgroup_reserved = qgroup_released;
> >  	extent_info.insertions = 0;
> >  
> > diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> > index 7e3b0aa318c1..977e0d218d79 100644
> > --- a/fs/btrfs/reflink.c
> > +++ b/fs/btrfs/reflink.c
> > @@ -497,6 +497,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
> >  			clone_info.file_offset = new_key.offset;
> >  			clone_info.extent_buf = buf;
> >  			clone_info.is_new_extent = false;
> > +			clone_info.update_times = !no_time_update;
> >  			ret = btrfs_replace_file_extents(BTRFS_I(inode), path,
> >  					drop_start, new_key.offset + datal - 1,
> >  					&clone_info, &trans);
> > -- 
> > 2.35.1
> > 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents
  2022-06-07  9:31     ` Filipe Manana
@ 2022-06-07 16:41       ` Boris Burkov
  0 siblings, 0 replies; 10+ messages in thread
From: Boris Burkov @ 2022-06-07 16:41 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Tue, Jun 07, 2022 at 10:31:39AM +0100, Filipe Manana wrote:
> On Mon, Jun 06, 2022 at 03:11:52PM -0700, Boris Burkov wrote:
> > On Mon, Jun 06, 2022 at 10:41:18AM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > > 
> > > When replacing file extents, called during fallocate, hole punching,
> > > clone and deduplication, we may not be able to replace/drop all the
> > > target file extent items with a single transaction handle. We may get
> > > -ENOSPC while doing it, in which case we release the transaction handle,
> > > balance the dirty pages of the btree inode, flush delayed items and get
> > > a new transaction handle to operate on what's left of the target range.
> > > 
> > > By dropping and replacing file extent items we have effectively modified
> > 
> > How can you be sure that you definitely modified it? Is it possible for
> > btrfs_drop_extents to return ENOSPC without dropping extents?
> 
> If btrfs_drop_extents() fails with -ENOSPC, it means it tried to modify
> more than one leaf. Since we reserved enough space for modifying one leaf,
> it can only fail if it tries to modify more leaves, and if it tries to do
> so, it means it dropped or trimmed file extent items from a leaf already.
> 
> > 
> > > the inode, so we should bump its iversion and update its mtime/ctime
> > > before we update the inode item. This is because if the transaction
> > > we used for partially modifying the inode gets committed by someone after
> > > we release it and before we finish the rest of the range, a power failure
> > > happens, then after mounting the filesystem our inode has an outdated
> > > iversion and mtime/ctime, corresponding to the values it had before we
> > > changed it.
> > > 
> > > So add the missing iversion and mtime/ctime updates.
> > > 
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Boris Burkov <boris@bur.io>

> > > ---
> > >  fs/btrfs/ctree.h   |  2 ++
> > >  fs/btrfs/file.c    | 19 +++++++++++++++++++
> > >  fs/btrfs/inode.c   |  1 +
> > >  fs/btrfs/reflink.c |  1 +
> > >  4 files changed, 23 insertions(+)
> > > 
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > index 55dee1564e90..737cd59d16b6 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
> > > @@ -1330,6 +1330,8 @@ struct btrfs_replace_extent_info {
> > >  	 * existing extent into a file range.
> > >  	 */
> > >  	bool is_new_extent;
> > > +	/* Indicate if we should update the inode's mtime and ctime. */
> > > +	bool update_times;
> > >  	/* Meaningful only if is_new_extent is true. */
> > >  	int qgroup_reserved;
> > >  	/*
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index 1fd827b99c1b..29de433b7804 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -2803,6 +2803,25 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
> > >  			extent_info->file_offset += replace_len;
> > >  		}
> > >  
> > > +		/*
> > > +		 * We are releasing our handle on the transaction, balance the
> > > +		 * dirty pages of the btree inode and flush delayed items, and
> > > +		 * then get a new transaction handle, which may now point to a
> > > +		 * new transaction in case someone else may have committed the
> > > +		 * transaction we used to replace/drop file extent items. So
> > > +		 * bump the inode's iversion and update mtime and ctime except
> > > +		 * if we are called from a dedupe context. This is because a
> > > +		 * power failure/crash may happen after the transaction is
> > > +		 * committed and before we finish replacing/dropping all the
> > > +		 * file extent items we need.
> > > +		 */
> > > +		inode_inc_iversion(&inode->vfs_inode);
> > > +
> > > +		if (!extent_info || extent_info->update_times) {
> > > +			inode->vfs_inode.i_mtime = current_time(&inode->vfs_inode);
> > > +			inode->vfs_inode.i_ctime = inode->vfs_inode.i_mtime;
> > > +		}
> > > +
> > >  		ret = btrfs_update_inode(trans, root, inode);
> > >  		if (ret)
> > >  			break;
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 3ede3e873c2a..ab4ebcb7878c 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -9907,6 +9907,7 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
> > >  	extent_info.file_offset = file_offset;
> > >  	extent_info.extent_buf = (char *)&stack_fi;
> > >  	extent_info.is_new_extent = true;
> > > +	extent_info.update_times = true;
> > >  	extent_info.qgroup_reserved = qgroup_released;
> > >  	extent_info.insertions = 0;
> > >  
> > > diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> > > index 7e3b0aa318c1..977e0d218d79 100644
> > > --- a/fs/btrfs/reflink.c
> > > +++ b/fs/btrfs/reflink.c
> > > @@ -497,6 +497,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
> > >  			clone_info.file_offset = new_key.offset;
> > >  			clone_info.extent_buf = buf;
> > >  			clone_info.is_new_extent = false;
> > > +			clone_info.update_times = !no_time_update;
> > >  			ret = btrfs_replace_file_extents(BTRFS_I(inode), path,
> > >  					drop_start, new_key.offset + datal - 1,
> > >  					&clone_info, &trans);
> > > -- 
> > > 2.35.1
> > > 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/3] btrfs: do not BUG_ON() on failure to migrate space when replacing extents
  2022-06-06  9:41 ` [PATCH 3/3] btrfs: do not BUG_ON() on failure to migrate space " fdmanana
@ 2022-06-07 16:44   ` Boris Burkov
  0 siblings, 0 replies; 10+ messages in thread
From: Boris Burkov @ 2022-06-07 16:44 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, Jun 06, 2022 at 10:41:19AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> At btrfs_replace_file_extents(), if we fail to migrate reserved metadata
> space from the transaction block reserve into the local block reserve,
> we trigger a BUG_ON(). This is because it should not be possible to have
> a failure here, as we reserved more space when we started the transaction
> than the space we want to migrate. However having a BUG_ON() is way too
> drastic, we can perfectly handle the failure and return the error to the
> caller. So just do that instead, and add a WARN_ON() to make it easier
> to notice the failure if it evers happens (which is particularly useful
> for fstests, and the warning will trigger a failure of a test case).
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/file.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 29de433b7804..da41a0c371bc 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2719,7 +2719,8 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
>  
>  	ret = btrfs_block_rsv_migrate(&fs_info->trans_block_rsv, rsv,
>  				      min_size, false);
> -	BUG_ON(ret);
> +	if (WARN_ON(ret))
> +		goto out_trans;
>  	trans->block_rsv = rsv;
>  
>  	cur_offset = start;
> @@ -2838,7 +2839,8 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
>  
>  		ret = btrfs_block_rsv_migrate(&fs_info->trans_block_rsv,
>  					      rsv, min_size, false);
> -		BUG_ON(ret);	/* shouldn't happen */
> +		if (WARN_ON(ret))
> +			break;
>  		trans->block_rsv = rsv;
>  
>  		cur_offset = drop_args.drop_end;
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-06-07 16:45 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-06  9:41 [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate fdmanana
2022-06-06  9:41 ` [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion fdmanana
2022-06-06 21:36   ` Boris Burkov
2022-06-06  9:41 ` [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents fdmanana
2022-06-06 22:11   ` Boris Burkov
2022-06-07  9:31     ` Filipe Manana
2022-06-07 16:41       ` Boris Burkov
2022-06-06  9:41 ` [PATCH 3/3] btrfs: do not BUG_ON() on failure to migrate space " fdmanana
2022-06-07 16:44   ` Boris Burkov
2022-06-06 20:45 ` [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.