All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] btrfs: some fiemap fixes
@ 2024-02-23  8:08 fdmanana
  2024-02-23  8:08 ` [PATCH 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: fdmanana @ 2024-02-23  8:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There's a recent regression with fiemap due to a fix for a deadlock between
fiemap and memory mapped writes when the fiemap buffer is memory mapped to
the same file range, which leads to a race triggering a warning and making
fiemap fail. Plus one more long standing race when using FIEMAP_FLAG_SYNC.
Details in the change logs.

Filipe Manana (2):
  btrfs: fix race between ordered extent completion and fiemap
  btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given

 fs/btrfs/extent_io.c | 50 +++++++++++++++++++++++++++-----------------
 fs/btrfs/inode.c     | 22 ++++++++++++++++++-
 2 files changed, 52 insertions(+), 20 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] btrfs: fix race between ordered extent completion and fiemap
  2024-02-23  8:08 [PATCH 0/2] btrfs: some fiemap fixes fdmanana
@ 2024-02-23  8:08 ` fdmanana
  2024-02-23  8:08 ` [PATCH 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: fdmanana @ 2024-02-23  8:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).

However by not locking the target extent range for the whole duration of
the fiemap call we can race with an ordered extent. This happens like
this:

1) The fiemap task finishes processing a file extent item that covers
   the file range [512K, 1M[, and that file extent item is the last item
   in the leaf currently being processed;

2) And ordered extent for the file range [768K, 2M[, in COW mode,
   completes (btrfs_finish_one_ordered()) and the file extent item
   covering the range [512K, 1M[ is trimmed to cover the range
   [512K, 768K[ and then a new file extent item for the range [768K, 2M[
   is inserted in the inode's subvolume tree;

3) The fiemap task calls fiemap_next_leaf_item(), which then calls
   btrfs_next_leaf() to find the next leaf / item. This finds that the
   the next key following the one we previously processed (its type is
   BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding
   to the new file extent item inserted by the ordered extent, which has
   a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;

4) Later the fiemap code ends up at emit_fiemap_extent() and triggers
   the warning:

      if (cache->offset + cache->len > offset) {
               WARN_ON(1);
               return -EINVAL;
      }

   Since we get 1M > 768K, because the previously emitted entry for the
   old extent covering the file range [512K, 1M[ ends at an offset that
   is greater than the new extent's start offset (768K). This make fiemap
   fail with -EINVAL besides triggering the warning that produces a stack
   trace like the following:

     [ 1621.677651] ------------[ cut here ]------------
     [ 1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.677899] Modules linked in: btrfs blake2b_generic (...)
     [ 1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
     [ 1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
     [ 1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678033] Code: 2b 4c 89 63 (...)
     [ 1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
     [ 1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
     [ 1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
     [ 1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
     [ 1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
     [ 1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
     [ 1621.678044] FS:  00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
     [ 1621.678046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     [ 1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
     [ 1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     [ 1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     [ 1621.678056] Call Trace:
     [ 1621.678074]  <TASK>
     [ 1621.678076]  ? __warn+0x80/0x130
     [ 1621.678082]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678159]  ? report_bug+0x1f4/0x200
     [ 1621.678164]  ? handle_bug+0x42/0x70
     [ 1621.678167]  ? exc_invalid_op+0x14/0x70
     [ 1621.678170]  ? asm_exc_invalid_op+0x16/0x20
     [ 1621.678178]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678253]  extent_fiemap+0x766/0xa30 [btrfs]
     [ 1621.678339]  btrfs_fiemap+0x45/0x80 [btrfs]
     [ 1621.678420]  do_vfs_ioctl+0x1e4/0x870
     [ 1621.678431]  __x64_sys_ioctl+0x6a/0xc0
     [ 1621.678434]  do_syscall_64+0x52/0x120
     [ 1621.678445]  entry_SYSCALL_64_after_hwframe+0x6e/0x76

So to fix this by changing emit_fiemap_extent() to adjust the length of
the previously cached extent so that it does not overlap with the current
extent, and emit the previous one. This makes everything consistent by
reflecting the current state of file extent items in the btree and
without emitting extents that have overlapping ranges.

Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e2dbadfc082f..eab23cb2bb93 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2482,15 +2482,31 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 		goto assign;
 
 	/*
-	 * Sanity check, extent_fiemap() should have ensured that new
-	 * fiemap extent won't overlap with cached one.
-	 * Not recoverable.
+	 * When iterating the extents of the inode, at extent_fiemap(), we may
+	 * find an extent that starts at an offset behind the end offset of the
+	 * previous extent we processed. This happens if fiemap is called
+	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
+	 * while we call btrfs_next_leaf() (through fiemap_next_leaf_item()).
 	 *
-	 * NOTE: Physical address can overlap, due to compression
+	 * For example we are in leaf X processing its last item, which is the
+	 * file extent item for file range [512K, 1M[, and after
+	 * btrfs_next_leaf() releases the path, there's an ordered extent that
+	 * completes for the file range [768K, 2M[, and that results in trimming
+	 * the file extent item so that it now corresponds to the file range
+	 * [512K, 768K[ and a new file extent item is inserted for the file
+	 * range [768K, 2M[, which may end up as the last item of leaf X or as
+	 * the first item of the next leaf - in either case btrfs_next_leaf()
+	 * will leave us with a path pointing to the new extent item, for the
+	 * file range [768K, 2M[, since that's the first key that follows the
+	 * last one we processed. So in order not to report overlapping extents
+	 * to user space, we trim the length of the previously cached extent and
+	 * emit it.
 	 */
 	if (cache->offset + cache->len > offset) {
-		WARN_ON(1);
-		return -EINVAL;
+		if (WARN_ON(offset <= cache->offset))
+			return -EINVAL;
+		cache->len = offset - cache->offset;
+		goto emit;
 	}
 
 	/*
@@ -2510,6 +2526,7 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 		return 0;
 	}
 
+emit:
 	/* Not mergeable, need to submit cached one */
 	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
 				      cache->len, cache->flags);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
  2024-02-23  8:08 [PATCH 0/2] btrfs: some fiemap fixes fdmanana
  2024-02-23  8:08 ` [PATCH 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
@ 2024-02-23  8:08 ` fdmanana
  2024-02-23 15:19 ` [PATCH v2 0/2] btrfs: some fiemap fixes fdmanana
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: fdmanana @ 2024-02-23  8:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.

When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 21 ++++++++-------------
 fs/btrfs/inode.c     | 22 +++++++++++++++++++++-
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index eab23cb2bb93..a4f6a3656862 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2917,17 +2917,15 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	range_end = round_up(start + len, sectorsize);
 	prev_extent_end = range_start;
 
-	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-
 	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
 	if (ret < 0)
-		goto out_unlock;
+		goto out;
 	btrfs_release_path(path);
 
 	path->reada = READA_FORWARD;
 	ret = fiemap_search_slot(inode, path, range_start);
 	if (ret < 0) {
-		goto out_unlock;
+		goto out;
 	} else if (ret > 0) {
 		/*
 		 * No file extent item found, but we may have delalloc between
@@ -2974,7 +2972,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 						  backref_ctx, 0, 0, 0,
 						  prev_extent_end, hole_end);
 			if (ret < 0) {
-				goto out_unlock;
+				goto out;
 			} else if (ret > 0) {
 				/* fiemap_fill_next_extent() told us to stop. */
 				stopped = true;
@@ -3030,7 +3028,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 								  extent_gen,
 								  backref_ctx);
 				if (ret < 0)
-					goto out_unlock;
+					goto out;
 				else if (ret > 0)
 					flags |= FIEMAP_EXTENT_SHARED;
 			}
@@ -3041,7 +3039,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 
 		if (ret < 0) {
-			goto out_unlock;
+			goto out;
 		} else if (ret > 0) {
 			/* fiemap_fill_next_extent() told us to stop. */
 			stopped = true;
@@ -3052,12 +3050,12 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 next_item:
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-			goto out_unlock;
+			goto out;
 		}
 
 		ret = fiemap_next_leaf_item(inode, path);
 		if (ret < 0) {
-			goto out_unlock;
+			goto out;
 		} else if (ret > 0) {
 			/* No more file extent items for this inode. */
 			break;
@@ -3081,7 +3079,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 					  &delalloc_cached_state, backref_ctx,
 					  0, 0, 0, prev_extent_end, range_end - 1);
 		if (ret < 0)
-			goto out_unlock;
+			goto out;
 		prev_extent_end = range_end;
 	}
 
@@ -3119,9 +3117,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	}
 
 	ret = emit_last_fiemap_cache(fieinfo, &cache);
-
-out_unlock:
-	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 out:
 	free_extent_state(delalloc_cached_state);
 	btrfs_free_backref_share_ctx(backref_ctx);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f93cb23ae1ee..cb23b3834c3d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7869,6 +7869,7 @@ struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			u64 start, u64 len)
 {
+	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
 	int	ret;
 
 	ret = fiemap_prep(inode, fieinfo, start, &len, 0);
@@ -7894,7 +7895,26 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			return ret;
 	}
 
-	return extent_fiemap(BTRFS_I(inode), fieinfo, start, len);
+	btrfs_inode_lock(btrfs_inode, BTRFS_ILOCK_SHARED);
+
+	/*
+	 * We did an initial flush to avoid holding the inode's lock while
+	 * triggering writeback and waiting for the completion of IO and ordered
+	 * extents. Now after we locked the inode we do it again, because it's
+	 * possible a new write may have happened in between those two steps.
+	 */
+	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
+		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
+		if (ret) {
+			btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
+			return ret;
+		}
+	}
+
+	ret = extent_fiemap(btrfs_inode, fieinfo, start, len);
+	btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
+
+	return ret;
 }
 
 static int btrfs_writepages(struct address_space *mapping,
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 0/2] btrfs: some fiemap fixes
  2024-02-23  8:08 [PATCH 0/2] btrfs: some fiemap fixes fdmanana
  2024-02-23  8:08 ` [PATCH 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
  2024-02-23  8:08 ` [PATCH 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
@ 2024-02-23 15:19 ` fdmanana
  2024-02-23 15:19   ` [PATCH v2 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
  2024-02-23 15:19   ` [PATCH v2 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
  2024-02-23 23:47 ` [PATCH v3 0/2] btrfs: some fiemap fixes fdmanana
  2024-02-25 19:51 ` [PATCH v4 0/2] btrfs: some fiemap fixes fdmanana
  4 siblings, 2 replies; 13+ messages in thread
From: fdmanana @ 2024-02-23 15:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There's a recent regression with fiemap due to a fix for a deadlock between
fiemap and memory mapped writes when the fiemap buffer is memory mapped to
the same file range, which leads to a race triggering a warning and making
fiemap fail. Plus one more long standing race when using FIEMAP_FLAG_SYNC.
Details in the change logs.

V2: Updated patch 1/2 to deal with the case of a hole/prealloc extent
    with multiple delalloc ranges inside it.

Filipe Manana (2):
  btrfs: fix race between ordered extent completion and fiemap
  btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given

 fs/btrfs/extent_io.c | 60 +++++++++++++++++++++++++++++++-------------
 fs/btrfs/inode.c     | 22 +++++++++++++++-
 2 files changed, 64 insertions(+), 18 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/2] btrfs: fix race between ordered extent completion and fiemap
  2024-02-23 15:19 ` [PATCH v2 0/2] btrfs: some fiemap fixes fdmanana
@ 2024-02-23 15:19   ` fdmanana
  2024-02-23 15:19   ` [PATCH v2 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
  1 sibling, 0 replies; 13+ messages in thread
From: fdmanana @ 2024-02-23 15:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).

However by not locking the target extent range for the whole duration of
the fiemap call we can race with an ordered extent. This happens like
this:

1) The fiemap task finishes processing a file extent item that covers
   the file range [512K, 1M[, and that file extent item is the last item
   in the leaf currently being processed;

2) And ordered extent for the file range [768K, 2M[, in COW mode,
   completes (btrfs_finish_one_ordered()) and the file extent item
   covering the range [512K, 1M[ is trimmed to cover the range
   [512K, 768K[ and then a new file extent item for the range [768K, 2M[
   is inserted in the inode's subvolume tree;

3) The fiemap task calls fiemap_next_leaf_item(), which then calls
   btrfs_next_leaf() to find the next leaf / item. This finds that the
   the next key following the one we previously processed (its type is
   BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding
   to the new file extent item inserted by the ordered extent, which has
   a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;

4) Later the fiemap code ends up at emit_fiemap_extent() and triggers
   the warning:

      if (cache->offset + cache->len > offset) {
               WARN_ON(1);
               return -EINVAL;
      }

   Since we get 1M > 768K, because the previously emitted entry for the
   old extent covering the file range [512K, 1M[ ends at an offset that
   is greater than the new extent's start offset (768K). This make fiemap
   fail with -EINVAL besides triggering the warning that produces a stack
   trace like the following:

     [ 1621.677651] ------------[ cut here ]------------
     [ 1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.677899] Modules linked in: btrfs blake2b_generic (...)
     [ 1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
     [ 1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
     [ 1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678033] Code: 2b 4c 89 63 (...)
     [ 1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
     [ 1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
     [ 1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
     [ 1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
     [ 1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
     [ 1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
     [ 1621.678044] FS:  00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
     [ 1621.678046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     [ 1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
     [ 1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     [ 1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     [ 1621.678056] Call Trace:
     [ 1621.678074]  <TASK>
     [ 1621.678076]  ? __warn+0x80/0x130
     [ 1621.678082]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678159]  ? report_bug+0x1f4/0x200
     [ 1621.678164]  ? handle_bug+0x42/0x70
     [ 1621.678167]  ? exc_invalid_op+0x14/0x70
     [ 1621.678170]  ? asm_exc_invalid_op+0x16/0x20
     [ 1621.678178]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678253]  extent_fiemap+0x766/0xa30 [btrfs]
     [ 1621.678339]  btrfs_fiemap+0x45/0x80 [btrfs]
     [ 1621.678420]  do_vfs_ioctl+0x1e4/0x870
     [ 1621.678431]  __x64_sys_ioctl+0x6a/0xc0
     [ 1621.678434]  do_syscall_64+0x52/0x120
     [ 1621.678445]  entry_SYSCALL_64_after_hwframe+0x6e/0x76

There's also another case where we before calling btrfs_next_leaf() we
processing a hole or a prealloc extent and we had several delalloc ranges
within that hole or prealloc extent. In that case if the ordered extents
complete before we find the next key, we may end up finding an extent item
with an offset than is smaller than offset we have in cache->offset.

So fix this by changing emit_fiemap_extent() to address these two
scenarios like this:

1) For the first case, steps listed above, adjust the length of the
   previously cached extent so that it does not overlap with the current
   extent, and emit the previous one. This makes everything consistent by
   reflecting the current state of file extent items in the btree and
   without emitting extents that have overlapping ranges;

2) For the second case where he had a hole or prealloc extent with
   multiple delalloc ranges inside the hole or prealloc extent's range,
   just discard what we have in the fiemap cache and assign the current
   file extent item to the cache. This also makes everything consistent
   by reflecting the current state of file extent items in the btree and
   without emitting extents that have overlapping ranges.

This issue could be triggered often with test case generic/561, and was
also hit and reported by Wang Yugui.

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 43496a07ee42..4ae2076756f6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2487,13 +2487,43 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 		goto assign;
 
 	/*
-	 * Sanity check, extent_fiemap() should have ensured that new
-	 * fiemap extent won't overlap with cached one.
-	 * Not recoverable.
+	 * When iterating the extents of the inode, at extent_fiemap(), we may
+	 * find an extent that starts at an offset behind the end offset of the
+	 * previous extent we processed. This happens if fiemap is called
+	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
+	 * while we call btrfs_next_leaf() (through fiemap_next_leaf_item()).
 	 *
-	 * NOTE: Physical address can overlap, due to compression
+	 * For example we are in leaf X processing its last item, which is the
+	 * file extent item for file range [512K, 1M[, and after
+	 * btrfs_next_leaf() releases the path, there's an ordered extent that
+	 * completes for the file range [768K, 2M[, and that results in trimming
+	 * the file extent item so that it now corresponds to the file range
+	 * [512K, 768K[ and a new file extent item is inserted for the file
+	 * range [768K, 2M[, which may end up as the last item of leaf X or as
+	 * the first item of the next leaf - in either case btrfs_next_leaf()
+	 * will leave us with a path pointing to the new extent item, for the
+	 * file range [768K, 2M[, since that's the first key that follows the
+	 * last one we processed. So in order not to report overlapping extents
+	 * to user space, we trim the length of the previously cached extent and
+	 * emit it.
+	 *
+	 * Upon calling btrfs_next_leaf() we may also find an extent with an
+	 * offset smaller than cache->offset, and this happens when we had a
+	 * hole or prealloc extent with several delalloc subranges, but after
+	 * btrfs_next_leaf() released the path, delalloc completed as well as
+	 * the ordered extents and we're now with a smaller offset. In this
+	 * case just discard what we have in the cache and start from this
+	 * current extent.
 	 */
 	if (cache->offset + cache->len > offset) {
+		if (offset < cache->offset) {
+			goto assign;
+		} else if (offset > cache->offset) {
+			cache->len = offset - cache->offset;
+			goto emit;
+		}
+
+		/* cache->offset == offset, should never happen. */
 		WARN_ON(1);
 		return -EINVAL;
 	}
@@ -2515,6 +2545,7 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 		return 0;
 	}
 
+emit:
 	/* Not mergeable, need to submit cached one */
 	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
 				      cache->len, cache->flags);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
  2024-02-23 15:19 ` [PATCH v2 0/2] btrfs: some fiemap fixes fdmanana
  2024-02-23 15:19   ` [PATCH v2 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
@ 2024-02-23 15:19   ` fdmanana
  1 sibling, 0 replies; 13+ messages in thread
From: fdmanana @ 2024-02-23 15:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.

When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 21 ++++++++-------------
 fs/btrfs/inode.c     | 22 +++++++++++++++++++++-
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4ae2076756f6..72b9b005e20a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2936,17 +2936,15 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	range_end = round_up(start + len, sectorsize);
 	prev_extent_end = range_start;
 
-	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-
 	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
 	if (ret < 0)
-		goto out_unlock;
+		goto out;
 	btrfs_release_path(path);
 
 	path->reada = READA_FORWARD;
 	ret = fiemap_search_slot(inode, path, range_start);
 	if (ret < 0) {
-		goto out_unlock;
+		goto out;
 	} else if (ret > 0) {
 		/*
 		 * No file extent item found, but we may have delalloc between
@@ -2993,7 +2991,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 						  backref_ctx, 0, 0, 0,
 						  prev_extent_end, hole_end);
 			if (ret < 0) {
-				goto out_unlock;
+				goto out;
 			} else if (ret > 0) {
 				/* fiemap_fill_next_extent() told us to stop. */
 				stopped = true;
@@ -3049,7 +3047,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 								  extent_gen,
 								  backref_ctx);
 				if (ret < 0)
-					goto out_unlock;
+					goto out;
 				else if (ret > 0)
 					flags |= FIEMAP_EXTENT_SHARED;
 			}
@@ -3060,7 +3058,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 
 		if (ret < 0) {
-			goto out_unlock;
+			goto out;
 		} else if (ret > 0) {
 			/* fiemap_fill_next_extent() told us to stop. */
 			stopped = true;
@@ -3071,12 +3069,12 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 next_item:
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-			goto out_unlock;
+			goto out;
 		}
 
 		ret = fiemap_next_leaf_item(inode, path);
 		if (ret < 0) {
-			goto out_unlock;
+			goto out;
 		} else if (ret > 0) {
 			/* No more file extent items for this inode. */
 			break;
@@ -3100,7 +3098,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 					  &delalloc_cached_state, backref_ctx,
 					  0, 0, 0, prev_extent_end, range_end - 1);
 		if (ret < 0)
-			goto out_unlock;
+			goto out;
 		prev_extent_end = range_end;
 	}
 
@@ -3138,9 +3136,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	}
 
 	ret = emit_last_fiemap_cache(fieinfo, &cache);
-
-out_unlock:
-	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 out:
 	free_extent_state(delalloc_cached_state);
 	btrfs_free_backref_share_ctx(backref_ctx);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d931cb40fb7a..904fff3d72f5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7865,6 +7865,7 @@ struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			u64 start, u64 len)
 {
+	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
 	int	ret;
 
 	ret = fiemap_prep(inode, fieinfo, start, &len, 0);
@@ -7890,7 +7891,26 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			return ret;
 	}
 
-	return extent_fiemap(BTRFS_I(inode), fieinfo, start, len);
+	btrfs_inode_lock(btrfs_inode, BTRFS_ILOCK_SHARED);
+
+	/*
+	 * We did an initial flush to avoid holding the inode's lock while
+	 * triggering writeback and waiting for the completion of IO and ordered
+	 * extents. Now after we locked the inode we do it again, because it's
+	 * possible a new write may have happened in between those two steps.
+	 */
+	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
+		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
+		if (ret) {
+			btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
+			return ret;
+		}
+	}
+
+	ret = extent_fiemap(btrfs_inode, fieinfo, start, len);
+	btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
+
+	return ret;
 }
 
 static int btrfs_writepages(struct address_space *mapping,
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 0/2] btrfs: some fiemap fixes
  2024-02-23  8:08 [PATCH 0/2] btrfs: some fiemap fixes fdmanana
                   ` (2 preceding siblings ...)
  2024-02-23 15:19 ` [PATCH v2 0/2] btrfs: some fiemap fixes fdmanana
@ 2024-02-23 23:47 ` fdmanana
  2024-02-23 23:47   ` [PATCH v3 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
  2024-02-23 23:47   ` [PATCH v3 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
  2024-02-25 19:51 ` [PATCH v4 0/2] btrfs: some fiemap fixes fdmanana
  4 siblings, 2 replies; 13+ messages in thread
From: fdmanana @ 2024-02-23 23:47 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There's a recent regression with fiemap due to a fix for a deadlock between
fiemap and memory mapped writes when the fiemap buffer is memory mapped to
the same file range, which leads to a race triggering a warning and making
fiemap fail. Plus one more long standing race when using FIEMAP_FLAG_SYNC.
Details in the change logs.

V3: Deal with the case where offset == cache->offset which is also
    possible if we had delalloc in the range of a hole or prealloc extent.

V2: Updated patch 1/2 to deal with the case of a hole/prealloc extent
    with multiple delalloc ranges inside it.

Filipe Manana (2):
  btrfs: fix race between ordered extent completion and fiemap
  btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given

 fs/btrfs/extent_io.c | 59 ++++++++++++++++++++++++++++++--------------
 fs/btrfs/inode.c     | 22 ++++++++++++++++-
 2 files changed, 61 insertions(+), 20 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 1/2] btrfs: fix race between ordered extent completion and fiemap
  2024-02-23 23:47 ` [PATCH v3 0/2] btrfs: some fiemap fixes fdmanana
@ 2024-02-23 23:47   ` fdmanana
  2024-02-23 23:47   ` [PATCH v3 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
  1 sibling, 0 replies; 13+ messages in thread
From: fdmanana @ 2024-02-23 23:47 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).

However by not locking the target extent range for the whole duration of
the fiemap call we can race with an ordered extent. This happens like
this:

1) The fiemap task finishes processing a file extent item that covers
   the file range [512K, 1M[, and that file extent item is the last item
   in the leaf currently being processed;

2) And ordered extent for the file range [768K, 2M[, in COW mode,
   completes (btrfs_finish_one_ordered()) and the file extent item
   covering the range [512K, 1M[ is trimmed to cover the range
   [512K, 768K[ and then a new file extent item for the range [768K, 2M[
   is inserted in the inode's subvolume tree;

3) The fiemap task calls fiemap_next_leaf_item(), which then calls
   btrfs_next_leaf() to find the next leaf / item. This finds that the
   the next key following the one we previously processed (its type is
   BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding
   to the new file extent item inserted by the ordered extent, which has
   a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;

4) Later the fiemap code ends up at emit_fiemap_extent() and triggers
   the warning:

      if (cache->offset + cache->len > offset) {
               WARN_ON(1);
               return -EINVAL;
      }

   Since we get 1M > 768K, because the previously emitted entry for the
   old extent covering the file range [512K, 1M[ ends at an offset that
   is greater than the new extent's start offset (768K). This makes fiemap
   fail with -EINVAL besides triggering the warning that produces a stack
   trace like the following:

     [ 1621.677651] ------------[ cut here ]------------
     [ 1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.677899] Modules linked in: btrfs blake2b_generic (...)
     [ 1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
     [ 1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
     [ 1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678033] Code: 2b 4c 89 63 (...)
     [ 1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
     [ 1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
     [ 1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
     [ 1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
     [ 1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
     [ 1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
     [ 1621.678044] FS:  00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
     [ 1621.678046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     [ 1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
     [ 1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     [ 1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     [ 1621.678056] Call Trace:
     [ 1621.678074]  <TASK>
     [ 1621.678076]  ? __warn+0x80/0x130
     [ 1621.678082]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678159]  ? report_bug+0x1f4/0x200
     [ 1621.678164]  ? handle_bug+0x42/0x70
     [ 1621.678167]  ? exc_invalid_op+0x14/0x70
     [ 1621.678170]  ? asm_exc_invalid_op+0x16/0x20
     [ 1621.678178]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678253]  extent_fiemap+0x766/0xa30 [btrfs]
     [ 1621.678339]  btrfs_fiemap+0x45/0x80 [btrfs]
     [ 1621.678420]  do_vfs_ioctl+0x1e4/0x870
     [ 1621.678431]  __x64_sys_ioctl+0x6a/0xc0
     [ 1621.678434]  do_syscall_64+0x52/0x120
     [ 1621.678445]  entry_SYSCALL_64_after_hwframe+0x6e/0x76

There's also another case where before calling btrfs_next_leaf() we are
processing a hole or a prealloc extent and we had several delalloc ranges
within that hole or prealloc extent. In that case if the ordered extents
complete before we find the next key, we may end up finding an extent item
with an offset smaller than (or equals to) the offset in cache->offset.

So fix this by changing emit_fiemap_extent() to address these two
scenarios like this:

1) For the first case, steps listed above, adjust the length of the
   previously cached extent so that it does not overlap with the current
   extent, and emit the previous one. This makes everything consistent by
   reflecting the current state of file extent items in the btree and
   without emitting extents that have overlapping ranges;

2) For the second case where he had a hole or prealloc extent with
   multiple delalloc ranges inside the hole or prealloc extent's range,
   just discard what we have in the fiemap cache and assign the current
   file extent item to the cache. This also makes everything consistent
   by reflecting the current state of file extent items in the btree and
   without emitting extents that have overlapping ranges.

This issue could be triggered often with test case generic/561, and was
also hit and reported by Wang Yugui.

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 38 ++++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 43496a07ee42..583e3f98d951 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2487,15 +2487,40 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 		goto assign;
 
 	/*
-	 * Sanity check, extent_fiemap() should have ensured that new
-	 * fiemap extent won't overlap with cached one.
-	 * Not recoverable.
+	 * When iterating the extents of the inode, at extent_fiemap(), we may
+	 * find an extent that starts at an offset behind the end offset of the
+	 * previous extent we processed. This happens if fiemap is called
+	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
+	 * while we call btrfs_next_leaf() (through fiemap_next_leaf_item()).
 	 *
-	 * NOTE: Physical address can overlap, due to compression
+	 * For example we are in leaf X processing its last item, which is the
+	 * file extent item for file range [512K, 1M[, and after
+	 * btrfs_next_leaf() releases the path, there's an ordered extent that
+	 * completes for the file range [768K, 2M[, and that results in trimming
+	 * the file extent item so that it now corresponds to the file range
+	 * [512K, 768K[ and a new file extent item is inserted for the file
+	 * range [768K, 2M[, which may end up as the last item of leaf X or as
+	 * the first item of the next leaf - in either case btrfs_next_leaf()
+	 * will leave us with a path pointing to the new extent item, for the
+	 * file range [768K, 2M[, since that's the first key that follows the
+	 * last one we processed. So in order not to report overlapping extents
+	 * to user space, we trim the length of the previously cached extent and
+	 * emit it.
+	 *
+	 * Upon calling btrfs_next_leaf() we may also find an extent with an
+	 * offset smaller than (or equals to) cache->offset, and this happens
+	 * when we had a hole or prealloc extent with several delalloc
+	 * subranges, but after btrfs_next_leaf() released the path, delalloc
+	 * completed as well as the ordered extents and we're now with a smaller
+	 * offset. In this case just discard what we have in the cache and start
+	 * from this current extent.
 	 */
 	if (cache->offset + cache->len > offset) {
-		WARN_ON(1);
-		return -EINVAL;
+		if (offset <= cache->offset)
+			goto assign;
+
+		cache->len = offset - cache->offset;
+		goto emit;
 	}
 
 	/*
@@ -2515,6 +2540,7 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 		return 0;
 	}
 
+emit:
 	/* Not mergeable, need to submit cached one */
 	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
 				      cache->len, cache->flags);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
  2024-02-23 23:47 ` [PATCH v3 0/2] btrfs: some fiemap fixes fdmanana
  2024-02-23 23:47   ` [PATCH v3 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
@ 2024-02-23 23:47   ` fdmanana
  1 sibling, 0 replies; 13+ messages in thread
From: fdmanana @ 2024-02-23 23:47 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.

When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 21 ++++++++-------------
 fs/btrfs/inode.c     | 22 +++++++++++++++++++++-
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 583e3f98d951..e18f566feb16 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2931,17 +2931,15 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	range_end = round_up(start + len, sectorsize);
 	prev_extent_end = range_start;
 
-	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-
 	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
 	if (ret < 0)
-		goto out_unlock;
+		goto out;
 	btrfs_release_path(path);
 
 	path->reada = READA_FORWARD;
 	ret = fiemap_search_slot(inode, path, range_start);
 	if (ret < 0) {
-		goto out_unlock;
+		goto out;
 	} else if (ret > 0) {
 		/*
 		 * No file extent item found, but we may have delalloc between
@@ -2988,7 +2986,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 						  backref_ctx, 0, 0, 0,
 						  prev_extent_end, hole_end);
 			if (ret < 0) {
-				goto out_unlock;
+				goto out;
 			} else if (ret > 0) {
 				/* fiemap_fill_next_extent() told us to stop. */
 				stopped = true;
@@ -3044,7 +3042,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 								  extent_gen,
 								  backref_ctx);
 				if (ret < 0)
-					goto out_unlock;
+					goto out;
 				else if (ret > 0)
 					flags |= FIEMAP_EXTENT_SHARED;
 			}
@@ -3055,7 +3053,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 
 		if (ret < 0) {
-			goto out_unlock;
+			goto out;
 		} else if (ret > 0) {
 			/* fiemap_fill_next_extent() told us to stop. */
 			stopped = true;
@@ -3066,12 +3064,12 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 next_item:
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-			goto out_unlock;
+			goto out;
 		}
 
 		ret = fiemap_next_leaf_item(inode, path);
 		if (ret < 0) {
-			goto out_unlock;
+			goto out;
 		} else if (ret > 0) {
 			/* No more file extent items for this inode. */
 			break;
@@ -3095,7 +3093,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 					  &delalloc_cached_state, backref_ctx,
 					  0, 0, 0, prev_extent_end, range_end - 1);
 		if (ret < 0)
-			goto out_unlock;
+			goto out;
 		prev_extent_end = range_end;
 	}
 
@@ -3133,9 +3131,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	}
 
 	ret = emit_last_fiemap_cache(fieinfo, &cache);
-
-out_unlock:
-	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 out:
 	free_extent_state(delalloc_cached_state);
 	btrfs_free_backref_share_ctx(backref_ctx);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d931cb40fb7a..904fff3d72f5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7865,6 +7865,7 @@ struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			u64 start, u64 len)
 {
+	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
 	int	ret;
 
 	ret = fiemap_prep(inode, fieinfo, start, &len, 0);
@@ -7890,7 +7891,26 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			return ret;
 	}
 
-	return extent_fiemap(BTRFS_I(inode), fieinfo, start, len);
+	btrfs_inode_lock(btrfs_inode, BTRFS_ILOCK_SHARED);
+
+	/*
+	 * We did an initial flush to avoid holding the inode's lock while
+	 * triggering writeback and waiting for the completion of IO and ordered
+	 * extents. Now after we locked the inode we do it again, because it's
+	 * possible a new write may have happened in between those two steps.
+	 */
+	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
+		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
+		if (ret) {
+			btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
+			return ret;
+		}
+	}
+
+	ret = extent_fiemap(btrfs_inode, fieinfo, start, len);
+	btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
+
+	return ret;
 }
 
 static int btrfs_writepages(struct address_space *mapping,
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v4 0/2] btrfs: some fiemap fixes
  2024-02-23  8:08 [PATCH 0/2] btrfs: some fiemap fixes fdmanana
                   ` (3 preceding siblings ...)
  2024-02-23 23:47 ` [PATCH v3 0/2] btrfs: some fiemap fixes fdmanana
@ 2024-02-25 19:51 ` fdmanana
  2024-02-25 19:51   ` [PATCH v4 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
                     ` (2 more replies)
  4 siblings, 3 replies; 13+ messages in thread
From: fdmanana @ 2024-02-25 19:51 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There's a recent regression with fiemap due to a fix for a deadlock between
fiemap and memory mapped writes when the fiemap buffer is memory mapped to
the same file range, which leads to a race triggering a warning and making
fiemap fail. Plus one more long standing race when using FIEMAP_FLAG_SYNC.
Details in the change logs.

V4: Updated patch 1/2, added a lot more comments about that's going on,
    how each case is dealt with and why, added a missing handling for
    a delalloc case that could result in emmiting overlapping ranges.

V3: Deal with the case where offset == cache->offset which is also
    possible if we had delalloc in the range of a hole or prealloc extent.

V2: Updated patch 1/2 to deal with the case of a hole/prealloc extent
    with multiple delalloc ranges inside it.

Filipe Manana (2):
  btrfs: fix race between ordered extent completion and fiemap
  btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given

 fs/btrfs/extent_io.c | 124 ++++++++++++++++++++++++++++++++++++-------
 fs/btrfs/inode.c     |  22 +++++++-
 2 files changed, 125 insertions(+), 21 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v4 1/2] btrfs: fix race between ordered extent completion and fiemap
  2024-02-25 19:51 ` [PATCH v4 0/2] btrfs: some fiemap fixes fdmanana
@ 2024-02-25 19:51   ` fdmanana
  2024-02-25 19:51   ` [PATCH v4 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
  2024-02-26 16:58   ` [PATCH v4 0/2] btrfs: some fiemap fixes Josef Bacik
  2 siblings, 0 replies; 13+ messages in thread
From: fdmanana @ 2024-02-25 19:51 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).

However by not locking the target extent range for the whole duration of
the fiemap call we can race with an ordered extent. This happens like
this:

1) The fiemap task finishes processing a file extent item that covers
   the file range [512K, 1M[, and that file extent item is the last item
   in the leaf currently being processed;

2) And ordered extent for the file range [768K, 2M[, in COW mode,
   completes (btrfs_finish_one_ordered()) and the file extent item
   covering the range [512K, 1M[ is trimmed to cover the range
   [512K, 768K[ and then a new file extent item for the range [768K, 2M[
   is inserted in the inode's subvolume tree;

3) The fiemap task calls fiemap_next_leaf_item(), which then calls
   btrfs_next_leaf() to find the next leaf / item. This finds that the
   the next key following the one we previously processed (its type is
   BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding
   to the new file extent item inserted by the ordered extent, which has
   a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;

4) Later the fiemap code ends up at emit_fiemap_extent() and triggers
   the warning:

      if (cache->offset + cache->len > offset) {
               WARN_ON(1);
               return -EINVAL;
      }

   Since we get 1M > 768K, because the previously emitted entry for the
   old extent covering the file range [512K, 1M[ ends at an offset that
   is greater than the new extent's start offset (768K). This makes fiemap
   fail with -EINVAL besides triggering the warning that produces a stack
   trace like the following:

     [ 1621.677651] ------------[ cut here ]------------
     [ 1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.677899] Modules linked in: btrfs blake2b_generic (...)
     [ 1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
     [ 1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
     [ 1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678033] Code: 2b 4c 89 63 (...)
     [ 1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
     [ 1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
     [ 1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
     [ 1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
     [ 1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
     [ 1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
     [ 1621.678044] FS:  00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
     [ 1621.678046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     [ 1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
     [ 1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     [ 1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     [ 1621.678056] Call Trace:
     [ 1621.678074]  <TASK>
     [ 1621.678076]  ? __warn+0x80/0x130
     [ 1621.678082]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678159]  ? report_bug+0x1f4/0x200
     [ 1621.678164]  ? handle_bug+0x42/0x70
     [ 1621.678167]  ? exc_invalid_op+0x14/0x70
     [ 1621.678170]  ? asm_exc_invalid_op+0x16/0x20
     [ 1621.678178]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
     [ 1621.678253]  extent_fiemap+0x766/0xa30 [btrfs]
     [ 1621.678339]  btrfs_fiemap+0x45/0x80 [btrfs]
     [ 1621.678420]  do_vfs_ioctl+0x1e4/0x870
     [ 1621.678431]  __x64_sys_ioctl+0x6a/0xc0
     [ 1621.678434]  do_syscall_64+0x52/0x120
     [ 1621.678445]  entry_SYSCALL_64_after_hwframe+0x6e/0x76

There's also another case where before calling btrfs_next_leaf() we are
processing a hole or a prealloc extent and we had several delalloc ranges
within that hole or prealloc extent. In that case if the ordered extents
complete before we find the next key, we may end up finding an extent item
with an offset smaller than (or equals to) the offset in cache->offset.

So fix this by changing emit_fiemap_extent() to address these three
scenarios like this:

1) For the first case, steps listed above, adjust the length of the
   previously cached extent so that it does not overlap with the current
   extent, emit the previous one and cache the current file extent item;

2) For the second case where he had a hole or prealloc extent with
   multiple delalloc ranges inside the hole or prealloc extent's range,
   and the current file extent item has an offset that matches the offset
   in the fiemap cache, just discard what we have in the fiemap cache and
   assign the current file extent item to the cache, since it's more up
   to date;

3) For the third case where he had a hole or prealloc extent with
   multiple delalloc ranges inside the hole or prealloc extent's range
   and the offset of the file extent item we just found is smaller than
   what we have in the cache, just skip the current file extent item
   if its range end at or behind the cached extent's end, because we may
   have emitted (to the fiemap user space buffer) delalloc ranges that
   overlap with the current file extent item's range. If the file extent
   item's range goes beyond the end offset of the cached extent, just
   emit the cached extent and cache a subrange of the file extent item,
   that goes from the end offset of the cached extent to the end offset
   of the file extent item.

Dealing with those cases in those ways makes everything consistent by
reflecting the current state of file extent items in the btree and
without emitting extents that have overlapping ranges (which would be
confusing and violating expectations).

This issue could be triggered often with test case generic/561, and was
also hit and reported by Wang Yugui.

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 103 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 96 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 43496a07ee42..4a839851905f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2478,6 +2478,7 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 				struct fiemap_cache *cache,
 				u64 offset, u64 phys, u64 len, u32 flags)
 {
+	u64 cache_end;
 	int ret = 0;
 
 	/* Set at the end of extent_fiemap(). */
@@ -2487,15 +2488,102 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 		goto assign;
 
 	/*
-	 * Sanity check, extent_fiemap() should have ensured that new
-	 * fiemap extent won't overlap with cached one.
-	 * Not recoverable.
+	 * When iterating the extents of the inode, at extent_fiemap(), we may
+	 * find an extent that starts at an offset behind the end offset of the
+	 * previous extent we processed. This happens if fiemap is called
+	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
+	 * while we call btrfs_next_leaf() (through fiemap_next_leaf_item()).
 	 *
-	 * NOTE: Physical address can overlap, due to compression
+	 * For example we are in leaf X processing its last item, which is the
+	 * file extent item for file range [512K, 1M[, and after
+	 * btrfs_next_leaf() releases the path, there's an ordered extent that
+	 * completes for the file range [768K, 2M[, and that results in trimming
+	 * the file extent item so that it now corresponds to the file range
+	 * [512K, 768K[ and a new file extent item is inserted for the file
+	 * range [768K, 2M[, which may end up as the last item of leaf X or as
+	 * the first item of the next leaf - in either case btrfs_next_leaf()
+	 * will leave us with a path pointing to the new extent item, for the
+	 * file range [768K, 2M[, since that's the first key that follows the
+	 * last one we processed. So in order not to report overlapping extents
+	 * to user space, we trim the length of the previously cached extent and
+	 * emit it.
+	 *
+	 * Upon calling btrfs_next_leaf() we may also find an extent with an
+	 * offset smaller than or equals to cache->offset, and this happens
+	 * when we had a hole or prealloc extent with several delalloc ranges in
+	 * it, but after btrfs_next_leaf() released the path, delalloc was
+	 * flushed and the resulting ordered extents were completed, so we can
+	 * now have found a file extent item for an offset that is smaller than
+	 * or equals to what we have in cache->offset. We deal with this as
+	 * described below.
 	 */
-	if (cache->offset + cache->len > offset) {
-		WARN_ON(1);
-		return -EINVAL;
+	cache_end = cache->offset + cache->len;
+	if (cache_end > offset) {
+		if (offset == cache->offset) {
+			/*
+			 * We cached a dealloc range (found in the io tree) for
+			 * a hole or prealloc extent and we have now found a
+			 * file extent item for the same offset. What we have
+			 * now is more recent and up to date, so discard what
+			 * we had in the cache and use what we have just found.
+			 */
+			goto assign;
+		} else if (offset > cache->offset) {
+			/*
+			 * The extent range we previously found ends after the
+			 * offset of the file extent item we found and that
+			 * offset falls somewhere in the middle of that previous
+			 * extent range. So adjust the range we previously found
+			 * to end at the offset of the file extent item we have
+			 * just found, since this extent is more up to date.
+			 * Emit that adjusted range and cache the file extent
+			 * item we have just found. This corresponds to the case
+			 * where a previously found file extent item was split
+			 * due to an ordered extent completing.
+			 */
+			cache->len = offset - cache->offset;
+			goto emit;
+		} else {
+			const u64 range_end = offset + len;
+
+			/*
+			 * The offset of the file extent item we have just found
+			 * is behind the cached offset. This means we were
+			 * processing a hole or prealloc extent for which we
+			 * have found delalloc ranges (in the io tree), so what
+			 * we have in the cache is the last delalloc range we
+			 * found while the file extent item we found can be
+			 * either for a whole delalloc range we previously
+			 * emmitted or only a part of that range.
+			 *
+			 * We have two cases here:
+			 *
+			 * 1) The file extent item's range ends at or behind the
+			 *    cached extent's end. In this case just ignore the
+			 *    current file extent item because we don't want to
+			 *    overlap with previous ranges that may have been
+			 *    emmitted already;
+			 *
+			 * 2) The file extent item starts behind the currently
+			 *    cached extent but its end offset goes beyond the
+			 *    end offset of the cached extent. We don't want to
+			 *    overlap with a previous range that may have been
+			 *    emmitted already, so we emit the currently cached
+			 *    extent and then partially store the current file
+			 *    extent item's range in the cache, for the subrange
+			 *    going the cached extent's end to the end of the
+			 *    file extent item.
+			 */
+			if (range_end <= cache_end)
+				return 0;
+
+			if (!(flags & (FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DELALLOC)))
+				phys += cache_end - offset;
+
+			offset = cache_end;
+			len = range_end - cache_end;
+			goto emit;
+		}
 	}
 
 	/*
@@ -2515,6 +2603,7 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 		return 0;
 	}
 
+emit:
 	/* Not mergeable, need to submit cached one */
 	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
 				      cache->len, cache->flags);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v4 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
  2024-02-25 19:51 ` [PATCH v4 0/2] btrfs: some fiemap fixes fdmanana
  2024-02-25 19:51   ` [PATCH v4 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
@ 2024-02-25 19:51   ` fdmanana
  2024-02-26 16:58   ` [PATCH v4 0/2] btrfs: some fiemap fixes Josef Bacik
  2 siblings, 0 replies; 13+ messages in thread
From: fdmanana @ 2024-02-25 19:51 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.

When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 21 ++++++++-------------
 fs/btrfs/inode.c     | 22 +++++++++++++++++++++-
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4a839851905f..86177d93eecc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2994,17 +2994,15 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	range_end = round_up(start + len, sectorsize);
 	prev_extent_end = range_start;
 
-	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-
 	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
 	if (ret < 0)
-		goto out_unlock;
+		goto out;
 	btrfs_release_path(path);
 
 	path->reada = READA_FORWARD;
 	ret = fiemap_search_slot(inode, path, range_start);
 	if (ret < 0) {
-		goto out_unlock;
+		goto out;
 	} else if (ret > 0) {
 		/*
 		 * No file extent item found, but we may have delalloc between
@@ -3051,7 +3049,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 						  backref_ctx, 0, 0, 0,
 						  prev_extent_end, hole_end);
 			if (ret < 0) {
-				goto out_unlock;
+				goto out;
 			} else if (ret > 0) {
 				/* fiemap_fill_next_extent() told us to stop. */
 				stopped = true;
@@ -3107,7 +3105,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 								  extent_gen,
 								  backref_ctx);
 				if (ret < 0)
-					goto out_unlock;
+					goto out;
 				else if (ret > 0)
 					flags |= FIEMAP_EXTENT_SHARED;
 			}
@@ -3118,7 +3116,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 
 		if (ret < 0) {
-			goto out_unlock;
+			goto out;
 		} else if (ret > 0) {
 			/* fiemap_fill_next_extent() told us to stop. */
 			stopped = true;
@@ -3129,12 +3127,12 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 next_item:
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-			goto out_unlock;
+			goto out;
 		}
 
 		ret = fiemap_next_leaf_item(inode, path);
 		if (ret < 0) {
-			goto out_unlock;
+			goto out;
 		} else if (ret > 0) {
 			/* No more file extent items for this inode. */
 			break;
@@ -3158,7 +3156,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 					  &delalloc_cached_state, backref_ctx,
 					  0, 0, 0, prev_extent_end, range_end - 1);
 		if (ret < 0)
-			goto out_unlock;
+			goto out;
 		prev_extent_end = range_end;
 	}
 
@@ -3196,9 +3194,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	}
 
 	ret = emit_last_fiemap_cache(fieinfo, &cache);
-
-out_unlock:
-	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 out:
 	free_extent_state(delalloc_cached_state);
 	btrfs_free_backref_share_ctx(backref_ctx);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d931cb40fb7a..904fff3d72f5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7865,6 +7865,7 @@ struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			u64 start, u64 len)
 {
+	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
 	int	ret;
 
 	ret = fiemap_prep(inode, fieinfo, start, &len, 0);
@@ -7890,7 +7891,26 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			return ret;
 	}
 
-	return extent_fiemap(BTRFS_I(inode), fieinfo, start, len);
+	btrfs_inode_lock(btrfs_inode, BTRFS_ILOCK_SHARED);
+
+	/*
+	 * We did an initial flush to avoid holding the inode's lock while
+	 * triggering writeback and waiting for the completion of IO and ordered
+	 * extents. Now after we locked the inode we do it again, because it's
+	 * possible a new write may have happened in between those two steps.
+	 */
+	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
+		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
+		if (ret) {
+			btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
+			return ret;
+		}
+	}
+
+	ret = extent_fiemap(btrfs_inode, fieinfo, start, len);
+	btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
+
+	return ret;
 }
 
 static int btrfs_writepages(struct address_space *mapping,
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v4 0/2] btrfs: some fiemap fixes
  2024-02-25 19:51 ` [PATCH v4 0/2] btrfs: some fiemap fixes fdmanana
  2024-02-25 19:51   ` [PATCH v4 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
  2024-02-25 19:51   ` [PATCH v4 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
@ 2024-02-26 16:58   ` Josef Bacik
  2 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2024-02-26 16:58 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Sun, Feb 25, 2024 at 07:51:23PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> There's a recent regression with fiemap due to a fix for a deadlock between
> fiemap and memory mapped writes when the fiemap buffer is memory mapped to
> the same file range, which leads to a race triggering a warning and making
> fiemap fail. Plus one more long standing race when using FIEMAP_FLAG_SYNC.
> Details in the change logs.
> 
> V4: Updated patch 1/2, added a lot more comments about that's going on,
>     how each case is dealt with and why, added a missing handling for
>     a delalloc case that could result in emmiting overlapping ranges.
> 
> V3: Deal with the case where offset == cache->offset which is also
>     possible if we had delalloc in the range of a hole or prealloc extent.
> 
> V2: Updated patch 1/2 to deal with the case of a hole/prealloc extent
>     with multiple delalloc ranges inside it.
> 
> Filipe Manana (2):
>   btrfs: fix race between ordered extent completion and fiemap
>   btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
> 

Eesh sorry about that, I must have missed the other test failure when I ran my
fix through CI.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-02-26 16:58 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-23  8:08 [PATCH 0/2] btrfs: some fiemap fixes fdmanana
2024-02-23  8:08 ` [PATCH 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
2024-02-23  8:08 ` [PATCH 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
2024-02-23 15:19 ` [PATCH v2 0/2] btrfs: some fiemap fixes fdmanana
2024-02-23 15:19   ` [PATCH v2 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
2024-02-23 15:19   ` [PATCH v2 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
2024-02-23 23:47 ` [PATCH v3 0/2] btrfs: some fiemap fixes fdmanana
2024-02-23 23:47   ` [PATCH v3 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
2024-02-23 23:47   ` [PATCH v3 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
2024-02-25 19:51 ` [PATCH v4 0/2] btrfs: some fiemap fixes fdmanana
2024-02-25 19:51   ` [PATCH v4 1/2] btrfs: fix race between ordered extent completion and fiemap fdmanana
2024-02-25 19:51   ` [PATCH v4 2/2] btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given fdmanana
2024-02-26 16:58   ` [PATCH v4 0/2] btrfs: some fiemap fixes Josef Bacik

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.